[v7,4/9] KVM: Support dirty ring in conjunction with bitmap

Message ID 20221031003621.164306-5-gshan@redhat.com (mailing list archive)
State New, archived
Series KVM: arm64: Enable ring-based dirty memory tracking

Commit Message

Gavin Shan Oct. 31, 2022, 12:36 a.m. UTC
ARM64 needs to dirty memory outside of a VCPU context when VGIC/ITS is
enabled. This conflicts with ring-based dirty page tracking, which
always requires a running VCPU context.

Introduce a new flavor of dirty ring that requires the use of both VCPU
dirty rings and a dirty bitmap. The expectation is that for non-VCPU
sources of dirty memory (such as the VGIC/ITS on arm64), KVM writes to
the dirty bitmap. Userspace should scan the dirty bitmap before migrating
the VM to the target.

Use an additional capability to advertise this behavior. The newly added
capability (KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP) can't be enabled before
KVM_CAP_DIRTY_LOG_RING_ACQ_REL on ARM64. In this way, the newly added
capability is treated as an extension of KVM_CAP_DIRTY_LOG_RING_ACQ_REL.
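
For illustration, the expected enabling order from userspace is roughly
the following sketch (vm_fd is a placeholder for an existing VM file
descriptor, the ring size is an arbitrary valid power-of-2 byte count,
and <linux/kvm.h>, <sys/ioctl.h>, <string.h> and <err.h> are assumed):

    struct kvm_enable_cap cap = { 0 };

    /* Enable the vcpu dirty rings first; args[0] is the size of each
     * per-vcpu ring in bytes.
     */
    cap.cap = KVM_CAP_DIRTY_LOG_RING_ACQ_REL;
    cap.args[0] = 65536;
    if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap))
        err(1, "KVM_ENABLE_CAP(DIRTY_LOG_RING_ACQ_REL)");

    /* Only then can the bitmap extension be enabled. */
    memset(&cap, 0, sizeof(cap));
    cap.cap = KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP;
    if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap))
        err(1, "KVM_ENABLE_CAP(DIRTY_LOG_RING_WITH_BITMAP)");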

Suggested-by: Marc Zyngier <maz@kernel.org>
Suggested-by: Peter Xu <peterx@redhat.com>
Co-developed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 Documentation/virt/kvm/api.rst | 31 ++++++++++++++++++++++++-------
 include/linux/kvm_dirty_ring.h |  6 ++++++
 include/linux/kvm_host.h       |  1 +
 include/uapi/linux/kvm.h       |  1 +
 virt/kvm/Kconfig               |  8 ++++++++
 virt/kvm/dirty_ring.c          |  5 +++++
 virt/kvm/kvm_main.c            | 31 ++++++++++++++++++++++---------
 7 files changed, 67 insertions(+), 16 deletions(-)

Comments

Peter Xu Nov. 3, 2022, 7:33 p.m. UTC | #1
On Mon, Oct 31, 2022 at 08:36:16AM +0800, Gavin Shan wrote:
> ARM64 needs to dirty memory outside of a VCPU context when VGIC/ITS is
> enabled. This conflicts with ring-based dirty page tracking, which
> always requires a running VCPU context.
> 
> Introduce a new flavor of dirty ring that requires the use of both VCPU
> dirty rings and a dirty bitmap. The expectation is that for non-VCPU
> sources of dirty memory (such as the VGIC/ITS on arm64), KVM writes to
> the dirty bitmap. Userspace should scan the dirty bitmap before migrating
> the VM to the target.
> 
> Use an additional capability to advertise this behavior. The newly added
> capability (KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP) can't be enabled before
> KVM_CAP_DIRTY_LOG_RING_ACQ_REL on ARM64. In this way, the newly added
> capability is treated as an extension of KVM_CAP_DIRTY_LOG_RING_ACQ_REL.
> 
> Suggested-by: Marc Zyngier <maz@kernel.org>
> Suggested-by: Peter Xu <peterx@redhat.com>
> Co-developed-by: Oliver Upton <oliver.upton@linux.dev>
> Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
> Signed-off-by: Gavin Shan <gshan@redhat.com>

Acked-by: Peter Xu <peterx@redhat.com>
Oliver Upton Nov. 3, 2022, 11:32 p.m. UTC | #2
On Mon, Oct 31, 2022 at 08:36:16AM +0800, Gavin Shan wrote:
> ARM64 needs to dirty memory outside of a VCPU context when VGIC/ITS is
> enabled. This conflicts with ring-based dirty page tracking, which
> always requires a running VCPU context.
> 
> Introduce a new flavor of dirty ring that requires the use of both VCPU
> dirty rings and a dirty bitmap. The expectation is that for non-VCPU
> sources of dirty memory (such as the VGIC/ITS on arm64), KVM writes to
> the dirty bitmap. Userspace should scan the dirty bitmap before migrating
> the VM to the target.
> 
> Use an additional capability to advertise this behavior. The newly added
> capability (KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP) can't be enabled before
> KVM_CAP_DIRTY_LOG_RING_ACQ_REL on ARM64. In this way, the newly added
> capability is treated as an extension of KVM_CAP_DIRTY_LOG_RING_ACQ_REL.

Whatever ordering requirements we settle on between these capabilities
need to be documented as well.

[...]

> @@ -4588,6 +4594,13 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>  			return -EINVAL;
>  
>  		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
> +	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
> +		if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
> +		    !kvm->dirty_ring_size)

I believe this ordering requirement is problematic, as it piles on top
of an existing problem w.r.t. KVM_CAP_DIRTY_LOG_RING v. memslot
creation.

Example:
 - Enable KVM_CAP_DIRTY_LOG_RING
 - Create some memslots w/ dirty logging enabled (note that the bitmap
   is _not_ allocated)
 - Enable KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
 - Save ITS tables and get a NULL dereference in
   mark_page_dirty_in_slot():

                if (vcpu && kvm->dirty_ring_size)
                        kvm_dirty_ring_push(&vcpu->dirty_ring,
                                            slot, rel_gfn);
                else
------->		set_bit_le(rel_gfn, memslot->dirty_bitmap);

Similarly, KVM may unnecessarily allocate bitmaps if dirty logging is
enabled on memslots before KVM_CAP_DIRTY_LOG_RING is enabled.

You could paper over this issue by disallowing DIRTY_RING_WITH_BITMAP if
DIRTY_LOG_RING has already been enabled, but the better approach would
be to explicitly check kvm_memslots_empty() such that the real
dependency is obvious. Peter, hadn't you mentioned something about
checking against memslots in an earlier revision?

--
Thanks,
Oliver
Gavin Shan Nov. 4, 2022, 12:12 a.m. UTC | #3
Hi Oliver,

On 11/4/22 7:32 AM, Oliver Upton wrote:
> On Mon, Oct 31, 2022 at 08:36:16AM +0800, Gavin Shan wrote:
>> ARM64 needs to dirty memory outside of a VCPU context when VGIC/ITS is
>> enabled. This conflicts with ring-based dirty page tracking, which
>> always requires a running VCPU context.
>>
>> Introduce a new flavor of dirty ring that requires the use of both VCPU
>> dirty rings and a dirty bitmap. The expectation is that for non-VCPU
>> sources of dirty memory (such as the VGIC/ITS on arm64), KVM writes to
>> the dirty bitmap. Userspace should scan the dirty bitmap before migrating
>> the VM to the target.
>>
>> Use an additional capability to advertise this behavior. The newly added
>> capability (KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP) can't be enabled before
>> KVM_CAP_DIRTY_LOG_RING_ACQ_REL on ARM64. In this way, the newly added
>> capability is treated as an extension of KVM_CAP_DIRTY_LOG_RING_ACQ_REL.
> 
> Whatever ordering requirements we settle on between these capabilities
> need to be documented as well.
> 
> [...]
> 

It's mentioned in 'Documentation/virt/kvm/api.rst' as below.

   After using the dirty rings, the userspace needs to detect the capability
   of KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP to see whether the ring structures
   need to be backed by per-slot bitmaps. With this capability advertised
   and supported, it means the architecture can dirty guest pages without
   vcpu/ring context, so that some of the dirty information will still be
   maintained in the bitmap structure.

The description may not make the ordering obvious. To address this, I can
add the following sentence at the end of the section.

   The capability of KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP can't be enabled
   until the capability of KVM_CAP_DIRTY_LOG_RING_ACQ_REL has been enabled.

>> @@ -4588,6 +4594,13 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>>   			return -EINVAL;
>>   
>>   		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
>> +	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
>> +		if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
>> +		    !kvm->dirty_ring_size)
> 
> I believe this ordering requirement is problematic, as it piles on top
> of an existing problem w.r.t. KVM_CAP_DIRTY_LOG_RING v. memslot
> creation.
> 
> Example:
>   - Enable KVM_CAP_DIRTY_LOG_RING
>   - Create some memslots w/ dirty logging enabled (note that the bitmap
>     is _not_ allocated)
>   - Enable KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
>   - Save ITS tables and get a NULL dereference in
>     mark_page_dirty_in_slot():
> 
>                  if (vcpu && kvm->dirty_ring_size)
>                          kvm_dirty_ring_push(&vcpu->dirty_ring,
>                                              slot, rel_gfn);
>                  else
> ------->		set_bit_le(rel_gfn, memslot->dirty_bitmap);
> 
> Similarly, KVM may unnecessarily allocate bitmaps if dirty logging is
> enabled on memslots before KVM_CAP_DIRTY_LOG_RING is enabled.
> 
> You could paper over this issue by disallowing DIRTY_RING_WITH_BITMAP if
> DIRTY_LOG_RING has already been enabled, but the better approach would
> be to explicitly check kvm_memslots_empty() such that the real
> dependency is obvious. Peter, hadn't you mentioned something about
> checking against memslots in an earlier revision?
> 

The userspace (QEMU) needs to ensure that no dirty bitmap is created
before the capability of KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP is enabled.
QEMU doesn't yet know whether vgic/its will be used at the point where
KVM_CAP_DIRTY_LOG_RING_ACQ_REL is enabled.

    kvm_initialization
      enable_KVM_CAP_DIRTY_LOG_RING_ACQ_REL        // Where KVM_CAP_DIRTY_LOG_RING is enabled
    board_initialization                           // Where QEMU knows if vgic/its is used
      add_memory_slots
    kvm_post_initialization
      enable_KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
    :
    start_migration
      enable_dirty_page_tracking
        create_dirty_bitmap                       // With KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP enabled

Thanks,
Gavin
Oliver Upton Nov. 4, 2022, 1:06 a.m. UTC | #4
On Fri, Nov 04, 2022 at 08:12:21AM +0800, Gavin Shan wrote:
> Hi Oliver,
> 
> On 11/4/22 7:32 AM, Oliver Upton wrote:
> > On Mon, Oct 31, 2022 at 08:36:16AM +0800, Gavin Shan wrote:
> > > ARM64 needs to dirty memory outside of a VCPU context when VGIC/ITS is
> > > enabled. This conflicts with ring-based dirty page tracking, which
> > > always requires a running VCPU context.
> > > 
> > > Introduce a new flavor of dirty ring that requires the use of both VCPU
> > > dirty rings and a dirty bitmap. The expectation is that for non-VCPU
> > > sources of dirty memory (such as the VGIC/ITS on arm64), KVM writes to
> > > the dirty bitmap. Userspace should scan the dirty bitmap before migrating
> > > the VM to the target.
> > > 
> > > Use an additional capability to advertise this behavior. The newly added
> > > capability (KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP) can't be enabled before
> > > KVM_CAP_DIRTY_LOG_RING_ACQ_REL on ARM64. In this way, the newly added
> > > capability is treated as an extension of KVM_CAP_DIRTY_LOG_RING_ACQ_REL.
> > 
> > Whatever ordering requirements we settle on between these capabilities
> > need to be documented as well.
> > 
> > [...]
> > 
> 
> It's mentioned in 'Documentation/virt/kvm/api.rst' as below.
> 
>   After using the dirty rings, the userspace needs to detect the capability
>   of KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP to see whether the ring structures
>   need to be backed by per-slot bitmaps. With this capability advertised
>   and supported, it means the architecture can dirty guest pages without
>   vcpu/ring context, so that some of the dirty information will still be
>   maintained in the bitmap structure.
> 
> The description may not make the ordering obvious. To address this, I can
> add the following sentence at the end of the section.
> 
>   The capability of KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP can't be enabled
>   until the capability of KVM_CAP_DIRTY_LOG_RING_ACQ_REL has been enabled.
> 
> > > @@ -4588,6 +4594,13 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
> > >   			return -EINVAL;
> > >   		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
> > > +	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
> > > +		if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
> > > +		    !kvm->dirty_ring_size)
> > 
> > I believe this ordering requirement is problematic, as it piles on top
> > of an existing problem w.r.t. KVM_CAP_DIRTY_LOG_RING v. memslot
> > creation.
> > 
> > Example:
> >   - Enable KVM_CAP_DIRTY_LOG_RING
> >   - Create some memslots w/ dirty logging enabled (note that the bitmap
> >     is _not_ allocated)
> >   - Enable KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
> >   - Save ITS tables and get a NULL dereference in
> >     mark_page_dirty_in_slot():
> > 
> >                  if (vcpu && kvm->dirty_ring_size)
> >                          kvm_dirty_ring_push(&vcpu->dirty_ring,
> >                                              slot, rel_gfn);
> >                  else
> > ------->		set_bit_le(rel_gfn, memslot->dirty_bitmap);
> > 
> > Similarly, KVM may unnecessarily allocate bitmaps if dirty logging is
> > enabled on memslots before KVM_CAP_DIRTY_LOG_RING is enabled.
> > 
> > You could paper over this issue by disallowing DIRTY_RING_WITH_BITMAP if
> > DIRTY_LOG_RING has already been enabled, but the better approach would
> > be to explicitly check kvm_memslots_empty() such that the real
> > dependency is obvious. Peter, hadn't you mentioned something about
> > checking against memslots in an earlier revision?
> > 
> 
> The userspace (QEMU) needs to ensure that no dirty bitmap is created
> before the capability of KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP is enabled.
> QEMU doesn't yet know whether vgic/its will be used at the point where
> KVM_CAP_DIRTY_LOG_RING_ACQ_REL is enabled.

I'm not worried about what QEMU (or any particular VMM for that matter)
does with the UAPI. The problem is this patch provides a trivial way for
userspace to cause a NULL dereference in the kernel. Imposing ordering
between the cap and memslot creation avoids the problem altogether.

So, looking at your example:

>    kvm_initialization
>      enable_KVM_CAP_DIRTY_LOG_RING_ACQ_REL        // Where KVM_CAP_DIRTY_LOG_RING is enabled
>    board_initialization                           // Where QEMU knows if vgic/its is used

Is it possible that QEMU could hoist enabling RING_WITH_BITMAP here?
Based on your description QEMU has decided to use the vGIC ITS but
hasn't yet added any memslots.

>      add_memory_slots
>    kvm_post_initialization
>      enable_KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
>    :
>    start_migration
>      enable_dirty_page_tracking
>        create_dirty_bitmap                       // With KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP enabled

Just to make sure we're on the same page, there's two issues:

 (1) If DIRTY_LOG_RING is enabled before memslot creation and
     RING_WITH_BITMAP is enabled after memslots have been created w/
     dirty logging enabled, memslot->dirty_bitmap == NULL and the
     kernel will fault when attempting to save the ITS tables.

 (2) Not your code, but a similar issue. If DIRTY_LOG_RING[_ACQ_REL] is
     enabled after memslots have been created w/ dirty logging enabled,
     memslot->dirty_bitmap != NULL and that memory is wasted until the
     memslot is freed.

I don't expect you to fix #2, though I've mentioned it because using the
same approach to #1 and #2 would be nice.

--
Thanks,
Oliver
Gavin Shan Nov. 4, 2022, 6:57 a.m. UTC | #5
Hi Oliver,

On 11/4/22 9:06 AM, Oliver Upton wrote:
> On Fri, Nov 04, 2022 at 08:12:21AM +0800, Gavin Shan wrote:
>> On 11/4/22 7:32 AM, Oliver Upton wrote:
>>> On Mon, Oct 31, 2022 at 08:36:16AM +0800, Gavin Shan wrote:
>>>> ARM64 needs to dirty memory outside of a VCPU context when VGIC/ITS is
>>>> enabled. This conflicts with ring-based dirty page tracking, which
>>>> always requires a running VCPU context.
>>>>
>>>> Introduce a new flavor of dirty ring that requires the use of both VCPU
>>>> dirty rings and a dirty bitmap. The expectation is that for non-VCPU
>>>> sources of dirty memory (such as the VGIC/ITS on arm64), KVM writes to
>>>> the dirty bitmap. Userspace should scan the dirty bitmap before migrating
>>>> the VM to the target.
>>>>
>>>> Use an additional capability to advertise this behavior. The newly added
>>>> capability (KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP) can't be enabled before
>>>> KVM_CAP_DIRTY_LOG_RING_ACQ_REL on ARM64. In this way, the newly added
>>>> capability is treated as an extension of KVM_CAP_DIRTY_LOG_RING_ACQ_REL.
>>>
>>> Whatever ordering requirements we settle on between these capabilities
>>> need to be documented as well.
>>>
>>> [...]
>>>
>>
>> It's mentioned in 'Documentation/virt/kvm/api.rst' as below.
>>
>>    After using the dirty rings, the userspace needs to detect the capability
>>    of KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP to see whether the ring structures
>>    need to be backed by per-slot bitmaps. With this capability advertised
>>    and supported, it means the architecture can dirty guest pages without
>>    vcpu/ring context, so that some of the dirty information will still be
>>    maintained in the bitmap structure.
>>
>> The description may not make the ordering obvious. To address this, I can
>> add the following sentence at the end of the section.
>>
>>    The capability of KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP can't be enabled
>>    until the capability of KVM_CAP_DIRTY_LOG_RING_ACQ_REL has been enabled.
>>
>>>> @@ -4588,6 +4594,13 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>>>>    			return -EINVAL;
>>>>    		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
>>>> +	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
>>>> +		if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
>>>> +		    !kvm->dirty_ring_size)
>>>
>>> I believe this ordering requirement is problematic, as it piles on top
>>> of an existing problem w.r.t. KVM_CAP_DIRTY_LOG_RING v. memslot
>>> creation.
>>>
>>> Example:
>>>    - Enable KVM_CAP_DIRTY_LOG_RING
>>>    - Create some memslots w/ dirty logging enabled (note that the bitmap
>>>      is _not_ allocated)
>>>    - Enable KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
>>>    - Save ITS tables and get a NULL dereference in
>>>      mark_page_dirty_in_slot():
>>>
>>>                   if (vcpu && kvm->dirty_ring_size)
>>>                           kvm_dirty_ring_push(&vcpu->dirty_ring,
>>>                                               slot, rel_gfn);
>>>                   else
>>> ------->		set_bit_le(rel_gfn, memslot->dirty_bitmap);
>>>
>>> Similarly, KVM may unnecessarily allocate bitmaps if dirty logging is
>>> enabled on memslots before KVM_CAP_DIRTY_LOG_RING is enabled.
>>>
>>> You could paper over this issue by disallowing DIRTY_RING_WITH_BITMAP if
>>> DIRTY_LOG_RING has already been enabled, but the better approach would
>>> be to explicitly check kvm_memslots_empty() such that the real
>>> dependency is obvious. Peter, hadn't you mentioned something about
>>> checking against memslots in an earlier revision?
>>>
>>
>> The userspace (QEMU) needs to ensure that no dirty bitmap is created
>> before the capability of KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP is enabled.
>> QEMU doesn't yet know whether vgic/its will be used at the point where
>> KVM_CAP_DIRTY_LOG_RING_ACQ_REL is enabled.
> 
> I'm not worried about what QEMU (or any particular VMM for that matter)
> does with the UAPI. The problem is this patch provides a trivial way for
> userspace to cause a NULL dereference in the kernel. Imposing ordering
> between the cap and memslot creation avoids the problem altogether.
> 
> So, looking at your example:
> 
>>     kvm_initialization
>>       enable_KVM_CAP_DIRTY_LOG_RING_ACQ_REL        // Where KVM_CAP_DIRTY_LOG_RING is enabled
>>     board_initialization                           // Where QEMU knows if vgic/its is used
> 
> Is it possible that QEMU could hoist enabling RING_WITH_BITMAP here?
> Based on your description QEMU has decided to use the vGIC ITS but
> hasn't yet added any memslots.
> 

It's possible to add an ARM-specific helper, kvm_arm_enable_dirty_ring_with_bitmap(),
in qemu/target/arm.c to enable RING_WITH_BITMAP if needed. The newly added
function can be called here when vgic/its is used.
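
A rough sketch of what I have in mind, assuming QEMU's existing
kvm_check_extension() and kvm_vm_enable_cap() wrappers (the helper
name and its exact location are tentative):

   /* Tentative helper, called from board init once vgic/its is known */
   int kvm_arm_enable_dirty_ring_with_bitmap(KVMState *s)
   {
       /*
        * Must run after KVM_CAP_DIRTY_LOG_RING[_ACQ_REL] has been
        * enabled and before any memory slot is added.
        */
       if (!kvm_check_extension(s, KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP)) {
           return -ENOSYS;
       }

       return kvm_vm_enable_cap(s, KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP, 0);
   }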

>>       add_memory_slots
>>     kvm_post_initialization
>>       enable_KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
>>     :
>>     start_migration
>>       enable_dirty_page_tracking
>>         create_dirty_bitmap                       // With KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP enabled
> 
> Just to make sure we're on the same page, there's two issues:
> 
>   (1) If DIRTY_LOG_RING is enabled before memslot creation and
>       RING_WITH_BITMAP is enabled after memslots have been created w/
>       dirty logging enabled, memslot->dirty_bitmap == NULL and the
>       kernel will fault when attempting to save the ITS tables.
> 
>   (2) Not your code, but a similar issue. If DIRTY_LOG_RING[_ACQ_REL] is
>       enabled after memslots have been created w/ dirty logging enabled,
>       memslot->dirty_bitmap != NULL and that memory is wasted until the
>       memslot is freed.
> 
> I don't expect you to fix #2, though I've mentioned it because using the
> same approach to #1 and #2 would be nice.
> 

Yes, I got your points. Case (2) can still happen for userspace other
than QEMU. However, QEMU always enables DIRTY_LOG_RING[_ACQ_REL] before
any memory slot is created. I agree that we need to ensure there are no
memory slots when DIRTY_LOG_RING[_ACQ_REL] is enabled.

For case (1), we can ensure RING_WITH_BITMAP is enabled before any memory
slot is added, as below. QEMU needs a new helper (as above) to enable it
at the board level.

Let's fix both with a new helper in PATCH[v8 4/9], like below?

   static inline bool kvm_vm_has_memslot_pages(struct kvm *kvm)
   {
       bool has_memslot_pages;

       mutex_lock(&kvm->slots_lock);

       has_memslot_pages = !!kvm->nr_memslot_pages;

       mutex_unlock(&kvm->slots_lock);

       return has_memslot_pages;
   }

   static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
                                              struct kvm_enable_cap *cap)
   {
       :
       switch (cap->cap) {
       case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
           if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
               !kvm->dirty_ring_size || kvm_vm_has_memslot_pages(kvm))
               return -EINVAL;

           kvm->dirty_ring_with_bitmap = true;

           return 0;
       default:
           return kvm_vm_ioctl_enable_cap(kvm, cap);
       }
   }

   static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
   {
       :
       /* We only allow it to set once */
       if (kvm->dirty_ring_size)
           return -EINVAL;

       if (kvm_vm_has_memslot_pages(kvm))
           return -EINVAL;
       :
   }

Thanks,
Gavin
Oliver Upton Nov. 4, 2022, 8:12 p.m. UTC | #6
On Fri, Nov 04, 2022 at 02:57:15PM +0800, Gavin Shan wrote:
> On 11/4/22 9:06 AM, Oliver Upton wrote:

[...]

> > Just to make sure we're on the same page, there's two issues:
> > 
> >   (1) If DIRTY_LOG_RING is enabled before memslot creation and
> >       RING_WITH_BITMAP is enabled after memslots have been created w/
> >       dirty logging enabled, memslot->dirty_bitmap == NULL and the
> >       kernel will fault when attempting to save the ITS tables.
> > 
> >   (2) Not your code, but a similar issue. If DIRTY_LOG_RING[_ACQ_REL] is
> >       enabled after memslots have been created w/ dirty logging enabled,
> >       memslot->dirty_bitmap != NULL and that memory is wasted until the
> >       memslot is freed.
> > 
> > I don't expect you to fix #2, though I've mentioned it because using the
> > same approach to #1 and #2 would be nice.
> > 
> 
> Yes, I got your points. Case (2) can still happen for userspace other
> than QEMU. However, QEMU always enables DIRTY_LOG_RING[_ACQ_REL] before
> any memory slot is created. I agree that we need to ensure there are no
> memory slots when DIRTY_LOG_RING[_ACQ_REL] is enabled.
> 
> For case (1), we can ensure RING_WITH_BITMAP is enabled before any memory
> slot is added, as below. QEMU needs a new helper (as above) to enable it
> at the board level.
> 
> Let's fix both with a new helper in PATCH[v8 4/9], like below?

I agree that we should address (1) like this, but in (2) requiring that
no memslots were created before enabling the existing capabilities would
be a change in ABI. If we can get away with that, great, but otherwise
we may need to delete the bitmaps associated with all memslots when the
cap is enabled.

>   static inline bool kvm_vm_has_memslot_pages(struct kvm *kvm)
>   {
>       bool has_memslot_pages;
> 
>       mutex_lock(&kvm->slots_lock);
> 
>       has_memslot_pages = !!kvm->nr_memslot_pages;
> 
>       mutex_unlock(&kvm->slots_lock);
> 
>       return has_memslot_pages;
>   }

Do we need to build another helper for this? kvm_memslots_empty() will
tell you whether or not a memslot has been created by checking the gfn
tree.

On top of that, the memslot check and setting
kvm->dirty_ring_with_bitmap must happen behind the slots_lock. Otherwise
you could still wind up creating memslots w/o bitmaps.


Something like:


diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 91cf51a25394..420cc101a16e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4588,6 +4588,32 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
 			return -EINVAL;
 
 		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
+
+	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP: {
+		struct kvm_memslots *slots;
+		int r = -EINVAL;
+
+		if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
+		    !kvm->dirty_ring_size)
+			return r;
+
+		mutex_lock(&kvm->slots_lock);
+
+		slots = kvm_memslots(kvm);
+
+		/*
+		 * Avoid a race between memslot creation and enabling the ring +
+		 * bitmap capability to guarantee that no memslots have been
+		 * created without a bitmap.
+		 */
+		if (kvm_memslots_empty(slots)) {
+			kvm->dirty_ring_with_bitmap = cap->args[0];
+			r = 0;
+		}
+
+		mutex_unlock(&kvm->slots_lock);
+		return r;
+	}
 	default:
 		return kvm_vm_ioctl_enable_cap(kvm, cap);
 	}

--
Thanks,
Oliver
Gavin Shan Nov. 4, 2022, 9:57 p.m. UTC | #7
Hi Oliver,

On 11/5/22 4:12 AM, Oliver Upton wrote:
> On Fri, Nov 04, 2022 at 02:57:15PM +0800, Gavin Shan wrote:
>> On 11/4/22 9:06 AM, Oliver Upton wrote:
> 
> [...]
> 
>>> Just to make sure we're on the same page, there's two issues:
>>>
>>>    (1) If DIRTY_LOG_RING is enabled before memslot creation and
>>>        RING_WITH_BITMAP is enabled after memslots have been created w/
>>>        dirty logging enabled, memslot->dirty_bitmap == NULL and the
>>>        kernel will fault when attempting to save the ITS tables.
>>>
>>>    (2) Not your code, but a similar issue. If DIRTY_LOG_RING[_ACQ_REL] is
>>>        enabled after memslots have been created w/ dirty logging enabled,
>>>        memslot->dirty_bitmap != NULL and that memory is wasted until the
>>>        memslot is freed.
>>>
>>> I don't expect you to fix #2, though I've mentioned it because using the
>>> same approach to #1 and #2 would be nice.
>>>
>>
>> Yes, I got your points. Case (2) can still happen for userspace other
>> than QEMU. However, QEMU always enables DIRTY_LOG_RING[_ACQ_REL] before
>> any memory slot is created. I agree that we need to ensure there are no
>> memory slots when DIRTY_LOG_RING[_ACQ_REL] is enabled.
>>
>> For case (1), we can ensure RING_WITH_BITMAP is enabled before any memory
>> slot is added, as below. QEMU needs a new helper (as above) to enable it
>> at the board level.
>>
>> Let's fix both with a new helper in PATCH[v8 4/9], like below?
> 
> I agree that we should address (1) like this, but in (2) requiring that
> no memslots were created before enabling the existing capabilities would
> be a change in ABI. If we can get away with that, great, but otherwise
> we may need to delete the bitmaps associated with all memslots when the
> cap is enabled.
> 

I had assumed that QEMU and kvm/selftests are the only consumers of
DIRTY_RING. In that case, requiring that no memslots exist when
DIRTY_RING is enabled won't break userspace. Following your thoughts,
the dirty pages tracked in the bitmap would also need to be synchronized
to the per-vcpu rings before the bitmap can be destroyed, and there are
no per-vcpu rings at this stage.

>>    static inline bool kvm_vm_has_memslot_pages(struct kvm *kvm)
>>    {
>>        bool has_memslot_pages;
>>
>>        mutex_lock(&kvm->slots_lock);
>>
>>        has_memslot_pages = !!kvm->nr_memslot_pages;
>>
>>        mutex_unlock(&kvm->slots_lock);
>>
>>        return has_memslot_pages;
>>    }
> 
> Do we need to build another helper for this? kvm_memslots_empty() will
> tell you whether or not a memslot has been created by checking the gfn
> tree.
> 

The helper was introduced to be shared by the paths that enable
DIRTY_RING[_ACQ_REL] and DIRTY_RING_WITH_BITMAP. Since issue (2) isn't
a concern for us, let's put it aside; the helper isn't needed.
kvm_memslots_empty() has the same effect as checking
'kvm->nr_memslot_pages', so it's fine to use kvm_memslots_empty(),
which is more generic.
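
For reference, kvm_memslots_empty() is roughly the following trivial
check of the gfn tree in include/linux/kvm_host.h:

   static inline bool kvm_memslots_empty(struct kvm_memslots *slots)
   {
           return RB_EMPTY_ROOT(&slots->gfn_tree);
   }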

> On top of that, the memslot check and setting
> kvm->dirty_ring_with_bitmap must happen behind the slots_lock. Otherwise
> you could still wind up creating memslots w/o bitmaps.
> 

Agree.

> 
> Something like:
> 
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 91cf51a25394..420cc101a16e 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4588,6 +4588,32 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>   			return -EINVAL;
>   
>   		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
> +
> +	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP: {
> +		struct kvm_memslots *slots;
> +		int r = -EINVAL;
> +
> +		if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
> +		    !kvm->dirty_ring_size)
> +			return r;
> +
> +		mutex_lock(&kvm->slots_lock);
> +
> +		slots = kvm_memslots(kvm);
> +
> +		/*
> +		 * Avoid a race between memslot creation and enabling the ring +
> +		 * bitmap capability to guarantee that no memslots have been
> +		 * created without a bitmap.
> +		 */
> +		if (kvm_memslots_empty(slots)) {
> +			kvm->dirty_ring_with_bitmap = cap->args[0];
> +			r = 0;
> +		}
> +
> +		mutex_unlock(&kvm->slots_lock);
> +		return r;
> +	}
>   	default:
>   		return kvm_vm_ioctl_enable_cap(kvm, cap);
>   	}
> 

The proposed changes look good to me. They will be integrated into
PATCH[v8 4/9]. By the way, v8 will be posted shortly.

Thanks,
Gavin
Oliver Upton Nov. 4, 2022, 10:23 p.m. UTC | #8
On Sat, Nov 05, 2022 at 05:57:33AM +0800, Gavin Shan wrote:
> On 11/5/22 4:12 AM, Oliver Upton wrote:
> > I agree that we should address (1) like this, but in (2) requiring that
> > no memslots were created before enabling the existing capabilities would
> > be a change in ABI. If we can get away with that, great, but otherwise
> > we may need to delete the bitmaps associated with all memslots when the
> > cap is enabled.
> > 
> 
> I had assumed that QEMU and kvm/selftests are the only consumers of
> DIRTY_RING. In that case, requiring that no memslots exist when
> DIRTY_RING is enabled won't break userspace.
> Following your thoughts, the dirty pages tracked in the bitmap would
> also need to be synchronized to the per-vcpu rings before the bitmap
> can be destroyed.

Eh, I don't think we'd need to go that far. No matter what, any dirty
bits that were present in the bitmap could never be read again anyway,
as we reject KVM_GET_DIRTY_LOG if kvm->dirty_ring_size != 0.
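
For context, that's the pre-existing guard in kvm_get_dirty_log(),
i.e. the same lines this patch relaxes below:

    /* Dirty ring tracking is exclusive to dirty log tracking */
    if (kvm->dirty_ring_size)
        return -ENXIO;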

> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 91cf51a25394..420cc101a16e 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -4588,6 +4588,32 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
> >   			return -EINVAL;
> >   		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
> > +
> > +	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP: {
> > +		struct kvm_memslots *slots;
> > +		int r = -EINVAL;
> > +
> > +		if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
> > +		    !kvm->dirty_ring_size)
> > +			return r;
> > +
> > +		mutex_lock(&kvm->slots_lock);
> > +
> > +		slots = kvm_memslots(kvm);
> > +
> > +		/*
> > +		 * Avoid a race between memslot creation and enabling the ring +
> > +		 * bitmap capability to guarantee that no memslots have been
> > +		 * created without a bitmap.
> > +		 */
> > +		if (kvm_memslots_empty(slots)) {
> > +			kvm->dirty_ring_with_bitmap = cap->args[0];
> > +			r = 0;
> > +		}
> > +
> > +		mutex_unlock(&kvm->slots_lock);
> > +		return r;
> > +	}
> >   	default:
> >   		return kvm_vm_ioctl_enable_cap(kvm, cap);
> >   	}
> > 
> 
> The proposed changes look good to me. They will be integrated into
> PATCH[v8 4/9]. By the way, v8 will be posted shortly.

Excellent, thanks!

--
Best,
Oliver

Patch

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index eee9f857a986..4d4eeb5c3c5a 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8003,13 +8003,6 @@  flushing is done by the KVM_GET_DIRTY_LOG ioctl).  To achieve that, one
 needs to kick the vcpu out of KVM_RUN using a signal.  The resulting
 vmexit ensures that all dirty GFNs are flushed to the dirty rings.
 
-NOTE: the capability KVM_CAP_DIRTY_LOG_RING and the corresponding
-ioctl KVM_RESET_DIRTY_RINGS are mutual exclusive to the existing ioctls
-KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG.  After enabling
-KVM_CAP_DIRTY_LOG_RING with an acceptable dirty ring size, the virtual
-machine will switch to ring-buffer dirty page tracking and further
-KVM_GET_DIRTY_LOG or KVM_CLEAR_DIRTY_LOG ioctls will fail.
-
 NOTE: KVM_CAP_DIRTY_LOG_RING_ACQ_REL is the only capability that
 should be exposed by weakly ordered architecture, in order to indicate
 the additional memory ordering requirements imposed on userspace when
@@ -8018,6 +8011,30 @@  Architecture with TSO-like ordering (such as x86) are allowed to
 expose both KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL
 to userspace.
 
+After using the dirty rings, the userspace needs to detect the capability
+of KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP to see whether the ring structures
+need to be backed by per-slot bitmaps. With this capability advertised
+and supported, it means the architecture can dirty guest pages without
+vcpu/ring context, so that some of the dirty information will still be
+maintained in the bitmap structure.
+
+Note that the bitmap here is only a backup of the ring structure, and
+normally should only contain a very small amount of dirty pages, which
+needs to be transferred during VM downtime. Collecting the dirty bitmap
+should be the very last thing that the VMM does before transmitting state
+to the target VM. VMM needs to ensure that the dirty state is final and
+avoid missing dirty pages from another ioctl ordered after the bitmap
+collection.
+
+To collect dirty bits in the backup bitmap, the userspace can use the
+same KVM_GET_DIRTY_LOG ioctl. KVM_CLEAR_DIRTY_LOG shouldn't be needed
+and its behavior is undefined since collecting the dirty bitmap always
+happens in the last phase of VM's migration.
+
+NOTE: One example of using the backup bitmap is saving arm64 vgic/its
+tables through KVM_DEV_ARM_{VGIC_GRP_CTRL, ITS_SAVE_TABLES} command on
+KVM device "kvm-arm-vgic-its" during VM's migration.
+
 8.30 KVM_CAP_XEN_HVM
 --------------------
 
diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
index 04290eda0852..b08b9afd8bdb 100644
--- a/include/linux/kvm_dirty_ring.h
+++ b/include/linux/kvm_dirty_ring.h
@@ -37,6 +37,11 @@  static inline u32 kvm_dirty_ring_get_rsvd_entries(void)
 	return 0;
 }
 
+static inline bool kvm_use_dirty_bitmap(struct kvm *kvm)
+{
+	return true;
+}
+
 static inline int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring,
 				       int index, u32 size)
 {
@@ -66,6 +71,7 @@  static inline void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
 #else /* CONFIG_HAVE_KVM_DIRTY_RING */
 
 int kvm_cpu_dirty_log_size(void);
+bool kvm_use_dirty_bitmap(struct kvm *kvm);
 u32 kvm_dirty_ring_get_rsvd_entries(void);
 int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size);
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 648d663f32c4..db83f63f4e61 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -779,6 +779,7 @@  struct kvm {
 	pid_t userspace_pid;
 	unsigned int max_halt_poll_ns;
 	u32 dirty_ring_size;
+	bool dirty_ring_with_bitmap;
 	bool vm_bugged;
 	bool vm_dead;
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 0d5d4419139a..c87b5882d7ae 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1178,6 +1178,7 @@  struct kvm_ppc_resize_hpt {
 #define KVM_CAP_S390_ZPCI_OP 221
 #define KVM_CAP_S390_CPU_TOPOLOGY 222
 #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
+#define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 224
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 800f9470e36b..228be1145cf3 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -33,6 +33,14 @@  config HAVE_KVM_DIRTY_RING_ACQ_REL
        bool
        select HAVE_KVM_DIRTY_RING
 
+# Only architectures that need to dirty memory outside of a vCPU
+# context should select this, advertising to userspace the
+# requirement to use a dirty bitmap in addition to the vCPU dirty
+# ring.
+config HAVE_KVM_DIRTY_RING_WITH_BITMAP
+	bool
+	depends on HAVE_KVM_DIRTY_RING
+
 config HAVE_KVM_EVENTFD
        bool
        select EVENTFD
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index 6091e1403bc8..7ce6a5f81c98 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -21,6 +21,11 @@  u32 kvm_dirty_ring_get_rsvd_entries(void)
 	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
 }
 
+bool kvm_use_dirty_bitmap(struct kvm *kvm)
+{
+	return !kvm->dirty_ring_size || kvm->dirty_ring_with_bitmap;
+}
+
 static u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
 {
 	return READ_ONCE(ring->dirty_index) - READ_ONCE(ring->reset_index);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 91cf51a25394..0351c8fb41b9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1617,7 +1617,7 @@  static int kvm_prepare_memory_region(struct kvm *kvm,
 			new->dirty_bitmap = NULL;
 		else if (old && old->dirty_bitmap)
 			new->dirty_bitmap = old->dirty_bitmap;
-		else if (!kvm->dirty_ring_size) {
+		else if (kvm_use_dirty_bitmap(kvm)) {
 			r = kvm_alloc_dirty_bitmap(new);
 			if (r)
 				return r;
@@ -2060,8 +2060,8 @@  int kvm_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log,
 	unsigned long n;
 	unsigned long any = 0;
 
-	/* Dirty ring tracking is exclusive to dirty log tracking */
-	if (kvm->dirty_ring_size)
+	/* Dirty ring tracking may be exclusive to dirty log tracking */
+	if (!kvm_use_dirty_bitmap(kvm))
 		return -ENXIO;
 
 	*memslot = NULL;
@@ -2125,8 +2125,8 @@  static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
 	unsigned long *dirty_bitmap_buffer;
 	bool flush;
 
-	/* Dirty ring tracking is exclusive to dirty log tracking */
-	if (kvm->dirty_ring_size)
+	/* Dirty ring tracking may be exclusive to dirty log tracking */
+	if (!kvm_use_dirty_bitmap(kvm))
 		return -ENXIO;
 
 	as_id = log->slot >> 16;
@@ -2237,8 +2237,8 @@  static int kvm_clear_dirty_log_protect(struct kvm *kvm,
 	unsigned long *dirty_bitmap_buffer;
 	bool flush;
 
-	/* Dirty ring tracking is exclusive to dirty log tracking */
-	if (kvm->dirty_ring_size)
+	/* Dirty ring tracking may be exclusive to dirty log tracking */
+	if (!kvm_use_dirty_bitmap(kvm))
 		return -ENXIO;
 
 	as_id = log->slot >> 16;
@@ -3305,7 +3305,10 @@  void mark_page_dirty_in_slot(struct kvm *kvm,
 	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
 #ifdef CONFIG_HAVE_KVM_DIRTY_RING
-	if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm))
+	if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm))
+		return;
+
+	if (WARN_ON_ONCE(!kvm->dirty_ring_with_bitmap && !vcpu))
 		return;
 #endif
 
@@ -3313,7 +3316,7 @@  void mark_page_dirty_in_slot(struct kvm *kvm,
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
 		u32 slot = (memslot->as_id << 16) | memslot->id;
 
-		if (kvm->dirty_ring_size)
+		if (kvm->dirty_ring_size && vcpu)
 			kvm_dirty_ring_push(vcpu, slot, rel_gfn);
 		else
 			set_bit_le(rel_gfn, memslot->dirty_bitmap);
@@ -4482,6 +4485,9 @@  static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 		return KVM_DIRTY_RING_MAX_ENTRIES * sizeof(struct kvm_dirty_gfn);
 #else
 		return 0;
+#endif
+#ifdef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
+	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
 #endif
 	case KVM_CAP_BINARY_STATS_FD:
 	case KVM_CAP_SYSTEM_EVENT_DATA:
@@ -4588,6 +4594,13 @@  static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
 			return -EINVAL;
 
 		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
+	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
+		if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
+		    !kvm->dirty_ring_size)
+			return -EINVAL;
+
+		kvm->dirty_ring_with_bitmap = true;
+		return 0;
 	default:
 		return kvm_vm_ioctl_enable_cap(kvm, cap);
 	}
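
As a usage note for the api.rst hunk above, collecting the backup
bitmap in the final phase of migration could look roughly like the
sketch below (slot_id, npages and vm_fd are placeholders; assumes
<linux/kvm.h>, <sys/ioctl.h>, <stdlib.h> and <err.h>):

    /*
     * Last step before transmitting state to the target: all vcpus
     * have stopped and the vgic/its tables have been saved, so the
     * backup bitmap now holds the final dirty state.
     */
    struct kvm_dirty_log log = { .slot = slot_id };
    unsigned long nbytes = ((npages + 63) / 64) * 8;

    log.dirty_bitmap = calloc(1, nbytes);
    if (!log.dirty_bitmap || ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log))
        err(1, "KVM_GET_DIRTY_LOG");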