[v5,3/7] KVM: x86: Allow to use bitmap in ring-based dirty page tracking

Message ID 20221005004154.83502-4-gshan@redhat.com (mailing list archive)
State New, archived
Series KVM: arm64: Enable ring-based dirty memory tracking

Commit Message

Gavin Shan Oct. 5, 2022, 12:41 a.m. UTC
In some cases, pages become dirty when there is no running vcpu and no
per-vcpu dirty ring is available. One example is saving arm64's vgic/its
tables during migration. As a result, we lose track of those dirty pages.

Fix the issue by reusing the dirty bitmap to track those pages. The bitmap
is accessed by the KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG ioctls.
KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP is used to advertise the new capability.

Suggested-by: Marc Zyngier <maz@kernel.org>
Suggested-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 Documentation/virt/kvm/api.rst | 17 ++++++++---------
 include/uapi/linux/kvm.h       |  1 +
 virt/kvm/kvm_main.c            | 24 +++++++++---------------
 3 files changed, 18 insertions(+), 24 deletions(-)

Comments

Peter Xu Oct. 6, 2022, 8:28 p.m. UTC | #1
On Wed, Oct 05, 2022 at 08:41:50AM +0800, Gavin Shan wrote:
> -8.29 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL
> -----------------------------------------------------------
> +8.29 KVM_CAP_DIRTY_LOG_{RING, RING_ACQ_REL, RING_ALLOW_BITMAP}
> +--------------------------------------------------------------

Shall we make it a standalone cap, just to rely on DIRTY_RING[_ACQ_REL]
being enabled first, instead of making the three caps at the same level?

E.g. we can skip creating bitmap for DIRTY_RING[_ACQ_REL] && !_ALLOW_BITMAP
(x86).
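
For instance, kvm_prepare_memory_region() could then keep skipping the
allocation with something like this (a sketch, assuming the
'dirty_ring_allow_bitmap' field discussed below):

	else if (old && old->dirty_bitmap)
		new->dirty_bitmap = old->dirty_bitmap;
	/* allocate the bitmap unless the ring alone is in use */
	else if (!kvm->dirty_ring_size || kvm->dirty_ring_allow_bitmap) {
		r = kvm_alloc_dirty_bitmap(new);
		if (r)
			return r;
	}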

> @@ -2060,10 +2060,6 @@ int kvm_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log,
>  	unsigned long n;
>  	unsigned long any = 0;
>  
> -	/* Dirty ring tracking is exclusive to dirty log tracking */
> -	if (kvm->dirty_ring_size)
> -		return -ENXIO;

Then we can also have one dirty_ring_exclusive(), with something like:

bool dirty_ring_exclusive(struct kvm *kvm)
{
        return kvm->dirty_ring_size && !kvm->dirty_ring_allow_bitmap;
}
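
The checks removed above (in kvm_get_dirty_log() and friends) could then be
kept in this form (sketch):

	/* Dirty ring tracking is exclusive to dirty log tracking */
	if (dirty_ring_exclusive(kvm))
		return -ENXIO;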

Does it make sense?  Thanks,
Gavin Shan Oct. 6, 2022, 11:38 p.m. UTC | #2
Hi Peter,

On 10/7/22 4:28 AM, Peter Xu wrote:
> On Wed, Oct 05, 2022 at 08:41:50AM +0800, Gavin Shan wrote:
>> -8.29 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL
>> -----------------------------------------------------------
>> +8.29 KVM_CAP_DIRTY_LOG_{RING, RING_ACQ_REL, RING_ALLOW_BITMAP}
>> +--------------------------------------------------------------
> 
> Shall we make it a standalone cap, just to rely on DIRTY_RING[_ACQ_REL]
> being enabled first, instead of making the three caps at the same level?
> 
> E.g. we can skip creating bitmap for DIRTY_RING[_ACQ_REL] && !_ALLOW_BITMAP
> (x86).
> 

Both KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL are available
to x86. So KVM_CAP_DIRTY_LOG_RING_ACQ_REL can be enabled on x86 in theory.
However, the idea of disallowing the bitmap for KVM_CAP_DIRTY_LOG_RING (x86)
makes sense to me. I think you may be suggesting something like below.

- bool struct kvm::dirty_ring_allow_bitmap

- In kvm_vm_ioctl_enable_dirty_log_ring(), set 'dirty_ring_allow_bitmap' to
   true when the capability is KVM_CAP_DIRTY_LOG_RING_ACQ_REL

   static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 cap, u32 size)
   {
     :
     mutex_lock(&kvm->lock);

     if (kvm->created_vcpus) {
        /* We don't allow to change this value after vcpu created */
        r = -EINVAL;
     } else {
        kvm->dirty_ring_size = size;
        kvm->dirty_ring_allow_bitmap = (cap == KVM_CAP_DIRTY_LOG_RING_ACQ_REL);
        r = 0;
     }

     mutex_unlock(&kvm->lock);
     return r;
   }
   
- In kvm_vm_ioctl_check_extension_generic(), KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP
   is always false until KVM_CAP_DIRTY_LOG_RING_ACQ_REL is enabled.

   static long kvm_vm_ioctl_check_extension_generic(...)
   {
     :
     case KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP:
         return kvm->dirty_ring_allow_bitmap ? 1 : 0;
   }

- The suggested dirty_ring_exclusive() is used.

>> @@ -2060,10 +2060,6 @@ int kvm_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log,
>>   	unsigned long n;
>>   	unsigned long any = 0;
>>   
>> -	/* Dirty ring tracking is exclusive to dirty log tracking */
>> -	if (kvm->dirty_ring_size)
>> -		return -ENXIO;
> 
> Then we can also have one dirty_ring_exclusive(), with something like:
> 
> bool dirty_ring_exclusive(struct kvm *kvm)
> {
>          return kvm->dirty_ring_size && !kvm->dirty_ring_allow_bitmap;
> }
> 
> Does it make sense?  Thanks,
> 

Thanks,
Gavin
Peter Xu Oct. 7, 2022, 2:31 p.m. UTC | #3
On Fri, Oct 07, 2022 at 07:38:19AM +0800, Gavin Shan wrote:
> Hi Peter,

Hi, Gavin,

> 
> On 10/7/22 4:28 AM, Peter Xu wrote:
> > On Wed, Oct 05, 2022 at 08:41:50AM +0800, Gavin Shan wrote:
> > > -8.29 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL
> > > -----------------------------------------------------------
> > > +8.29 KVM_CAP_DIRTY_LOG_{RING, RING_ACQ_REL, RING_ALLOW_BITMAP}
> > > +--------------------------------------------------------------
> > 
> > Shall we make it a standalone cap, just to rely on DIRTY_RING[_ACQ_REL]
> > being enabled first, instead of making the three caps at the same level?
> > 
> > E.g. we can skip creating bitmap for DIRTY_RING[_ACQ_REL] && !_ALLOW_BITMAP
> > (x86).
> > 
> 
> Both KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL are available
> to x86. So KVM_CAP_DIRTY_LOG_RING_ACQ_REL can be enabled on x86 in theory.
> However, the idea of disallowing the bitmap for KVM_CAP_DIRTY_LOG_RING (x86)
> makes sense to me. I think you may be suggesting something like below.
> 
> - bool struct kvm::dirty_ring_allow_bitmap
> 
> - In kvm_vm_ioctl_enable_dirty_log_ring(), set 'dirty_ring_allow_bitmap' to
>   true when the capability is KVM_CAP_DIRTY_LOG_RING_ACQ_REL

What I wanted to do is to decouple ACQ_REL from ALLOW_BITMAP, so mostly
as what you suggested, except..

> 
>   static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 cap, u32 size)
>   {
>     :
>     mutex_lock(&kvm->lock);
> 
>     if (kvm->created_vcpus) {
>        /* We don't allow to change this value after vcpu created */
>        r = -EINVAL;
>     } else {
>        kvm->dirty_ring_size = size;

.. here I'd not set dirty_ring_allow_bitmap at all so I'd drop below line,
instead..

>        kvm->dirty_ring_allow_bitmap = (cap == KVM_CAP_DIRTY_LOG_RING_ACQ_REL);
>        r = 0;
>     }
> 
>     mutex_unlock(&kvm->lock);
>     return r;
>   }
> - In kvm_vm_ioctl_check_extension_generic(), KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP
>   is always false until KVM_CAP_DIRTY_LOG_RING_ACQ_REL is enabled.
> 
>   static long kvm_vm_ioctl_check_extension_generic(...)
>   {
>     :
>     case KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP:
>         return kvm->dirty_ring_allow_bitmap ? 1 : 0;

... here we always return 1, OTOH in kvm_vm_ioctl_enable_cap_generic():

      case KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP:
           if (kvm->dirty_ring_size)
                return -EINVAL;
           kvm->dirty_ring_allow_bitmap = true;
           return 0;

A side effect of checking dirty_ring_size is that we'll then be sure no vcpu
has been created either.  Maybe we should also check that no memslot has been
created, to make sure the bitmaps are not created.

Then if the userspace wants to use the bitmap altogether with the ring, it
needs to first detect KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP and enable it
before it enables KVM_CAP_DIRTY_LOG_RING.
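
In userspace that ordering could look like this (a sketch; error handling
omitted, and ring_size is whatever the VMM picked):

	struct kvm_enable_cap cap = { 0 };

	if (ioctl(vm_fd, KVM_CHECK_EXTENSION,
		  KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP) > 0) {
		cap.cap = KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP;
		ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
	}

	cap.cap = KVM_CAP_DIRTY_LOG_RING_ACQ_REL;
	cap.args[0] = ring_size;
	ioctl(vm_fd, KVM_ENABLE_CAP, &cap);

	/* only create vcpus and memslots after both caps are enabled */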

One trick on ALLOW_BITMAP is in mark_page_dirty_in_slot() - after we allow
!vcpu case we'll need to make sure it won't accidentally try to set bitmap
for !ALLOW_BITMAP, because in that case the bitmap pointer is NULL so
set_bit_le() will directly crash the kernel.

We could keep the old flavor of having a WARN_ON_ONCE(!vcpu &&
!ALLOW_BITMAP) then return, but since now the userspace can easily trigger
this (e.g. on ARM, a malicious userapp can have DIRTY_RING &&
!ALLOW_BITMAP, then it can simply trigger the gic ioctl to trigger host
warning), I think the better approach is we can kill the process in that
case.  Not sure whether there's anything better we can do.
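
For instance, mark_page_dirty_in_slot() could do something like this (a
sketch; kvm_vm_bugged() marks the VM dead so that all further ioctls fail,
which is one way of "killing" the offender without a host warning):

	if (!vcpu && !kvm->dirty_ring_allow_bitmap) {
		/* userspace-triggerable, so don't WARN here */
		kvm_vm_bugged(kvm);
		return;
	}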

>   }
> 
> - The suggested dirty_ring_exclusive() is used.
> 
> > > @@ -2060,10 +2060,6 @@ int kvm_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log,
> > >   	unsigned long n;
> > >   	unsigned long any = 0;
> > > -	/* Dirty ring tracking is exclusive to dirty log tracking */
> > > -	if (kvm->dirty_ring_size)
> > > -		return -ENXIO;
> > 
> > Then we can also have one dirty_ring_exclusive(), with something like:
> > 
> > bool dirty_ring_exclusive(struct kvm *kvm)
> > {
> >          return kvm->dirty_ring_size && !kvm->dirty_ring_allow_bitmap;
> > }
> > 
> > Does it make sense?  Thanks,
> > 
> 
> Thanks,
> Gavin
>
Oliver Upton Oct. 10, 2022, 11:18 p.m. UTC | #4
On Fri, Oct 07, 2022 at 10:31:49AM -0400, Peter Xu wrote:

[...]

> > - In kvm_vm_ioctl_enable_dirty_log_ring(), set 'dirty_ring_allow_bitmap' to
> >   true when the capability is KVM_CAP_DIRTY_LOG_RING_ACQ_REL
> 
> What I wanted to do is to decouple ACQ_REL from ALLOW_BITMAP, so mostly
> as what you suggested, except..

+1

> > 
> >   static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 cap, u32 size)
> >   {
> >     :
> >     mutex_lock(&kvm->lock);
> > 
> >     if (kvm->created_vcpus) {
> >        /* We don't allow to change this value after vcpu created */
> >        r = -EINVAL;
> >     } else {
> >        kvm->dirty_ring_size = size;
> 
> .. here I'd not set dirty_ring_allow_bitmap at all so I'd drop below line,
> instead..
> 
> >        kvm->dirty_ring_allow_bitmap = (cap == KVM_CAP_DIRTY_LOG_RING_ACQ_REL);
> >        r = 0;
> >     }
> > 
> >     mutex_unlock(&kvm->lock);
> >     return r;
> >   }
> > - In kvm_vm_ioctl_check_extension_generic(), KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP
> >   is always false until KVM_CAP_DIRTY_LOG_RING_ACQ_REL is enabled.
> > 
> >   static long kvm_vm_ioctl_check_extension_generic(...)
> >   {
> >     :
> >     case KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP:
> >         return kvm->dirty_ring_allow_bitmap ? 1 : 0;
> 
> ... here we always return 1, OTOH in kvm_vm_ioctl_enable_cap_generic():
> 
>       case KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP:
>            if (kvm->dirty_ring_size)
>                 return -EINVAL;
>            kvm->dirty_ring_allow_bitmap = true;
>            return 0;
> 
> A side effect of checking dirty_ring_size is that we'll then be sure no
> vcpu has been created either.  Maybe we should also check that no memslot
> has been created, to make sure the bitmaps are not created.

I'm not sure I follow... What prevents userspace from creating a vCPU
between enabling the two caps?

> Then if the userspace wants to use the bitmap altogether with the ring, it
> needs to first detect KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP and enable it
> before it enables KVM_CAP_DIRTY_LOG_RING.
> 
> One trick on ALLOW_BITMAP is in mark_page_dirty_in_slot() - after we allow
> !vcpu case we'll need to make sure it won't accidentally try to set bitmap
> for !ALLOW_BITMAP, because in that case the bitmap pointer is NULL so
> set_bit_le() will directly crash the kernel.
> 
> We could keep the old flavor of having a WARN_ON_ONCE(!vcpu &&
> !ALLOW_BITMAP) then return, but since now the userspace can easily trigger
> this (e.g. on ARM, a malicious userapp can have DIRTY_RING &&
> !ALLOW_BITMAP, then it can simply trigger the gic ioctl to trigger host
> warning), I think the better approach is we can kill the process in that
> case.  Not sure whether there's anything better we can do.

I don't believe !ALLOW_BITMAP && DIRTY_RING is a valid configuration for
arm64 given the fact that we'll dirty memory outside of a vCPU context.

Could ALLOW_BITMAP be a requirement of DIRTY_RING, thereby making
userspace fail fast? Otherwise (at least on arm64) your VM is DOA on the
target. With that the old WARN() could be preserved, as you suggest. On
top of that there would no longer be a need to test for memslot creation
when userspace attempts to enable KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP.
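
E.g. kvm_vm_ioctl_enable_dirty_log_ring() could fail fast with something
like this (a sketch; the config symbol is illustrative, anticipating the
arch opt-in below):

	if (IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_ALLOW_BITMAP) &&
	    !kvm->dirty_ring_allow_bitmap)
		return -EINVAL;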

--
Thanks,
Oliver
Oliver Upton Oct. 10, 2022, 11:43 p.m. UTC | #5
On Mon, Oct 10, 2022 at 11:18:55PM +0000, Oliver Upton wrote:
> On Fri, Oct 07, 2022 at 10:31:49AM -0400, Peter Xu wrote:
> 
> [...]
> 
> > > - In kvm_vm_ioctl_enable_dirty_log_ring(), set 'dirty_ring_allow_bitmap' to
> > >   true when the capability is KVM_CAP_DIRTY_LOG_RING_ACQ_REL
> > 
> > What I wanted to do is to decouple ACQ_REL from ALLOW_BITMAP, so mostly
> > as what you suggested, except..
> 
> +1
> 
> > > 
> > >   static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 cap, u32 size)
> > >   {
> > >     :
> > >     mutex_lock(&kvm->lock);
> > > 
> > >     if (kvm->created_vcpus) {
> > >        /* We don't allow to change this value after vcpu created */
> > >        r = -EINVAL;
> > >     } else {
> > >        kvm->dirty_ring_size = size;
> > 
> > .. here I'd not set dirty_ring_allow_bitmap at all so I'd drop below line,
> > instead..
> > 
> > >        kvm->dirty_ring_allow_bitmap = (cap == KVM_CAP_DIRTY_LOG_RING_ACQ_REL);
> > >        r = 0;
> > >     }
> > > 
> > >     mutex_unlock(&kvm->lock);
> > >     return r;
> > >   }
> > > - In kvm_vm_ioctl_check_extension_generic(), KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP
> > >   is always false until KVM_CAP_DIRTY_LOG_RING_ACQ_REL is enabled.
> > > 
> > >   static long kvm_vm_ioctl_check_extension_generic(...)
> > >   {
> > >     :
> > >     case KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP:
> > >         return kvm->dirty_ring_allow_bitmap ? 1 : 0;
> > 
> > ... here we always return 1, OTOH in kvm_vm_ioctl_enable_cap_generic():
> > 
> >       case KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP:
> >            if (kvm->dirty_ring_size)
> >                 return -EINVAL;
> >            kvm->dirty_ring_allow_bitmap = true;
> >            return 0;
> > 
> > A side effect of checking dirty_ring_size is that we'll then be sure no
> > vcpu has been created either.  Maybe we should also check that no memslot
> > has been created, to make sure the bitmaps are not created.
> 
> I'm not sure I follow... What prevents userspace from creating a vCPU
> between enabling the two caps?
> 
> > Then if the userspace wants to use the bitmap altogether with the ring, it
> > needs to first detect KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP and enable it
> > before it enables KVM_CAP_DIRTY_LOG_RING.
> > 
> > One trick on ALLOW_BITMAP is in mark_page_dirty_in_slot() - after we allow
> > !vcpu case we'll need to make sure it won't accidentally try to set bitmap
> > for !ALLOW_BITMAP, because in that case the bitmap pointer is NULL so
> > set_bit_le() will directly crash the kernel.
> > 
> > We could keep the old flavor of having a WARN_ON_ONCE(!vcpu &&
> > !ALLOW_BITMAP) then return, but since now the userspace can easily trigger
> > this (e.g. on ARM, a malicious userapp can have DIRTY_RING &&
> > !ALLOW_BITMAP, then it can simply trigger the gic ioctl to trigger host
> > warning), I think the better approach is we can kill the process in that
> > case.  Not sure whether there's anything better we can do.
> 
> I don't believe !ALLOW_BITMAP && DIRTY_RING is a valid configuration for
> arm64 given the fact that we'll dirty memory outside of a vCPU context.
> 
> Could ALLOW_BITMAP be a requirement of DIRTY_RING, thereby making
> userspace fail fast? Otherwise (at least on arm64) your VM is DOA on the
> target. With that the old WARN() could be preserved, as you suggest. On
> top of that there would no longer be a need to test for memslot creation
> when userspace attempts to enable KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP.

Just to be explicit...

I don't believe ALLOW_BITMAP needs to be generally advertised on
architectures that select DIRTY_RING. Instead, architectures (just arm64
right now) should select ALLOW_BITMAP if they need to dirty memory
outside of a vCPU.

When ALLOW_BITMAP is selected, KVM_CAP_DIRTY_LOG_RING[_ACQ_REL] has the
additional restriction that KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP has been
enabled first.
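
In Kconfig terms that could look something like this (a sketch; the symbol
name is illustrative):

	config HAVE_KVM_DIRTY_RING_ALLOW_BITMAP
	       bool

with arm64 adding 'select HAVE_KVM_DIRTY_RING_ALLOW_BITMAP' in
arch/arm64/kvm/Kconfig.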

--
Thanks,
Oliver
Peter Xu Oct. 10, 2022, 11:49 p.m. UTC | #6
On Mon, Oct 10, 2022 at 11:18:55PM +0000, Oliver Upton wrote:
> On Fri, Oct 07, 2022 at 10:31:49AM -0400, Peter Xu wrote:
> 
> [...]
> 
> > > - In kvm_vm_ioctl_enable_dirty_log_ring(), set 'dirty_ring_allow_bitmap' to
> > >   true when the capability is KVM_CAP_DIRTY_LOG_RING_ACQ_REL
> > 
> > What I wanted to do is to decouple ACQ_REL from ALLOW_BITMAP, so mostly
> > as what you suggested, except..
> 
> +1
> 
> > > 
> > >   static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 cap, u32 size)
> > >   {
> > >     :
> > >     mutex_lock(&kvm->lock);
> > > 
> > >     if (kvm->created_vcpus) {
> > >        /* We don't allow to change this value after vcpu created */
> > >        r = -EINVAL;
> > >     } else {
> > >        kvm->dirty_ring_size = size;
> > 
> > .. here I'd not set dirty_ring_allow_bitmap at all so I'd drop below line,
> > instead..
> > 
> > >        kvm->dirty_ring_allow_bitmap = (cap == KVM_CAP_DIRTY_LOG_RING_ACQ_REL);
> > >        r = 0;
> > >     }
> > > 
> > >     mutex_unlock(&kvm->lock);
> > >     return r;
> > >   }
> > > - In kvm_vm_ioctl_check_extension_generic(), KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP
> > >   is always false until KVM_CAP_DIRTY_LOG_RING_ACQ_REL is enabled.
> > > 
> > >   static long kvm_vm_ioctl_check_extension_generic(...)
> > >   {
> > >     :
> > >     case KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP:
> > >         return kvm->dirty_ring_allow_bitmap ? 1 : 0;
> > 
> > ... here we always return 1, OTOH in kvm_vm_ioctl_enable_cap_generic():
> > 
> >       case KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP:
> >            if (kvm->dirty_ring_size)
> >                 return -EINVAL;
> >            kvm->dirty_ring_allow_bitmap = true;
> >            return 0;
> > 
> > A side effect of checking dirty_ring_size is that we'll then be sure no
> > vcpu has been created either.  Maybe we should also check that no memslot
> > has been created, to make sure the bitmaps are not created.
> 
> I'm not sure I follow... What prevents userspace from creating a vCPU
> between enabling the two caps?

Enabling the dirty ring requires that no vcpu has been created, so as to make
sure all the vcpus will have the ring structures allocated as long as the
ring is enabled for the vm.  Done in kvm_vm_ioctl_enable_dirty_log_ring():

	if (kvm->created_vcpus) {
		/* We don't allow to change this value after vcpu created */
		r = -EINVAL;
	} else {
		kvm->dirty_ring_size = size;
		r = 0;
	}

Then if KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP checks dirty_ring_size first, we
ensure that both ALLOW_BITMAP and DIRTY_RING must be configured before any
vcpu creation.

> 
> > Then if the userspace wants to use the bitmap altogether with the ring, it
> > needs to first detect KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP and enable it
> > before it enables KVM_CAP_DIRTY_LOG_RING.
> > 
> > One trick on ALLOW_BITMAP is in mark_page_dirty_in_slot() - after we allow
> > !vcpu case we'll need to make sure it won't accidentally try to set bitmap
> > for !ALLOW_BITMAP, because in that case the bitmap pointer is NULL so
> > set_bit_le() will directly crash the kernel.
> > 
> > We could keep the old flavor of having a WARN_ON_ONCE(!vcpu &&
> > !ALLOW_BITMAP) then return, but since now the userspace can easily trigger
> > this (e.g. on ARM, a malicious userapp can have DIRTY_RING &&
> > !ALLOW_BITMAP, then it can simply trigger the gic ioctl to trigger host
> > warning), I think the better approach is we can kill the process in that
> > case.  Not sure whether there's anything better we can do.
> 
> I don't believe !ALLOW_BITMAP && DIRTY_RING is a valid configuration for
> arm64 given the fact that we'll dirty memory outside of a vCPU context.

Yes it's not, but after Gavin's current series it'll be possible, IOW a
malicious app can leverage this to trigger a host warning, which is IMHO not
wanted.

> 
> Could ALLOW_BITMAP be a requirement of DIRTY_RING, thereby making
> userspace fail fast? Otherwise (at least on arm64) your VM is DOA on the
> target. With that the old WARN() could be preserved, as you suggest.

It's just that x86 doesn't need the bitmap, so it'll be a pure waste there
otherwise.  It's not only about the memory that will be wasted (that's
guest mem size / 32k), but also the sync() process for x86 will be all
zeros and totally meaningless - note that the sync() of bitmap will be part
of VM downtime in this case (we need to sync() after turning VM off), so it
will make x86 downtime larger but without any benefit.

> On
> top of that there would no longer be a need to test for memslot creation
> when userspace attempts to enable KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP.
Gavin Shan Oct. 10, 2022, 11:58 p.m. UTC | #7
Hi Peter and Oliver,

On 10/11/22 7:49 AM, Peter Xu wrote:
> On Mon, Oct 10, 2022 at 11:18:55PM +0000, Oliver Upton wrote:
>> On Fri, Oct 07, 2022 at 10:31:49AM -0400, Peter Xu wrote:
>>
>> [...]
>>
>>>> - In kvm_vm_ioctl_enable_dirty_log_ring(), set 'dirty_ring_allow_bitmap' to
>>>>    true when the capability is KVM_CAP_DIRTY_LOG_RING_ACQ_REL
>>>
>>> What I wanted to do is to decouple ACQ_REL from ALLOW_BITMAP, so mostly
>>> as what you suggested, except..
>>
>> +1
>>

Agreed.

[...]

>>
>>> Then if the userspace wants to use the bitmap altogether with the ring, it
>>> needs to first detect KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP and enable it
>>> before it enables KVM_CAP_DIRTY_LOG_RING.
>>>
>>> One trick on ALLOW_BITMAP is in mark_page_dirty_in_slot() - after we allow
>>> !vcpu case we'll need to make sure it won't accidentally try to set bitmap
>>> for !ALLOW_BITMAP, because in that case the bitmap pointer is NULL so
>>> set_bit_le() will directly crash the kernel.
>>>
>>> We could keep the old flavor of having a WARN_ON_ONCE(!vcpu &&
>>> !ALLOW_BITMAP) then return, but since now the userspace can easily trigger
>>> this (e.g. on ARM, a malicious userapp can have DIRTY_RING &&
>>> !ALLOW_BITMAP, then it can simply trigger the gic ioctl to trigger host
>>> warning), I think the better approach is we can kill the process in that
>>> case.  Not sure whether there's anything better we can do.
>>
>> I don't believe !ALLOW_BITMAP && DIRTY_RING is a valid configuration for
>> arm64 given the fact that we'll dirty memory outside of a vCPU context.
> 
> Yes it's not, but after Gavin's current series it'll be possible, IOW a
> malicious app can leverage this to trigger a host warning, which is IMHO not
> wanted.
> 
>>
>> Could ALLOW_BITMAP be a requirement of DIRTY_RING, thereby making
>> userspace fail fast? Otherwise (at least on arm64) your VM is DOA on the
>> target. With that the old WARN() could be preserved, as you suggest.
> 
> It's just that x86 doesn't need the bitmap, so it'll be a pure waste there
> otherwise.  It's not only about the memory that will be wasted (that's
> guest mem size / 32k), but also the sync() process for x86 will be all
> zeros and totally meaningless - note that the sync() of bitmap will be part
> of VM downtime in this case (we need to sync() after turning VM off), so it
> will make x86 downtime larger but without any benefit.
> 

Besides, old QEMU won't work if ALLOW_BITMAP is required to enable DIRTY_RING,
if I'm correct.

>> On
>> top of that there would no longer be a need to test for memslot creation
>> when userspace attempts to enable KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP.
> 

Thanks,
Gavin
Oliver Upton Oct. 10, 2022, 11:58 p.m. UTC | #8
On Mon, Oct 10, 2022 at 07:49:15PM -0400, Peter Xu wrote:
> On Mon, Oct 10, 2022 at 11:18:55PM +0000, Oliver Upton wrote:
> > On Fri, Oct 07, 2022 at 10:31:49AM -0400, Peter Xu wrote:
> > 
> > [...]
> > 
> > > > - In kvm_vm_ioctl_enable_dirty_log_ring(), set 'dirty_ring_allow_bitmap' to
> > > >   true when the capability is KVM_CAP_DIRTY_LOG_RING_ACQ_REL
> > > 
> > > What I wanted to do is to decouple ACQ_REL from ALLOW_BITMAP, so mostly
> > > as what you suggested, except..
> > 
> > +1
> > 
> > > > 
> > > >   static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 cap, u32 size)
> > > >   {
> > > >     :
> > > >     mutex_lock(&kvm->lock);
> > > > 
> > > >     if (kvm->created_vcpus) {
> > > >        /* We don't allow to change this value after vcpu created */
> > > >        r = -EINVAL;
> > > >     } else {
> > > >        kvm->dirty_ring_size = size;
> > > 
> > > .. here I'd not set dirty_ring_allow_bitmap at all so I'd drop below line,
> > > instead..
> > > 
> > > >        kvm->dirty_ring_allow_bitmap = (cap == KVM_CAP_DIRTY_LOG_RING_ACQ_REL);
> > > >        r = 0;
> > > >     }
> > > > 
> > > >     mutex_unlock(&kvm->lock);
> > > >     return r;
> > > >   }
> > > > - In kvm_vm_ioctl_check_extension_generic(), KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP
> > > >   is always false until KVM_CAP_DIRTY_LOG_RING_ACQ_REL is enabled.
> > > > 
> > > >   static long kvm_vm_ioctl_check_extension_generic(...)
> > > >   {
> > > >     :
> > > >     case KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP:
> > > >         return kvm->dirty_ring_allow_bitmap ? 1 : 0;
> > > 
> > > ... here we always return 1, OTOH in kvm_vm_ioctl_enable_cap_generic():
> > > 
> > >       case KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP:
> > >            if (kvm->dirty_ring_size)
> > >                 return -EINVAL;
> > >            kvm->dirty_ring_allow_bitmap = true;
> > >            return 0;
> > > 
> > > A side effect of checking dirty_ring_size is that we'll then be sure no
> > > vcpu has been created either.  Maybe we should also check that no memslot
> > > has been created, to make sure the bitmaps are not created.
> > 
> > I'm not sure I follow... What prevents userspace from creating a vCPU
> > between enabling the two caps?
> 
> Enabling the dirty ring requires that no vcpu has been created, so as to
> make sure all the vcpus will have the ring structures allocated as long as
> the ring is enabled for the vm.  Done in kvm_vm_ioctl_enable_dirty_log_ring():
> 
> 	if (kvm->created_vcpus) {
> 		/* We don't allow to change this value after vcpu created */
> 		r = -EINVAL;
> 	} else {
> 		kvm->dirty_ring_size = size;
> 		r = 0;
> 	}
> 
> Then if KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP checks dirty_ring_size first,
> we ensure that both ALLOW_BITMAP and DIRTY_RING must be configured before
> any vcpu creation.

Ah, right. Sorry, I had the 'if' condition inverted in my head.

> > 
> > > Then if the userspace wants to use the bitmap altogether with the ring, it
> > > needs to first detect KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP and enable it
> > > before it enables KVM_CAP_DIRTY_LOG_RING.
> > > 
> > > One trick on ALLOW_BITMAP is in mark_page_dirty_in_slot() - after we allow
> > > !vcpu case we'll need to make sure it won't accidentally try to set bitmap
> > > for !ALLOW_BITMAP, because in that case the bitmap pointer is NULL so
> > > set_bit_le() will directly crash the kernel.
> > > 
> > > We could keep the old flavor of having a WARN_ON_ONCE(!vcpu &&
> > > !ALLOW_BITMAP) then return, but since now the userspace can easily trigger
> > > this (e.g. on ARM, a malicious userapp can have DIRTY_RING &&
> > > !ALLOW_BITMAP, then it can simply trigger the gic ioctl to trigger host
> > > warning), I think the better approach is we can kill the process in that
> > > case.  Not sure whether there's anything better we can do.
> > 
> > I don't believe !ALLOW_BITMAP && DIRTY_RING is a valid configuration for
> > arm64 given the fact that we'll dirty memory outside of a vCPU context.
> 
> Yes it's not, but after Gavin's current series it'll be possible, IOW a
> malicious app can leverage this to trigger a host warning, which is IMHO not
> wanted.
> 
> > 
> > Could ALLOW_BITMAP be a requirement of DIRTY_RING, thereby making
> > userspace fail fast? Otherwise (at least on arm64) your VM is DOA on the
> > target. With that the old WARN() could be preserved, as you suggest.
> 
> It's just that x86 doesn't need the bitmap, so it'll be a pure waste there
> otherwise.  It's not only about the memory that will be wasted (that's
> guest mem size / 32k), but also the sync() process for x86 will be all
> zeros and totally meaningless - note that the sync() of bitmap will be part
> of VM downtime in this case (we need to sync() after turning VM off), so it
> will make x86 downtime larger but without any benefit.

Ah, my follow-up [1] missed by just a few minutes :)

I think this further drives the point home -- there's zero need for the
bitmap with dirty ring on x86, so why even support it? The proposal of
ALLOW_BITMAP && DIRTY_RING should be arm64-specific. Any other arch that
needs to dirty memory outside of a vCPU context can opt-in to the
behavior.

[1]: https://lore.kernel.org/kvm/Y0SuJee3oWL2QCqM@google.com/

--
Thanks,
Oliver
Peter Xu Oct. 11, 2022, 12:20 a.m. UTC | #9
On Mon, Oct 10, 2022 at 11:58:22PM +0000, Oliver Upton wrote:
> I think this further drives the point home -- there's zero need for the
> bitmap with dirty ring on x86, so why even support it? The proposal of
> ALLOW_BITMAP && DIRTY_RING should be arm64-specific. Any other arch that
> needs to dirty memory outside of a vCPU context can opt-in to the
> behavior.

Yeah that sounds workable too, but it'll be slightly hackish as then the
user app will need some "#ifdef ARM64" blocks for e.g. sync dirty bitmap.
With the new cap the user app can implement the whole ring with generic
code.

Also more flexible to expose it as generic cap? E.g., one day x86 can
enable this too for whatever reason (even though I don't think so..).
Oliver Upton Oct. 11, 2022, 1:12 a.m. UTC | #10
On Mon, Oct 10, 2022 at 08:20:29PM -0400, Peter Xu wrote:
> On Mon, Oct 10, 2022 at 11:58:22PM +0000, Oliver Upton wrote:
> > I think this further drives the point home -- there's zero need for the
> > bitmap with dirty ring on x86, so why even support it? The proposal of
> > ALLOW_BITMAP && DIRTY_RING should be arm64-specific. Any other arch that
> > needs to dirty memory outside of a vCPU context can opt-in to the
> > behavior.
> 
> Yeah that sounds working too, but it'll be slightly hackish as then the
> user app will need some "#ifdef ARM64" blocks for e.g. sync dirty bitmap.
> With the new cap the user app can implement the whole ring with generic
> code.

Isn't the current route of exposing ALLOW_BITMAP on other arches for no
reason headed in exactly that direction? Userspace would need to know if
it _really_ needs the dirty bitmap in addition to the dirty ring, which
could take the form of architecture ifdeffery.

OTOH, if the cap is only exposed when it is absolutely necessary, an
arch-generic live migration implementation could enable the cap whenever
it is advertised and scan the bitmap accordingly.
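
I.e. an arch-generic VMM loop could be (a sketch; the helper names here are
made up):

	bool need_bitmap =
		kvm_check_extension(vm, KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP);

	/* each migration iteration */
	harvest_dirty_rings(vm);
	if (need_bitmap)
		/* KVM_GET_DIRTY_LOG / KVM_CLEAR_DIRTY_LOG under the hood */
		sync_dirty_bitmap(vm);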

The VMM must know something about the architecture it is running on, as
it calls KVM_DEV_ARM_ITS_SAVE_TABLES after all...

> Also more flexible to expose it as generic cap? E.g., one day x86 can
> enable this too for whatever reason (even though I don't think so..).

I had imagined something like this patch where the arch opts-in to some
generic construct if it *requires* the use of both the ring and bitmap
(very rough sketch).

--
Thanks,
Oliver
Gavin Shan Oct. 11, 2022, 3:56 a.m. UTC | #11
On 10/11/22 9:12 AM, Oliver Upton wrote:
> On Mon, Oct 10, 2022 at 08:20:29PM -0400, Peter Xu wrote:
>> On Mon, Oct 10, 2022 at 11:58:22PM +0000, Oliver Upton wrote:
>>> I think this further drives the point home -- there's zero need for the
>>> bitmap with dirty ring on x86, so why even support it? The proposal of
>>> ALLOW_BITMAP && DIRTY_RING should be arm64-specific. Any other arch that
>>> needs to dirty memory outside of a vCPU context can opt-in to the
>>> behavior.
>>
>> Yeah that sounds workable too, but it'll be slightly hackish as then the
>> user app will need some "#ifdef ARM64" blocks for e.g. sync dirty bitmap.
>> With the new cap the user app can implement the whole ring with generic
>> code.
> 
> Isn't the current route of exposing ALLOW_BITMAP on other arches for no
> reason headed in exactly that direction? Userspace would need to know if
> it _really_ needs the dirty bitmap in addition to the dirty ring, which
> could take the form of architecture ifdeffery.
> 
> OTOH, if the cap is only exposed when it is absolutely necessary, an
> arch-generic live migration implementation could enable the cap whenever
>> it is advertised and scan the bitmap accordingly.
> 
> The VMM must know something about the architecture it is running on, as
> it calls KVM_DEV_ARM_ITS_SAVE_TABLES after all...
> 

It looks good to me to use CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP to
opt in to KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP. The most important point is
to ensure 'kvm->dirty_ring_with_bitmap == true' when the dirty ring capability
is enabled. In this way, we can fail early when KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
isn't enabled on an attempt to enable the dirty ring capability.

If both of you agree, I will integrate the suggested code changes in the
next respin, with the necessary tweaks.

- In kvm_vm_ioctl_enable_cap_generic(), 'kvm->dirty_ring_with_bitmap' is
   updated to 'true' unconditionally.

   static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
                                              struct kvm_enable_cap *cap)
   {
       :
       case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
            kvm->dirty_ring_with_bitmap = true;
            return 0;
   }

- In mark_page_dirty_in_slot(), we need comprehensive checks like below.

   void mark_page_dirty_in_slot(struct kvm *kvm,
                                const struct kvm_memory_slot *memslot,
                                gfn_t gfn)
   {
       struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

#ifdef CONFIG_HAVE_KVM_DIRTY_RING
       if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm))
           return;

#ifndef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
       if (WARN_ON_ONCE(!vcpu))
           return;
#endif
#endif

   }

- Use kvm_dirty_ring_exclusive(), which was suggested by Peter before.
   The function is used in various spots to decide whether the dirty
   bitmap can be created and accessed.

   bool kvm_dirty_ring_exclusive(struct kvm *kvm)
   {
       return kvm->dirty_ring_size && !kvm->dirty_ring_with_bitmap;
   }


>> Also more flexible to expose it as generic cap? E.g., one day x86 can
>> enable this too for whatever reason (even though I don't think so..).
> 
> I had imagined something like this patch where the arch opts-in to some
> generic construct if it *requires* the use of both the ring and bitmap
> (very rough sketch).
> 

Thanks,
Gavin
Gavin Shan Oct. 11, 2022, 6:31 a.m. UTC | #12
Hi Peter/Oliver,

On 10/11/22 11:56 AM, Gavin Shan wrote:
> On 10/11/22 9:12 AM, Oliver Upton wrote:
>> On Mon, Oct 10, 2022 at 08:20:29PM -0400, Peter Xu wrote:
>>> On Mon, Oct 10, 2022 at 11:58:22PM +0000, Oliver Upton wrote:
>>>> I think this further drives the point home -- there's zero need for the
>>>> bitmap with dirty ring on x86, so why even support it? The proposal of
>>>> ALLOW_BITMAP && DIRTY_RING should be arm64-specific. Any other arch that
>>>> needs to dirty memory outside of a vCPU context can opt-in to the
>>>> behavior.
>>>
>>> Yeah that sounds workable too, but it'll be slightly hackish as then the
>>> user app will need some "#ifdef ARM64" blocks for e.g. sync dirty bitmap.
>>> With the new cap the user app can implement the whole ring with generic
>>> code.
>>
>> Isn't the current route of exposing ALLOW_BITMAP on other arches for no
>> reason headed in exactly that direction? Userspace would need to know if
>> it _really_ needs the dirty bitmap in addition to the dirty ring, which
>> could take the form of architecture ifdeffery.
>>
>> OTOH, if the cap is only exposed when it is absolutely necessary, an
>> arch-generic live migration implementation could enable the cap whenever
>> it is advertised and scan the bitmap accordingly.
>>
>> The VMM must know something about the architecture it is running on, as
>> it calls KVM_DEV_ARM_ITS_SAVE_TABLES after all...
>>
> 
> It looks good to me to use CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP to
> opt in to KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP. The most important point is
> to ensure 'kvm->dirty_ring_with_bitmap == true' when the dirty ring capability
> is enabled. In this way, we can fail early when KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
> isn't enabled on an attempt to enable the dirty ring capability.
> 
> If both of you agree, I will integrate the suggested code changes in the
> next respin, with the necessary tweaks.
> 
> - In kvm_vm_ioctl_enable_cap_generic(), 'kvm->dirty_ring_with_bitmap' is
>    updated to 'true' unconditionally.
> 
>    static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>                                               struct kvm_enable_cap *cap)
>    {
>        :
>        case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
>             kvm->dirty_ring_with_bitmap = true;
>             return 0;
>    }
> 
> - In mark_page_dirty_in_slot(), we need comprehensive checks like below.
> 
>    void mark_page_dirty_in_slot(struct kvm *kvm,
>                                 const struct kvm_memory_slot *memslot,
>                                 gfn_t gfn)
>    {
>        struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> 
> #ifdef CONFIG_HAVE_KVM_DIRTY_RING
>        if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm))
>            return;
> 
> #ifndef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
>        if (WARN_ON_ONCE(!vcpu))
>            return;
> #endif
> #endif
> 
>    }
> 
> - Use kvm_dirty_ring_exclusive(), which was suggested by Peter before.
>    The function is used in various spots to decide whether the dirty
>    bitmap can be created and accessed.
> 
>    bool kvm_dirty_ring_exclusive(struct kvm *kvm)
>    {
>        return kvm->dirty_ring_size && !kvm->dirty_ring_with_bitmap;
>    }
> 
> 

I've included Oliver's suggested changes in v6, which was just posted:

https://lore.kernel.org/kvmarm/3123a04f-a674-782b-9e9b-0baf3db49ebc@redhat.com/

Please find your time to review v6 directly, thanks!

>>> Also more flexible to expose it as generic cap? E.g., one day x86 can
>>> enable this too for whatever reason (even though I don't think so..).
>>
>> I had imagined something like this patch where the arch opts-in to some
>> generic construct if it *requires* the use of both the ring and bitmap
>> (very rough sketch).
>>

Thanks,
Gavin
Peter Xu Oct. 14, 2022, 4:55 p.m. UTC | #13
On Tue, Oct 11, 2022 at 01:12:43AM +0000, Oliver Upton wrote:
> The VMM must know something about the architecture it is running on, as
> it calls KVM_DEV_ARM_ITS_SAVE_TABLES after all...

IIUC this is still a kernel impl detail to flush data into guest pages
within this ioctl, or am I wrong?

For example, I'm assuming it's safe to change the KVM_DEV_ARM_ITS_SAVE_TABLES
impl one day to not flush data to guest memory; then the kernel should
also disable the ALLOW_BITMAP cap in the same patch, so that any old qemu
binary that supports arm64 dirty ring will naturally skip all the bitmap
ops and become the same as what it does with x86 when running on that new
kernel.  With the implicit approach suggested, we need to modify QEMU.

Changing impl of KVM_DEV_ARM_ITS_SAVE_TABLES is probably not a good
example.. but just want to show what I meant.  Fundamentally it sounds
cleaner if it's the kernel that tells the user "okay you collected the
ring, but that's not enough; you need to collect the bitmap too", rather
than assuming the user app will always know what kvm did in details.  No
strong opinion though, as I could also have misunderstood how arm works.
Oliver Upton Oct. 18, 2022, 7:38 a.m. UTC | #14
On Fri, Oct 14, 2022 at 12:55:35PM -0400, Peter Xu wrote:
> On Tue, Oct 11, 2022 at 01:12:43AM +0000, Oliver Upton wrote:
> > The VMM must know something about the architecture it is running on, as
> > it calls KVM_DEV_ARM_ITS_SAVE_TABLES after all...
> 
> IIUC this is still a kernel impl detail to flush data into guest pages
> within this ioctl, or am I wrong?

Somewhat...

The guest is assigning memory from the IPA space to back the ITS tables,
but KVM maintains its own internal representation. It just so happens
that we've conditioned userspace to be aware that ITS emulation is
incoherent w.r.t. the guest memory that backs the tables.

> For example, I'm assuming it's safe to change the KVM_DEV_ARM_ITS_SAVE_TABLES
> impl one day to not flush data to guest memory; then the kernel should
> also disable the ALLOW_BITMAP cap in the same patch, so that any old qemu
> binary that supports arm64 dirty ring will naturally skip all the bitmap
> ops and become the same as what it does with x86 when running on that new
> kernel.  With the implicit approach suggested, we need to modify QEMU.
> 
> Changing impl of KVM_DEV_ARM_ITS_SAVE_TABLES is probably not a good
> example.. but just want to show what I meant.  Fundamentally it sounds
> cleaner if it's the kernel that tells the user "okay you collected the
> ring, but that's not enough; you need to collect the bitmap too", rather
> than assuming the user app will always know what kvm did in details.  No
> strong opinion though, as I could also have misunderstood how arm works.

I think the SAVE_TABLES ioctl is likely here to stay given the odd quirk
that it really is guest memory, so we'll probably need the bitmap on
arm64 for a long time. Even if we were to kill it, userspace would need
to take a change anyway to switch to a new ITS migration mechanism.

If we ever get to the point that we can relax this restriction i think a
flag on the BITMAP_WITH_TABLE cap that says "I don't actually set any
bits in the bitmap" would do. We shouldn't hide the cap entirely, as
that would be ABI breakage for VMMs that expect bitmap+ring.
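
In kvm_vm_ioctl_check_extension_generic() that could be something like (a
sketch; the flag and its plumbing are made up):

	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
		/* bit 0: cap supported; bit 1: bitmap never actually dirtied */
		return 1 | KVM_DIRTY_LOG_RING_WITH_BITMAP_UNUSED;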

Thoughts?

--
Thanks,
Oliver
Oliver Upton Oct. 18, 2022, 7:40 a.m. UTC | #15
On Tue, Oct 18, 2022 at 10:38:10AM +0300, Oliver Upton wrote:
> On Fri, Oct 14, 2022 at 12:55:35PM -0400, Peter Xu wrote:
> > On Tue, Oct 11, 2022 at 01:12:43AM +0000, Oliver Upton wrote:
> > > The VMM must know something about the architecture it is running on, as
> > > it calls KVM_DEV_ARM_ITS_SAVE_TABLES after all...
> > 
> > IIUC this is still a kernel impl detail to flush data into guest pages
> > within this ioctl, or am I wrong?
> 
> Somewhat...
> 
> The guest is assigning memory from the IPA space to back the ITS tables,
> but KVM maintains its own internal representation. It just so happens
> that we've conditioned userspace to be aware that ITS emulation is
> incoherent w.r.t. the guest memory that backs the tables.
> 
> > For example, I'm assuming it's safe to change the KVM_DEV_ARM_ITS_SAVE_TABLES
> > impl one day to not flush data to guest memory; then the kernel should
> > also disable the ALLOW_BITMAP cap in the same patch, so that any old qemu
> > binary that supports arm64 dirty ring will naturally skip all the bitmap
> > ops and become the same as what it does with x86 when running on that new
> > kernel.  With the implicit approach suggested, we need to modify QEMU.
> > 
> > Changing impl of KVM_DEV_ARM_ITS_SAVE_TABLES is probably not a good
> > example.. but just want to show what I meant.  Fundamentally it sounds
> > cleaner if it's the kernel that tells the user "okay you collected the
> > ring, but that's not enough; you need to collect the bitmap too", rather
> > than assuming the user app will always know what kvm did in details.  No
> > strong opinion though, as I could also have misunderstood how arm works.
> 
> I think the SAVE_TABLES ioctl is likely here to stay given the odd quirk
> that it really is guest memory, so we'll probably need the bitmap on
> arm64 for a long time. Even if we were to kill it, userspace would need
> to take a change anyway to switch to a new ITS migration mechanism.
> 
> If we ever get to the point that we can relax this restriction i think a
> flag on the BITMAP_WITH_TABLE cap that says "I don't actually set any

BITMAP_WITH_RING

> bits in the bitmap" would do. We shouldn't hide the cap entirely, as
> that would be ABI breakage for VMMs that expect bitmap+ring.
> 
> Thoughts?
> 
> --
> Thanks,
> Oliver
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Peter Xu Oct. 18, 2022, 3:50 p.m. UTC | #16
On Tue, Oct 18, 2022 at 10:38:10AM +0300, Oliver Upton wrote:
> If we ever get to the point that we can relax this restriction i think a
> flag on the BITMAP_WITH_TABLE cap that says "I don't actually set any
> bits in the bitmap" would do. We shouldn't hide the cap entirely, as
> that would be ABI breakage for VMMs that expect bitmap+ring.

I'd rather drop the cap directly if it's just a boolean that tells us
"whether we need bitmap to back rings". Btw when I said "dropping it" I
meant "don't return 1 for the cap anymore" - we definitely need to make the
cap macro stable as part of kvm API..

But I think I misunderstood the proposal previously, sorry. I assumed we
were discussing an internal HAVE_DIRTY_RING_WITH_BITMAP only.  I noticed
this only after I had a closer look at Gavin's patch.  Having a cap exposed
always sounds good to me.

I'll comment on Gavin's patch directly, thanks.

Patch

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 32427ea160df..4f5e09042f8a 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8019,8 +8019,8 @@  guest according to the bits in the KVM_CPUID_FEATURES CPUID leaf
 (0x40000001). Otherwise, a guest may use the paravirtual features
 regardless of what has actually been exposed through the CPUID leaf.
 
-8.29 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL
-----------------------------------------------------------
+8.29 KVM_CAP_DIRTY_LOG_{RING, RING_ACQ_REL, RING_ALLOW_BITMAP}
+--------------------------------------------------------------
 
 :Architectures: x86
 :Parameters: args[0] - size of the dirty log ring
@@ -8104,13 +8104,6 @@  flushing is done by the KVM_GET_DIRTY_LOG ioctl).  To achieve that, one
 needs to kick the vcpu out of KVM_RUN using a signal.  The resulting
 vmexit ensures that all dirty GFNs are flushed to the dirty rings.
 
-NOTE: the capability KVM_CAP_DIRTY_LOG_RING and the corresponding
-ioctl KVM_RESET_DIRTY_RINGS are mutual exclusive to the existing ioctls
-KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG.  After enabling
-KVM_CAP_DIRTY_LOG_RING with an acceptable dirty ring size, the virtual
-machine will switch to ring-buffer dirty page tracking and further
-KVM_GET_DIRTY_LOG or KVM_CLEAR_DIRTY_LOG ioctls will fail.
-
 NOTE: KVM_CAP_DIRTY_LOG_RING_ACQ_REL is the only capability that
 should be exposed by weakly ordered architecture, in order to indicate
 the additional memory ordering requirements imposed on userspace when
@@ -8119,6 +8112,12 @@  Architecture with TSO-like ordering (such as x86) are allowed to
 expose both KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL
 to userspace.
 
+NOTE: In some cases there is no running vcpu and no available vcpu dirty
+ring when pages become dirty. One example is saving arm64's vgic/its
+tables during migration. The dirty bitmap is still used to track those
+dirty pages, as indicated by KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP. The dirty
+bitmap is accessed by the KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG ioctls.
+
 8.30 KVM_CAP_XEN_HVM
 --------------------
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 0d5d4419139a..a1f10bc5fe8b 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1178,6 +1178,7 @@  struct kvm_ppc_resize_hpt {
 #define KVM_CAP_S390_ZPCI_OP 221
 #define KVM_CAP_S390_CPU_TOPOLOGY 222
 #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
+#define KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP 224
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5b064dbadaf4..505e3840cf0c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1617,7 +1617,7 @@  static int kvm_prepare_memory_region(struct kvm *kvm,
 			new->dirty_bitmap = NULL;
 		else if (old && old->dirty_bitmap)
 			new->dirty_bitmap = old->dirty_bitmap;
-		else if (!kvm->dirty_ring_size) {
+		else {
 			r = kvm_alloc_dirty_bitmap(new);
 			if (r)
 				return r;
@@ -2060,10 +2060,6 @@  int kvm_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log,
 	unsigned long n;
 	unsigned long any = 0;
 
-	/* Dirty ring tracking is exclusive to dirty log tracking */
-	if (kvm->dirty_ring_size)
-		return -ENXIO;
-
 	*memslot = NULL;
 	*is_dirty = 0;
 
@@ -2125,10 +2121,6 @@  static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
 	unsigned long *dirty_bitmap_buffer;
 	bool flush;
 
-	/* Dirty ring tracking is exclusive to dirty log tracking */
-	if (kvm->dirty_ring_size)
-		return -ENXIO;
-
 	as_id = log->slot >> 16;
 	id = (u16)log->slot;
 	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
@@ -2237,10 +2229,6 @@  static int kvm_clear_dirty_log_protect(struct kvm *kvm,
 	unsigned long *dirty_bitmap_buffer;
 	bool flush;
 
-	/* Dirty ring tracking is exclusive to dirty log tracking */
-	if (kvm->dirty_ring_size)
-		return -ENXIO;
-
 	as_id = log->slot >> 16;
 	id = (u16)log->slot;
 	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
@@ -3305,7 +3293,7 @@  void mark_page_dirty_in_slot(struct kvm *kvm,
 	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
 #ifdef CONFIG_HAVE_KVM_DIRTY_RING
-	if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm))
+	if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm))
 		return;
 #endif
 
@@ -3313,7 +3301,7 @@  void mark_page_dirty_in_slot(struct kvm *kvm,
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
 		u32 slot = (memslot->as_id << 16) | memslot->id;
 
-		if (kvm->dirty_ring_size)
+		if (kvm->dirty_ring_size && vcpu)
 			kvm_dirty_ring_push(&vcpu->dirty_ring,
 					    slot, rel_gfn);
 		else
@@ -4485,6 +4473,12 @@  static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 		return KVM_DIRTY_RING_MAX_ENTRIES * sizeof(struct kvm_dirty_gfn);
 #else
 		return 0;
+#endif
+	case KVM_CAP_DIRTY_LOG_RING_ALLOW_BITMAP:
+#ifdef CONFIG_HAVE_KVM_DIRTY_RING
+		return 1;
+#else
+		return 0;
 #endif
 	case KVM_CAP_BINARY_STATS_FD:
 	case KVM_CAP_SYSTEM_EVENT_DATA: