| Message ID | 20221108041039.111145-4-gshan@redhat.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | KVM: arm64: Enable ring-based dirty memory tracking |
On Tue, Nov 08, 2022, Gavin Shan wrote:
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 800f9470e36b..228be1145cf3 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -33,6 +33,14 @@ config HAVE_KVM_DIRTY_RING_ACQ_REL
>  	bool
>  	select HAVE_KVM_DIRTY_RING
>  
> +# Only architectures that need to dirty memory outside of a vCPU
> +# context should select this, advertising to userspace the
> +# requirement to use a dirty bitmap in addition to the vCPU dirty
> +# ring.

The Kconfig does more than advertise a feature to userspace.

# Allow enabling both the dirty bitmap and dirty ring. Only architectures that
# need to dirty memory outside of a vCPU context should select this.

> +config HAVE_KVM_DIRTY_RING_WITH_BITMAP

I think we should replace "HAVE" with "NEED".  Any architecture that supports the
dirty ring can easily support ring+bitmap, but based on the discussion from v5[*],
the comment above, and the fact that the bitmap will _never_ be used in the
proposed implementation because x86 will always have a vCPU, this Kconfig should
only be selected if the bitmap is needed to support migration.

[*] https://lore.kernel.org/all/Y0SxnoT5u7+1TCT+@google.com

> +	bool
> +	depends on HAVE_KVM_DIRTY_RING
> +
>  config HAVE_KVM_EVENTFD
>  	bool
>  	select EVENTFD
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> index fecbb7d75ad2..f95cbcdd74ff 100644
> --- a/virt/kvm/dirty_ring.c
> +++ b/virt/kvm/dirty_ring.c
> @@ -21,6 +21,18 @@ u32 kvm_dirty_ring_get_rsvd_entries(void)
>  	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
>  }
>  
> +bool kvm_use_dirty_bitmap(struct kvm *kvm)
> +{
> +	lockdep_assert_held(&kvm->slots_lock);
> +
> +	return !kvm->dirty_ring_size || kvm->dirty_ring_with_bitmap;
> +}
> +
> +bool __weak kvm_arch_allow_write_without_running_vcpu(struct kvm *kvm)

Rather than __weak, what about wrapping this with an #ifdef to effectively force
architectures to implement the override if they need ring+bitmap?  Given that the
bitmap will never be used if there's a running vCPU, selecting the Kconfig without
overriding this utility can't possibly be correct.

#ifndef CONFIG_NEED_KVM_DIRTY_RING_WITH_BITMAP
bool kvm_arch_allow_write_without_running_vcpu(struct kvm *kvm)
{
	return false;
}
#endif

> @@ -4588,6 +4608,29 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>  			return -EINVAL;
>  
>  		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
> +	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP: {
> +		int r = -EINVAL;
> +
> +		if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
> +		    !kvm->dirty_ring_size)

I have no objection to disallowing userspace from disabling the combo, but I
think it's worth requiring cap->args[0] to be '0' just in case we change our minds
in the future.

> +			return r;
> +
> +		mutex_lock(&kvm->slots_lock);
> +
> +		/*
> +		 * For simplicity, allow enabling ring+bitmap if and only if
> +		 * there are no memslots, e.g. to ensure all memslots allocate
> +		 * a bitmap after the capability is enabled.
> +		 */
> +		if (kvm_are_all_memslots_empty(kvm)) {
> +			kvm->dirty_ring_with_bitmap = true;
> +			r = 0;
> +		}
> +
> +		mutex_unlock(&kvm->slots_lock);
> +
> +		return r;
> +	}
> 	default:
> 		return kvm_vm_ioctl_enable_cap(kvm, cap);
> 	}
> -- 
> 2.23.0
> 
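For illustration only: under the scheme suggested above, an architecture that
selects the proposed NEED option would supply its own non-stub override. A
minimal sketch, assuming a hypothetical arch flag set while tables are saved
outside of a vCPU context (this is not code from the posted series):

/*
 * Illustrative sketch, not part of the posted patch: the field name
 * "table_save_in_progress" is made up for the example.
 */
bool kvm_arch_allow_write_without_running_vcpu(struct kvm *kvm)
{
	/* Only allow vCPU-less dirtying while the arch save operation runs. */
	return READ_ONCE(kvm->arch.table_save_in_progress);
}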
Hi Sean,

On 11/9/22 12:25 AM, Sean Christopherson wrote:
> On Tue, Nov 08, 2022, Gavin Shan wrote:
>> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
>> index 800f9470e36b..228be1145cf3 100644
>> --- a/virt/kvm/Kconfig
>> +++ b/virt/kvm/Kconfig
>> @@ -33,6 +33,14 @@ config HAVE_KVM_DIRTY_RING_ACQ_REL
>>  	bool
>>  	select HAVE_KVM_DIRTY_RING
>>  
>> +# Only architectures that need to dirty memory outside of a vCPU
>> +# context should select this, advertising to userspace the
>> +# requirement to use a dirty bitmap in addition to the vCPU dirty
>> +# ring.
> 
> The Kconfig does more than advertise a feature to userspace.
> 
> # Allow enabling both the dirty bitmap and dirty ring. Only architectures that
> # need to dirty memory outside of a vCPU context should select this.
> 

Agreed. The comments will be adjusted accordingly.

>> +config HAVE_KVM_DIRTY_RING_WITH_BITMAP
> 
> I think we should replace "HAVE" with "NEED".  Any architecture that supports the
> dirty ring can easily support ring+bitmap, but based on the discussion from v5[*],
> the comment above, and the fact that the bitmap will _never_ be used in the
> proposed implementation because x86 will always have a vCPU, this Kconfig should
> only be selected if the bitmap is needed to support migration.
> 
> [*] https://lore.kernel.org/all/Y0SxnoT5u7+1TCT+@google.com
> 

Both look good to me. Let's change it to CONFIG_NEED_KVM_DIRTY_RING_WITH_BITMAP then.

>> +	bool
>> +	depends on HAVE_KVM_DIRTY_RING
>> +
>>  config HAVE_KVM_EVENTFD
>>  	bool
>>  	select EVENTFD
>> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
>> index fecbb7d75ad2..f95cbcdd74ff 100644
>> --- a/virt/kvm/dirty_ring.c
>> +++ b/virt/kvm/dirty_ring.c
>> @@ -21,6 +21,18 @@ u32 kvm_dirty_ring_get_rsvd_entries(void)
>>  	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
>>  }
>>  
>> +bool kvm_use_dirty_bitmap(struct kvm *kvm)
>> +{
>> +	lockdep_assert_held(&kvm->slots_lock);
>> +
>> +	return !kvm->dirty_ring_size || kvm->dirty_ring_with_bitmap;
>> +}
>> +
>> +bool __weak kvm_arch_allow_write_without_running_vcpu(struct kvm *kvm)
> 
> Rather than __weak, what about wrapping this with an #ifdef to effectively force
> architectures to implement the override if they need ring+bitmap?  Given that the
> bitmap will never be used if there's a running vCPU, selecting the Kconfig without
> overriding this utility can't possibly be correct.
> 
> #ifndef CONFIG_NEED_KVM_DIRTY_RING_WITH_BITMAP
> bool kvm_arch_allow_write_without_running_vcpu(struct kvm *kvm)
> {
> 	return false;
> }
> #endif
> 

It's a good idea, which will be included in the next revision :)

>> @@ -4588,6 +4608,29 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>>  			return -EINVAL;
>>  
>>  		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
>> +	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP: {
>> +		int r = -EINVAL;
>> +
>> +		if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
>> +		    !kvm->dirty_ring_size)
> 
> I have no objection to disallowing userspace from disabling the combo, but I
> think it's worth requiring cap->args[0] to be '0' just in case we change our minds
> in the future.
> 

I assume you're suggesting to have non-zero value in cap->args[0] to enable the
capability.

	if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
	    !kvm->dirty_ring_size || !cap->args[0])
		return r;

>> +			return r;
>> +
>> +		mutex_lock(&kvm->slots_lock);
>> +
>> +		/*
>> +		 * For simplicity, allow enabling ring+bitmap if and only if
>> +		 * there are no memslots, e.g. to ensure all memslots allocate
>> +		 * a bitmap after the capability is enabled.
>> +		 */
>> +		if (kvm_are_all_memslots_empty(kvm)) {
>> +			kvm->dirty_ring_with_bitmap = true;
>> +			r = 0;
>> +		}
>> +
>> +		mutex_unlock(&kvm->slots_lock);
>> +
>> +		return r;
>> +	}
>>  	default:
>>  		return kvm_vm_ioctl_enable_cap(kvm, cap);
>>  	}

Thanks,
Gavin
On Wed, Nov 09, 2022, Gavin Shan wrote:
> Hi Sean,
> 
> On 11/9/22 12:25 AM, Sean Christopherson wrote:
> > I have no objection to disallowing userspace from disabling the combo, but I
> > think it's worth requiring cap->args[0] to be '0' just in case we change our minds
> > in the future.
> > 
> 
> I assume you're suggesting to have non-zero value in cap->args[0] to enable the
> capability.
> 
> 	if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
> 	    !kvm->dirty_ring_size || !cap->args[0])
> 		return r;

I was actually thinking of taking the lazy route and requiring userspace to zero
the arg, i.e. treat it as a flags extension.  Oh, wait, that's silly.  I always
forget that `cap->flags` exists.

Just this?

	if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
	    !kvm->dirty_ring_size || cap->flags)
		return r;

It'll be kinda awkward if KVM ever does add a flag to disable the bitmap, but
that seems quite unlikely and not the end of the world if it does happen.  And
on the other hand, requiring '0' is less weird and less annoying for userspace
_now_.
Hi Sean,

On 11/9/22 8:05 AM, Sean Christopherson wrote:
> On Wed, Nov 09, 2022, Gavin Shan wrote:
>> On 11/9/22 12:25 AM, Sean Christopherson wrote:
>>> I have no objection to disallowing userspace from disabling the combo, but I
>>> think it's worth requiring cap->args[0] to be '0' just in case we change our minds
>>> in the future.
>>>
>>
>> I assume you're suggesting to have non-zero value in cap->args[0] to enable the
>> capability.
>>
>> 	if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
>> 	    !kvm->dirty_ring_size || !cap->args[0])
>> 		return r;
> 
> I was actually thinking of taking the lazy route and requiring userspace to zero
> the arg, i.e. treat it as a flags extension.  Oh, wait, that's silly.  I always
> forget that `cap->flags` exists.
> 
> Just this?
> 
> 	if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
> 	    !kvm->dirty_ring_size || cap->flags)
> 		return r;
> 
> It'll be kinda awkward if KVM ever does add a flag to disable the bitmap, but
> that seems quite unlikely and not the end of the world if it does happen.  And
> on the other hand, requiring '0' is less weird and less annoying for userspace
> _now_.
> 

I don't quite understand the term "lazy route". So you're still thinking of the
possibility to allow disabling the capability in future? If so, cap->flags or
cap->args[0] can be used. For now, we just need a binding between
cap->flags/args[0] and the operation of enabling the capability. For example,
"cap->flags == 0x0" means to enable the capability for now, and
"cap->flags != 0x0" to disable the capability in future.

The suggested changes look good to me in either way. Sean, can I grab your
reviewed-by with your comments addressed? I'm making the next revision (v10)
the final one :)

Thanks,
Gavin
On Wed, Nov 09, 2022, Gavin Shan wrote:
> Hi Sean,
> 
> On 11/9/22 8:05 AM, Sean Christopherson wrote:
> > On Wed, Nov 09, 2022, Gavin Shan wrote:
> > > On 11/9/22 12:25 AM, Sean Christopherson wrote:
> > > > I have no objection to disallowing userspace from disabling the combo, but I
> > > > think it's worth requiring cap->args[0] to be '0' just in case we change our minds
> > > > in the future.
> > > > 
> > > 
> > > I assume you're suggesting to have non-zero value in cap->args[0] to enable the
> > > capability.
> > > 
> > > 	if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
> > > 	    !kvm->dirty_ring_size || !cap->args[0])
> > > 		return r;
> > 
> > I was actually thinking of taking the lazy route and requiring userspace to zero
> > the arg, i.e. treat it as a flags extension.  Oh, wait, that's silly.  I always
> > forget that `cap->flags` exists.
> > 
> > Just this?
> > 
> > 	if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
> > 	    !kvm->dirty_ring_size || cap->flags)
> > 		return r;
> > 
> > It'll be kinda awkward if KVM ever does add a flag to disable the bitmap, but
> > that seems quite unlikely and not the end of the world if it does happen.  And
> > on the other hand, requiring '0' is less weird and less annoying for userspace
> > _now_.
> > 
> 
> I don't quite understand the term "lazy route".

"lazy" in that requiring a non-zero value would mean adding another #define,
otherwise the extensibility is limited to two values.  Again, unlikely to matter,
but it wouldn't make sense to go through the effort to provide some extensibility
and then only allow for one possible extension.  If KVM is "lazy" and just requires
flags to be '0', then there's no need for more #defines, and userspace doesn't
have to pass more values in its enabling.

> So you're still thinking of the possibility to allow disabling the capability
> in future?

Yes, or more likely, tweaking the behavior of ring+bitmap.  As is, the behavior
is purely a fallback for a single case where KVM can't push to the dirty ring due
to not having a running vCPU.  It's possible someone might come up with a use case
where they want KVM to do something different, e.g. fallback to the bitmap if the
ring is full.

In other words, it's mostly to hedge against futures we haven't thought of.  Reserving
cap->flags is cheap and easy for both KVM and userspace, so there's no real reason
not to do so.

> If so, cap->flags or cap->args[0] can be used. For now, we just
> need a binding between cap->flags/args[0] and the operation of enabling the
> capability. For example, "cap->flags == 0x0" means to enable the capability
> for now, and "cap->flags != 0x0" to disable the capability in future.
> 
> The suggested changes look good to me in either way. Sean, can I grab your
> reviewed-by with your comments addressed?

I'll look at v10, I don't like providing reviews that are conditional on changes
that are more than nits.

That said, there're no remaining issues that can't be sorted out on top, so don't
hold up v10 if I don't look at it in a timely manner for whatever reason.  I agree
with Marc that it'd be good to get this in -next sooner than later.
Hi Sean,

On 11/9/22 8:32 AM, Sean Christopherson wrote:
> On Wed, Nov 09, 2022, Gavin Shan wrote:
>> On 11/9/22 8:05 AM, Sean Christopherson wrote:
>>> On Wed, Nov 09, 2022, Gavin Shan wrote:
>>>> On 11/9/22 12:25 AM, Sean Christopherson wrote:
>>>>> I have no objection to disallowing userspace from disabling the combo, but I
>>>>> think it's worth requiring cap->args[0] to be '0' just in case we change our minds
>>>>> in the future.
>>>>>
>>>>
>>>> I assume you're suggesting to have non-zero value in cap->args[0] to enable the
>>>> capability.
>>>>
>>>> 	if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
>>>> 	    !kvm->dirty_ring_size || !cap->args[0])
>>>> 		return r;
>>>
>>> I was actually thinking of taking the lazy route and requiring userspace to zero
>>> the arg, i.e. treat it as a flags extension.  Oh, wait, that's silly.  I always
>>> forget that `cap->flags` exists.
>>>
>>> Just this?
>>>
>>> 	if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
>>> 	    !kvm->dirty_ring_size || cap->flags)
>>> 		return r;
>>>
>>> It'll be kinda awkward if KVM ever does add a flag to disable the bitmap, but
>>> that seems quite unlikely and not the end of the world if it does happen.  And
>>> on the other hand, requiring '0' is less weird and less annoying for userspace
>>> _now_.
>>>
>>
>> I don't quite understand the term "lazy route".
> 
> "lazy" in that requiring a non-zero value would mean adding another #define,
> otherwise the extensibility is limited to two values.  Again, unlikely to matter,
> but it wouldn't make sense to go through the effort to provide some extensibility
> and then only allow for one possible extension.  If KVM is "lazy" and just requires
> flags to be '0', then there's no need for more #defines, and userspace doesn't
> have to pass more values in its enabling.
> 

Thanks for the explanation. I understand the term 'lazy route' now. Right,
cap->flags is a good place to hold #defines in future. cap->args[0] doesn't
strictly fit here.

>> So you're still thinking of the possibility to allow disabling the capability
>> in future?
> 
> Yes, or more likely, tweaking the behavior of ring+bitmap.  As is, the behavior
> is purely a fallback for a single case where KVM can't push to the dirty ring due
> to not having a running vCPU.  It's possible someone might come up with a use case
> where they want KVM to do something different, e.g. fallback to the bitmap if the
> ring is full.
> 
> In other words, it's mostly to hedge against futures we haven't thought of.  Reserving
> cap->flags is cheap and easy for both KVM and userspace, so there's no real reason
> not to do so.
> 

Agreed that it's cheap to reserve cap->flags. I will change the code accordingly
in v10.

>> If so, cap->flags or cap->args[0] can be used. For now, we just
>> need a binding between cap->flags/args[0] and the operation of enabling the
>> capability. For example, "cap->flags == 0x0" means to enable the capability
>> for now, and "cap->flags != 0x0" to disable the capability in future.
>>
>> The suggested changes look good to me in either way. Sean, can I grab your
>> reviewed-by with your comments addressed?
> 
> I'll look at v10, I don't like providing reviews that are conditional on changes
> that are more than nits.
> 
> That said, there're no remaining issues that can't be sorted out on top, so don't
> hold up v10 if I don't look at it in a timely manner for whatever reason.  I agree
> with Marc that it'd be good to get this in -next sooner than later.
> 

Sure. I would give v9 a few days, prior to posting v10. I'm not sure
if other people still have concerns. If there are more comments, I
want to address all of them in v10 :)

Thanks,
Gavin
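For illustration, the userspace side of the enablement ordering being discussed
might look roughly like the sketch below, assuming the uapi definitions from this
series are available; the ring size is an arbitrary example value and error
handling is minimal:

#include <linux/kvm.h>
#include <string.h>
#include <sys/ioctl.h>

/* Sketch: enable the dirty ring, then ring+bitmap, before any memslot exists. */
static int enable_ring_with_bitmap(int vm_fd, __u32 ring_bytes)
{
	struct kvm_enable_cap cap;

	memset(&cap, 0, sizeof(cap));
	cap.cap = KVM_CAP_DIRTY_LOG_RING_ACQ_REL;	/* or KVM_CAP_DIRTY_LOG_RING on x86 */
	cap.args[0] = ring_bytes;			/* ring size in bytes */
	if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap))
		return -1;

	memset(&cap, 0, sizeof(cap));
	cap.cap = KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP;
	/* cap.flags and cap.args[] stay zero, as required by the proposal. */
	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}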
Hi Gavin,

On Wed, 09 Nov 2022 00:51:21 +0000,
Gavin Shan <gshan@redhat.com> wrote:
> 
> On 11/9/22 8:32 AM, Sean Christopherson wrote:
> > That said, there're no remaining issues that can't be sorted out
> > on top, so don't hold up v10 if I don't look at it in a timely
> > manner for whatever reason.  I agree with Marc that it'd be good
> > to get this in -next sooner than later.
> > 
> 
> Sure. I would give v9 a few days, prior to posting v10. I'm not sure
> if other people still have concerns. If there are more comments, I
> want to address all of them in v10 :)

Please post v10 ASAP. I'm a bit behind on queuing stuff, and I'll be
travelling next week, making it a bit more difficult to be on top of
things. So whatever I can put into -next now is good.

Thanks,

	M.
Hi Marc,

On 11/10/22 6:25 PM, Marc Zyngier wrote:
> On Wed, 09 Nov 2022 00:51:21 +0000,
> Gavin Shan <gshan@redhat.com> wrote:
>>
>> On 11/9/22 8:32 AM, Sean Christopherson wrote:
>>> That said, there're no remaining issues that can't be sorted out
>>> on top, so don't hold up v10 if I don't look at it in a timely
>>> manner for whatever reason.  I agree with Marc that it'd be good
>>> to get this in -next sooner than later.
>>>
>>
>> Sure. I would give v9 a few days, prior to posting v10. I'm not sure
>> if other people still have concerns. If there are more comments, I
>> want to address all of them in v10 :)
> 
> Please post v10 ASAP. I'm a bit behind on queuing stuff, and I'll be
> travelling next week, making it a bit more difficult to be on top of
> things. So whatever I can put into -next now is good.
> 

Thanks, Marc. v10 was just posted :)

https://lore.kernel.org/kvmarm/20221110104914.31280-1-gshan@redhat.com/T/#t

Thanks,
Gavin
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index eee9f857a986..1f1b09aa6db4 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8003,13 +8003,6 @@ flushing is done by the KVM_GET_DIRTY_LOG ioctl).  To achieve that, one needs
 to kick the vcpu out of KVM_RUN using a signal.  The resulting
 vmexit ensures that all dirty GFNs are flushed to the dirty rings.
 
-NOTE: the capability KVM_CAP_DIRTY_LOG_RING and the corresponding
-ioctl KVM_RESET_DIRTY_RINGS are mutual exclusive to the existing ioctls
-KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG.  After enabling
-KVM_CAP_DIRTY_LOG_RING with an acceptable dirty ring size, the virtual
-machine will switch to ring-buffer dirty page tracking and further
-KVM_GET_DIRTY_LOG or KVM_CLEAR_DIRTY_LOG ioctls will fail.
-
 NOTE: KVM_CAP_DIRTY_LOG_RING_ACQ_REL is the only capability that
 should be exposed by weakly ordered architecture, in order to indicate
 the additional memory ordering requirements imposed on userspace when
@@ -8018,6 +8011,33 @@ Architecture with TSO-like ordering (such as x86) are allowed to
 expose both KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL
 to userspace.
 
+After enabling the dirty rings, the userspace needs to detect the
+capability of KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP to see whether the
+ring structures can be backed by per-slot bitmaps. With this capability
+advertised, it means the architecture can dirty guest pages without
+vcpu/ring context, so that some of the dirty information will still be
+maintained in the bitmap structure. KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
+can't be enabled if the capability of KVM_CAP_DIRTY_LOG_RING_ACQ_REL
+hasn't been enabled, or any memslot has been existing.
+
+Note that the bitmap here is only a backup of the ring structure. The
+use of the ring and bitmap combination is only beneficial if there is
+only a very small amount of memory that is dirtied out of vcpu/ring
+context. Otherwise, the stand-alone per-slot bitmap mechanism needs to
+be considered.
+
+To collect dirty bits in the backup bitmap, userspace can use the same
+KVM_GET_DIRTY_LOG ioctl. KVM_CLEAR_DIRTY_LOG isn't needed as long as all
+the generation of the dirty bits is done in a single pass. Collecting
+the dirty bitmap should be the very last thing that the VMM does before
+considering the state as complete. VMM needs to ensure that the dirty
+state is final and avoid missing dirty pages from another ioctl ordered
+after the bitmap collection.
+
+NOTE: One example of using the backup bitmap is saving arm64 vgic/its
+tables through KVM_DEV_ARM_{VGIC_GRP_CTRL, ITS_SAVE_TABLES} command on
+KVM device "kvm-arm-vgic-its" when dirty ring is enabled.
+
 8.30 KVM_CAP_XEN_HVM
 --------------------
 
diff --git a/Documentation/virt/kvm/devices/arm-vgic-its.rst b/Documentation/virt/kvm/devices/arm-vgic-its.rst
index d257eddbae29..e053124f77c4 100644
--- a/Documentation/virt/kvm/devices/arm-vgic-its.rst
+++ b/Documentation/virt/kvm/devices/arm-vgic-its.rst
@@ -52,7 +52,10 @@ KVM_DEV_ARM_VGIC_GRP_CTRL
 
   KVM_DEV_ARM_ITS_SAVE_TABLES
     save the ITS table data into guest RAM, at the location provisioned
-    by the guest in corresponding registers/table entries.
+    by the guest in corresponding registers/table entries. Should userspace
+    require a form of dirty tracking to identify which pages are modified
+    by the saving process, it should use a bitmap even if using another
+    mechanism to track the memory dirtied by the vCPUs.
 
     The layout of the tables in guest memory defines an ABI. The entries
     are laid out in little endian format as described in the last paragraph.
diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
index 199ead37b104..4862c98d80d3 100644
--- a/include/linux/kvm_dirty_ring.h
+++ b/include/linux/kvm_dirty_ring.h
@@ -37,6 +37,11 @@ static inline u32 kvm_dirty_ring_get_rsvd_entries(void)
 	return 0;
 }
 
+static inline bool kvm_use_dirty_bitmap(struct kvm *kvm)
+{
+	return true;
+}
+
 static inline int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring,
 				       int index, u32 size)
 {
@@ -67,6 +72,8 @@ static inline void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
 #else /* CONFIG_HAVE_KVM_DIRTY_RING */
 
 int kvm_cpu_dirty_log_size(void);
+bool kvm_use_dirty_bitmap(struct kvm *kvm);
+bool kvm_arch_allow_write_without_running_vcpu(struct kvm *kvm);
 u32 kvm_dirty_ring_get_rsvd_entries(void);
 int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size);
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 6fab55e58111..f51eb9419bfc 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -779,6 +779,7 @@ struct kvm {
 	pid_t userspace_pid;
 	unsigned int max_halt_poll_ns;
 	u32 dirty_ring_size;
+	bool dirty_ring_with_bitmap;
 	bool vm_bugged;
 	bool vm_dead;
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 0d5d4419139a..c87b5882d7ae 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1178,6 +1178,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_S390_ZPCI_OP 221
 #define KVM_CAP_S390_CPU_TOPOLOGY 222
 #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
+#define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 224
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 800f9470e36b..228be1145cf3 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -33,6 +33,14 @@ config HAVE_KVM_DIRTY_RING_ACQ_REL
 	bool
 	select HAVE_KVM_DIRTY_RING
 
+# Only architectures that need to dirty memory outside of a vCPU
+# context should select this, advertising to userspace the
+# requirement to use a dirty bitmap in addition to the vCPU dirty
+# ring.
+config HAVE_KVM_DIRTY_RING_WITH_BITMAP
+	bool
+	depends on HAVE_KVM_DIRTY_RING
+
 config HAVE_KVM_EVENTFD
 	bool
 	select EVENTFD
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index fecbb7d75ad2..f95cbcdd74ff 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -21,6 +21,18 @@ u32 kvm_dirty_ring_get_rsvd_entries(void)
 	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
 }
 
+bool kvm_use_dirty_bitmap(struct kvm *kvm)
+{
+	lockdep_assert_held(&kvm->slots_lock);
+
+	return !kvm->dirty_ring_size || kvm->dirty_ring_with_bitmap;
+}
+
+bool __weak kvm_arch_allow_write_without_running_vcpu(struct kvm *kvm)
+{
+	return false;
+}
+
 static u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
 {
 	return READ_ONCE(ring->dirty_index) - READ_ONCE(ring->reset_index);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c865d7d82685..5f32752b7d96 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1617,7 +1617,7 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
 			new->dirty_bitmap = NULL;
 		else if (old && old->dirty_bitmap)
 			new->dirty_bitmap = old->dirty_bitmap;
-		else if (!kvm->dirty_ring_size) {
+		else if (kvm_use_dirty_bitmap(kvm)) {
 			r = kvm_alloc_dirty_bitmap(new);
 			if (r)
 				return r;
@@ -2060,8 +2060,8 @@ int kvm_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log,
 	unsigned long n;
 	unsigned long any = 0;
 
-	/* Dirty ring tracking is exclusive to dirty log tracking */
-	if (kvm->dirty_ring_size)
+	/* Dirty ring tracking may be exclusive to dirty log tracking */
+	if (!kvm_use_dirty_bitmap(kvm))
 		return -ENXIO;
 
 	*memslot = NULL;
@@ -2125,8 +2125,8 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
 	unsigned long *dirty_bitmap_buffer;
 	bool flush;
 
-	/* Dirty ring tracking is exclusive to dirty log tracking */
-	if (kvm->dirty_ring_size)
+	/* Dirty ring tracking may be exclusive to dirty log tracking */
+	if (!kvm_use_dirty_bitmap(kvm))
 		return -ENXIO;
 
 	as_id = log->slot >> 16;
@@ -2237,8 +2237,8 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
 	unsigned long *dirty_bitmap_buffer;
 	bool flush;
 
-	/* Dirty ring tracking is exclusive to dirty log tracking */
-	if (kvm->dirty_ring_size)
+	/* Dirty ring tracking may be exclusive to dirty log tracking */
+	if (!kvm_use_dirty_bitmap(kvm))
 		return -ENXIO;
 
 	as_id = log->slot >> 16;
@@ -3305,7 +3305,10 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
 	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
 #ifdef CONFIG_HAVE_KVM_DIRTY_RING
-	if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm))
+	if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm))
+		return;
+
+	if (WARN_ON_ONCE(!kvm_arch_allow_write_without_running_vcpu(kvm) && !vcpu))
 		return;
 #endif
 
@@ -3313,7 +3316,7 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
 		u32 slot = (memslot->as_id << 16) | memslot->id;
 
-		if (kvm->dirty_ring_size)
+		if (kvm->dirty_ring_size && vcpu)
 			kvm_dirty_ring_push(vcpu, slot, rel_gfn);
 		else
 			set_bit_le(rel_gfn, memslot->dirty_bitmap);
@@ -4482,6 +4485,9 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 		return KVM_DIRTY_RING_MAX_ENTRIES * sizeof(struct kvm_dirty_gfn);
 #else
 		return 0;
+#endif
+#ifdef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
+	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
 #endif
 	case KVM_CAP_BINARY_STATS_FD:
 	case KVM_CAP_SYSTEM_EVENT_DATA:
@@ -4558,6 +4564,20 @@ int __attribute__((weak)) kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 	return -EINVAL;
 }
 
+static bool kvm_are_all_memslots_empty(struct kvm *kvm)
+{
+	int i;
+
+	lockdep_assert_held(&kvm->slots_lock);
+
+	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+		if (!kvm_memslots_empty(__kvm_memslots(kvm, i)))
+			return false;
+	}
+
+	return true;
+}
+
 static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
 					   struct kvm_enable_cap *cap)
 {
@@ -4588,6 +4608,29 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
 			return -EINVAL;
 
 		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
+	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP: {
+		int r = -EINVAL;
+
+		if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
+		    !kvm->dirty_ring_size)
+			return r;
+
+		mutex_lock(&kvm->slots_lock);
+
+		/*
+		 * For simplicity, allow enabling ring+bitmap if and only if
+		 * there are no memslots, e.g. to ensure all memslots allocate
+		 * a bitmap after the capability is enabled.
+		 */
+		if (kvm_are_all_memslots_empty(kvm)) {
+			kvm->dirty_ring_with_bitmap = true;
+			r = 0;
+		}
+
+		mutex_unlock(&kvm->slots_lock);
+
+		return r;
+	}
 	default:
 		return kvm_vm_ioctl_enable_cap(kvm, cap);
 	}
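A rough userspace sketch of the collection ordering described in the api.rst
hunk above: the backup bitmap is gathered as the very last step, after the
arm64 ITS table save that may dirty memory without a running vCPU. The file
descriptors, slot number and helper name are placeholders, and per-vCPU dirty
ring harvesting is assumed to have already been completed:

#include <linux/kvm.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Sketch of a VMM's final migration pass: trigger the ITS table save, then
 * collect the backup bitmap once the dirty state is final.
 */
static int final_dirty_collection(int vm_fd, int its_dev_fd, __u32 slot,
				  void *bitmap)
{
	struct kvm_device_attr attr;
	struct kvm_dirty_log log;

	/* 1. Save the ITS tables via the "kvm-arm-vgic-its" device. */
	memset(&attr, 0, sizeof(attr));
	attr.group = KVM_DEV_ARM_VGIC_GRP_CTRL;
	attr.attr = KVM_DEV_ARM_ITS_SAVE_TABLES;
	if (ioctl(its_dev_fd, KVM_SET_DEVICE_ATTR, &attr))
		return -1;

	/* 2. Collect the backup bitmap last, with no further dirtying ioctls after it. */
	memset(&log, 0, sizeof(log));
	log.slot = slot;
	log.dirty_bitmap = bitmap;
	return ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);
}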