Message ID | 20221031003621.164306-5-gshan@redhat.com (mailing list archive)
---|---
State | New, archived
Series | KVM: arm64: Enable ring-based dirty memory tracking
On Mon, Oct 31, 2022 at 08:36:16AM +0800, Gavin Shan wrote:
> ARM64 needs to dirty memory outside of a VCPU context when VGIC/ITS is
> enabled. This conflicts with the fact that ring-based dirty page
> tracking always requires a running VCPU context.
>
> Introduce a new flavor of dirty ring that requires the use of both VCPU
> dirty rings and a dirty bitmap. The expectation is that for non-VCPU
> sources of dirty memory (such as the VGIC/ITS on arm64), KVM writes to
> the dirty bitmap. Userspace should scan the dirty bitmap before migrating
> the VM to the target.
>
> Use an additional capability to advertise this behavior. The newly added
> capability (KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP) can't be enabled before
> KVM_CAP_DIRTY_LOG_RING_ACQ_REL on ARM64. In this way, the newly added
> capability is treated as an extension of KVM_CAP_DIRTY_LOG_RING_ACQ_REL.
>
> Suggested-by: Marc Zyngier <maz@kernel.org>
> Suggested-by: Peter Xu <peterx@redhat.com>
> Co-developed-by: Oliver Upton <oliver.upton@linux.dev>
> Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
> Signed-off-by: Gavin Shan <gshan@redhat.com>

Acked-by: Peter Xu <peterx@redhat.com>
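As an illustration of the enabling order the commit message describes, userspace
would exercise the two capabilities roughly as follows. This is a sketch rather
than code from the series: the file descriptors, the ring size, and the error
handling are assumptions.

	/* Sketch: enable the vCPU dirty rings first, then the bitmap flavor.
	 * kvm_fd is /dev/kvm, vm_fd is the VM file descriptor (both assumed). */
	static int enable_dirty_ring_with_bitmap(int kvm_fd, int vm_fd)
	{
		struct kvm_enable_cap cap = { 0 };

		/* Both capabilities must be advertised before they can be used. */
		if (!ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_DIRTY_LOG_RING_ACQ_REL) ||
		    !ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP))
			return -1;

		cap.cap = KVM_CAP_DIRTY_LOG_RING_ACQ_REL;
		cap.args[0] = 65536;		/* per-vCPU ring size in bytes, assumed */
		if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap))
			return -1;

		/* Only valid once the rings have been enabled. */
		cap.cap = KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP;
		cap.args[0] = 1;
		return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
	}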
On Mon, Oct 31, 2022 at 08:36:16AM +0800, Gavin Shan wrote:
> ARM64 needs to dirty memory outside of a VCPU context when VGIC/ITS is
> enabled. This conflicts with the fact that ring-based dirty page
> tracking always requires a running VCPU context.
>
> [...]
>
> Use an additional capability to advertise this behavior. The newly added
> capability (KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP) can't be enabled before
> KVM_CAP_DIRTY_LOG_RING_ACQ_REL on ARM64. In this way, the newly added
> capability is treated as an extension of KVM_CAP_DIRTY_LOG_RING_ACQ_REL.

Whatever ordering requirements we settle on between these capabilities
need to be documented as well.

[...]

> @@ -4588,6 +4594,13 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>  			return -EINVAL;
>  
>  		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
> +	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
> +		if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
> +		    !kvm->dirty_ring_size)

I believe this ordering requirement is problematic, as it piles on top
of an existing problem w.r.t. KVM_CAP_DIRTY_LOG_RING v. memslot
creation.

Example:
 - Enable KVM_CAP_DIRTY_LOG_RING
 - Create some memslots w/ dirty logging enabled (note that the bitmap
   is _not_ allocated)
 - Enable KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
 - Save ITS tables and get a NULL dereference in
   mark_page_dirty_in_slot():

	if (vcpu && kvm->dirty_ring_size)
		kvm_dirty_ring_push(&vcpu->dirty_ring,
				    slot, rel_gfn);
	else
------->	set_bit_le(rel_gfn, memslot->dirty_bitmap);

Similarly, KVM may unnecessarily allocate bitmaps if dirty logging is
enabled on memslots before KVM_CAP_DIRTY_LOG_RING is enabled.

You could paper over this issue by disallowing DIRTY_RING_WITH_BITMAP if
DIRTY_LOG_RING has already been enabled, but the better approach would
be to explicitly check kvm_memslots_empty() such that the real
dependency is obvious. Peter, hadn't you mentioned something about
checking against memslots in an earlier revision?

--
Thanks,
Oliver
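Spelled out as userspace calls, the failing sequence above would look roughly
like the following sketch (illustrative only; vm_fd, host_mem, and the sizes
are assumptions, and error checking is elided):

	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_DIRTY_LOG_RING_ACQ_REL,
		.args = { 65536 },
	};
	struct kvm_userspace_memory_region region = {
		.slot = 0,
		.flags = KVM_MEM_LOG_DIRTY_PAGES,	/* dirty logging requested */
		.guest_phys_addr = 0,
		.memory_size = 0x10000,
		.userspace_addr = (unsigned long)host_mem,
	};

	ioctl(vm_fd, KVM_ENABLE_CAP, &cap);			/* ring enabled first... */
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);	/* ...so no bitmap is allocated */

	cap.cap = KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP;		/* enabled after the memslot exists */
	cap.args[0] = 1;
	ioctl(vm_fd, KVM_ENABLE_CAP, &cap);

	/* Saving the ITS tables now dirties memory without a running vCPU
	 * and hits the NULL memslot->dirty_bitmap in mark_page_dirty_in_slot(). */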
Hi Oliver,

On 11/4/22 7:32 AM, Oliver Upton wrote:
> On Mon, Oct 31, 2022 at 08:36:16AM +0800, Gavin Shan wrote:
>> ARM64 needs to dirty memory outside of a VCPU context when VGIC/ITS is
>> enabled. This conflicts with the fact that ring-based dirty page
>> tracking always requires a running VCPU context.
>>
>> [...]
>
> Whatever ordering requirements we settle on between these capabilities
> need to be documented as well.
>
> [...]

It's mentioned in 'Documentation/virt/kvm/api.rst' as below.

  After using the dirty rings, the userspace needs to detect the capability
  of KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP to see whether the ring structures
  need to be backed by per-slot bitmaps. With this capability advertised
  and supported, it means the architecture can dirty guest pages without
  vcpu/ring context, so that some of the dirty information will still be
  maintained in the bitmap structure.

The description may not be obvious about the ordering. For this, I can
add the following sentence at the end of the section.

  The capability of KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP can't be enabled
  until the capability of KVM_CAP_DIRTY_LOG_RING_ACQ_REL has been enabled.

>> @@ -4588,6 +4594,13 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>>  			return -EINVAL;
>>  
>>  		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
>> +	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
>> +		if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
>> +		    !kvm->dirty_ring_size)
>
> I believe this ordering requirement is problematic, as it piles on top
> of an existing problem w.r.t. KVM_CAP_DIRTY_LOG_RING v. memslot
> creation.
>
> [...]
>
> You could paper over this issue by disallowing DIRTY_RING_WITH_BITMAP if
> DIRTY_LOG_RING has already been enabled, but the better approach would
> be to explicitly check kvm_memslots_empty() such that the real
> dependency is obvious. Peter, hadn't you mentioned something about
> checking against memslots in an earlier revision?

The userspace (QEMU) needs to ensure that no dirty bitmap is created
before the capability of KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP is enabled.
QEMU doesn't know yet whether vgic/its will be used at the point where
KVM_CAP_DIRTY_LOG_RING_ACQ_REL is enabled.

kvm_initialization
  enable_KVM_CAP_DIRTY_LOG_RING_ACQ_REL       // Where KVM_CAP_DIRTY_LOG_RING is enabled
board_initialization                          // Where QEMU knows if vgic/its is used
add_memory_slots
kvm_post_initialization
  enable_KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
  :
start_migration
  enable_dirty_page_tracking
  create_dirty_bitmap                         // With KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP enabled

Thanks,
Gavin
On Fri, Nov 04, 2022 at 08:12:21AM +0800, Gavin Shan wrote:
> Hi Oliver,
>
> On 11/4/22 7:32 AM, Oliver Upton wrote:
> > [...]
>
> The userspace (QEMU) needs to ensure that no dirty bitmap is created
> before the capability of KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP is enabled.
> QEMU doesn't know yet whether vgic/its will be used at the point where
> KVM_CAP_DIRTY_LOG_RING_ACQ_REL is enabled.

I'm not worried about what QEMU (or any particular VMM for that matter)
does with the UAPI. The problem is this patch provides a trivial way for
userspace to cause a NULL dereference in the kernel. Imposing ordering
between the cap and memslot creation avoids the problem altogether.

So, looking at your example:

> kvm_initialization
>   enable_KVM_CAP_DIRTY_LOG_RING_ACQ_REL       // Where KVM_CAP_DIRTY_LOG_RING is enabled
> board_initialization                          // Where QEMU knows if vgic/its is used

Is it possible that QEMU could hoist enabling RING_WITH_BITMAP here?
Based on your description QEMU has decided to use the vGIC ITS but
hasn't yet added any memslots.

> add_memory_slots
> kvm_post_initialization
>   enable_KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
>   :
> start_migration
>   enable_dirty_page_tracking
>   create_dirty_bitmap                         // With KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP enabled

Just to make sure we're on the same page, there are two issues:

  (1) If DIRTY_LOG_RING is enabled before memslot creation and
      RING_WITH_BITMAP is enabled after memslots have been created w/
      dirty logging enabled, memslot->dirty_bitmap == NULL and the
      kernel will fault when attempting to save the ITS tables.

  (2) Not your code, but a similar issue. If DIRTY_LOG_RING[_ACQ_REL] is
      enabled after memslots have been created w/ dirty logging enabled,
      memslot->dirty_bitmap != NULL and that memory is wasted until the
      memslot is freed.

I don't expect you to fix #2, though I've mentioned it because using the
same approach to #1 and #2 would be nice.

--
Thanks,
Oliver
Hi Oliver,

On 11/4/22 9:06 AM, Oliver Upton wrote:
> On Fri, Nov 04, 2022 at 08:12:21AM +0800, Gavin Shan wrote:
>> On 11/4/22 7:32 AM, Oliver Upton wrote:
>> [...]
>
> I'm not worried about what QEMU (or any particular VMM for that matter)
> does with the UAPI. The problem is this patch provides a trivial way for
> userspace to cause a NULL dereference in the kernel. Imposing ordering
> between the cap and memslot creation avoids the problem altogether.
>
> So, looking at your example:
>
>> kvm_initialization
>>   enable_KVM_CAP_DIRTY_LOG_RING_ACQ_REL       // Where KVM_CAP_DIRTY_LOG_RING is enabled
>> board_initialization                          // Where QEMU knows if vgic/its is used
>
> Is it possible that QEMU could hoist enabling RING_WITH_BITMAP here?
> Based on your description QEMU has decided to use the vGIC ITS but
> hasn't yet added any memslots.

It's possible to add an ARM-specific helper, kvm_arm_enable_dirty_ring_with_bitmap(),
under qemu/target/arm, to enable RING_WITH_BITMAP if needed. The newly added
function can be called here when vgic/its is used.

>> add_memory_slots
>> kvm_post_initialization
>>   enable_KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
>>   :
>> start_migration
>>   enable_dirty_page_tracking
>>   create_dirty_bitmap                         // With KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP enabled
>
> Just to make sure we're on the same page, there are two issues:
>
>   (1) If DIRTY_LOG_RING is enabled before memslot creation and
>       RING_WITH_BITMAP is enabled after memslots have been created w/
>       dirty logging enabled, memslot->dirty_bitmap == NULL and the
>       kernel will fault when attempting to save the ITS tables.
>
>   (2) Not your code, but a similar issue. If DIRTY_LOG_RING[_ACQ_REL] is
>       enabled after memslots have been created w/ dirty logging enabled,
>       memslot->dirty_bitmap != NULL and that memory is wasted until the
>       memslot is freed.
>
> I don't expect you to fix #2, though I've mentioned it because using the
> same approach to #1 and #2 would be nice.

Yes, I got your points. Case (2) can still happen for userspace other than
QEMU, but QEMU always enables DIRTY_LOG_RING[_ACQ_REL] before any memory
slot is created. I agree that we need to ensure there are no memory slots
when DIRTY_LOG_RING[_ACQ_REL] is enabled.

For case (1), we can ensure RING_WITH_BITMAP is enabled before any memory
slot is added, as below. QEMU needs a new helper (as above) to enable it
at the board level.

Let's fix both with a new helper in PATCH[v8 4/9], like below?

static inline bool kvm_vm_has_memslot_pages(struct kvm *kvm)
{
	bool has_memslot_pages;

	mutex_lock(&kvm->slots_lock);
	has_memslot_pages = !!kvm->nr_memslot_pages;
	mutex_unlock(&kvm->slots_lock);

	return has_memslot_pages;
}

static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
					   struct kvm_enable_cap *cap)
{
	:
	switch (cap->cap) {
	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
		if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
		    !kvm->dirty_ring_size || kvm_vm_has_memslot_pages(kvm))
			return -EINVAL;

		kvm->dirty_ring_with_bitmap = true;
		return 0;
	default:
		return kvm_vm_ioctl_enable_cap(kvm, cap);
	}
}

static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
{
	:
	/* We only allow it to set once */
	if (kvm->dirty_ring_size)
		return -EINVAL;

	if (kvm_vm_has_memslot_pages(kvm))
		return -EINVAL;
	:
}

Thanks,
Gavin
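For reference, a sketch of what such a QEMU-side helper might look like; the
function is hypothetical, and the exact kvm_check_extension()/kvm_vm_enable_cap()
calling conventions should be verified against the QEMU tree:

	/* Hypothetical helper for QEMU's target/arm code, as discussed above. */
	int kvm_arm_enable_dirty_ring_with_bitmap(KVMState *s)
	{
		if (!kvm_check_extension(s, KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP)) {
			return -ENOSYS;
		}

		/* args[0] = 1: back the vCPU rings with per-slot bitmaps. */
		return kvm_vm_enable_cap(s, KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP, 0, 1);
	}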
On Fri, Nov 04, 2022 at 02:57:15PM +0800, Gavin Shan wrote:
> On 11/4/22 9:06 AM, Oliver Upton wrote:

[...]

> > Just to make sure we're on the same page, there are two issues:
> >
> >   (1) If DIRTY_LOG_RING is enabled before memslot creation and
> >       RING_WITH_BITMAP is enabled after memslots have been created w/
> >       dirty logging enabled, memslot->dirty_bitmap == NULL and the
> >       kernel will fault when attempting to save the ITS tables.
> >
> >   (2) Not your code, but a similar issue. If DIRTY_LOG_RING[_ACQ_REL] is
> >       enabled after memslots have been created w/ dirty logging enabled,
> >       memslot->dirty_bitmap != NULL and that memory is wasted until the
> >       memslot is freed.
> >
> > I don't expect you to fix #2, though I've mentioned it because using the
> > same approach to #1 and #2 would be nice.
>
> Yes, I got your points. Case (2) can still happen for userspace other than
> QEMU, but QEMU always enables DIRTY_LOG_RING[_ACQ_REL] before any memory
> slot is created. I agree that we need to ensure there are no memory slots
> when DIRTY_LOG_RING[_ACQ_REL] is enabled.
>
> For case (1), we can ensure RING_WITH_BITMAP is enabled before any memory
> slot is added, as below. QEMU needs a new helper (as above) to enable it
> at the board level.
>
> Let's fix both with a new helper in PATCH[v8 4/9], like below?

I agree that we should address (1) like this, but in (2) requiring that
no memslots were created before enabling the existing capabilities would
be a change in ABI. If we can get away with that, great, but otherwise
we may need to delete the bitmaps associated with all memslots when the
cap is enabled.

> static inline bool kvm_vm_has_memslot_pages(struct kvm *kvm)
> {
> 	bool has_memslot_pages;
>
> 	mutex_lock(&kvm->slots_lock);
> 	has_memslot_pages = !!kvm->nr_memslot_pages;
> 	mutex_unlock(&kvm->slots_lock);
>
> 	return has_memslot_pages;
> }

Do we need to build another helper for this? kvm_memslots_empty() will
tell you whether or not a memslot has been created by checking the gfn
tree.

On top of that, the memslot check and setting
kvm->dirty_ring_with_bitmap must happen behind the slots_lock. Otherwise
you could still wind up creating memslots w/o bitmaps.

Something like:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 91cf51a25394..420cc101a16e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4588,6 +4588,32 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
 			return -EINVAL;
 
 		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
+
+	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP: {
+		struct kvm_memslots *slots;
+		int r = -EINVAL;
+
+		if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
+		    !kvm->dirty_ring_size)
+			return r;
+
+		mutex_lock(&kvm->slots_lock);
+
+		slots = kvm_memslots(kvm);
+
+		/*
+		 * Avoid a race between memslot creation and enabling the ring +
+		 * bitmap capability to guarantee that no memslots have been
+		 * created without a bitmap.
+		 */
+		if (kvm_memslots_empty(slots)) {
+			kvm->dirty_ring_with_bitmap = cap->args[0];
+			r = 0;
+		}
+
+		mutex_unlock(&kvm->slots_lock);
+		return r;
+	}
 	default:
 		return kvm_vm_ioctl_enable_cap(kvm, cap);
 	}

--
Thanks,
Oliver
Hi Oliver,

On 11/5/22 4:12 AM, Oliver Upton wrote:
> On Fri, Nov 04, 2022 at 02:57:15PM +0800, Gavin Shan wrote:
>> On 11/4/22 9:06 AM, Oliver Upton wrote:
>
> [...]
>
> I agree that we should address (1) like this, but in (2) requiring that
> no memslots were created before enabling the existing capabilities would
> be a change in ABI. If we can get away with that, great, but otherwise
> we may need to delete the bitmaps associated with all memslots when the
> cap is enabled.

I had assumed QEMU and kvm/selftests are the only consumers of DIRTY_RING.
In that case, requiring that no memslots were created before enabling
DIRTY_RING won't break userspace.

Following your thoughts, the dirty pages tracked in the bitmap would also
need to be synchronized to the per-vcpu rings before the bitmap could be
destroyed, and there is no vcpu/ring context at that point.

>> static inline bool kvm_vm_has_memslot_pages(struct kvm *kvm)
>> {
>> 	bool has_memslot_pages;
>>
>> 	mutex_lock(&kvm->slots_lock);
>> 	has_memslot_pages = !!kvm->nr_memslot_pages;
>> 	mutex_unlock(&kvm->slots_lock);
>>
>> 	return has_memslot_pages;
>> }
>
> Do we need to build another helper for this? kvm_memslots_empty() will
> tell you whether or not a memslot has been created by checking the gfn
> tree.

The helper was introduced to be shared when DIRTY_RING[_ACQ_REL] and
DIRTY_RING_WITH_BITMAP are enabled. Since issue (2) isn't a concern for
us, let's put it aside; the helper isn't needed. kvm_memslots_empty()
has the same effect as checking 'kvm->nr_memslot_pages', so it's fine
to use kvm_memslots_empty(), which is more generic.

> On top of that, the memslot check and setting
> kvm->dirty_ring_with_bitmap must happen behind the slots_lock. Otherwise
> you could still wind up creating memslots w/o bitmaps.

Agree.

> Something like:
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 91cf51a25394..420cc101a16e 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4588,6 +4588,32 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>  			return -EINVAL;
>  
>  		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
> +
> +	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP: {
> +		struct kvm_memslots *slots;
> +		int r = -EINVAL;
> +
> +		if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
> +		    !kvm->dirty_ring_size)
> +			return r;
> +
> +		mutex_lock(&kvm->slots_lock);
> +
> +		slots = kvm_memslots(kvm);
> +
> +		/*
> +		 * Avoid a race between memslot creation and enabling the ring +
> +		 * bitmap capability to guarantee that no memslots have been
> +		 * created without a bitmap.
> +		 */
> +		if (kvm_memslots_empty(slots)) {
> +			kvm->dirty_ring_with_bitmap = cap->args[0];
> +			r = 0;
> +		}
> +
> +		mutex_unlock(&kvm->slots_lock);
> +		return r;
> +	}
>  	default:
>  		return kvm_vm_ioctl_enable_cap(kvm, cap);
>  	}

The proposed changes look good to me and will be integrated into
PATCH[v8 4/9]. By the way, v8 will be posted shortly.

Thanks,
Gavin
On Sat, Nov 05, 2022 at 05:57:33AM +0800, Gavin Shan wrote:
> On 11/5/22 4:12 AM, Oliver Upton wrote:
> > I agree that we should address (1) like this, but in (2) requiring that
> > no memslots were created before enabling the existing capabilities would
> > be a change in ABI. If we can get away with that, great, but otherwise
> > we may need to delete the bitmaps associated with all memslots when the
> > cap is enabled.
>
> I had assumed QEMU and kvm/selftests are the only consumers of DIRTY_RING.
> In that case, requiring that no memslots were created before enabling
> DIRTY_RING won't break userspace.
>
> Following your thoughts, the dirty pages tracked in the bitmap would also
> need to be synchronized to the per-vcpu rings before the bitmap could be
> destroyed, and there is no vcpu/ring context at that point.

Eh, I don't think we'd need to go that far. No matter what, any dirty
bits that were present in the bitmap could never be read again anyway,
as we reject KVM_GET_DIRTY_LOG if kvm->dirty_ring_size != 0.

> > [...]
>
> The proposed changes look good to me and will be integrated into
> PATCH[v8 4/9]. By the way, v8 will be posted shortly.

Excellent, thanks!

--
Best,
Oliver
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index eee9f857a986..4d4eeb5c3c5a 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8003,13 +8003,6 @@ flushing is done by the KVM_GET_DIRTY_LOG ioctl). To achieve that, one
 needs to kick the vcpu out of KVM_RUN using a signal. The resulting
 vmexit ensures that all dirty GFNs are flushed to the dirty rings.
 
-NOTE: the capability KVM_CAP_DIRTY_LOG_RING and the corresponding
-ioctl KVM_RESET_DIRTY_RINGS are mutual exclusive to the existing ioctls
-KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG. After enabling
-KVM_CAP_DIRTY_LOG_RING with an acceptable dirty ring size, the virtual
-machine will switch to ring-buffer dirty page tracking and further
-KVM_GET_DIRTY_LOG or KVM_CLEAR_DIRTY_LOG ioctls will fail.
-
 NOTE: KVM_CAP_DIRTY_LOG_RING_ACQ_REL is the only capability that
 should be exposed by weakly ordered architecture, in order to indicate
 the additional memory ordering requirements imposed on userspace when
@@ -8018,6 +8011,30 @@ Architecture with TSO-like ordering (such as x86) are allowed to
 expose both KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL
 to userspace.
 
+After using the dirty rings, the userspace needs to detect the capability
+of KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP to see whether the ring structures
+need to be backed by per-slot bitmaps. With this capability advertised
+and supported, it means the architecture can dirty guest pages without
+vcpu/ring context, so that some of the dirty information will still be
+maintained in the bitmap structure.
+
+Note that the bitmap here is only a backup of the ring structure, and
+normally should only contain a very small amount of dirty pages, which
+needs to be transferred during VM downtime. Collecting the dirty bitmap
+should be the very last thing that the VMM does before transmitting state
+to the target VM. VMM needs to ensure that the dirty state is final and
+avoid missing dirty pages from another ioctl ordered after the bitmap
+collection.
+
+To collect dirty bits in the backup bitmap, the userspace can use the
+same KVM_GET_DIRTY_LOG ioctl. KVM_CLEAR_DIRTY_LOG shouldn't be needed
+and its behavior is undefined since collecting the dirty bitmap always
+happens in the last phase of VM's migration.
+
+NOTE: One example of using the backup bitmap is saving arm64 vgic/its
+tables through KVM_DEV_ARM_{VGIC_GRP_CTRL, ITS_SAVE_TABLES} command on
+KVM device "kvm-arm-vgic-its" during VM's migration.
+
 8.30 KVM_CAP_XEN_HVM
 --------------------
 
diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
index 04290eda0852..b08b9afd8bdb 100644
--- a/include/linux/kvm_dirty_ring.h
+++ b/include/linux/kvm_dirty_ring.h
@@ -37,6 +37,11 @@ static inline u32 kvm_dirty_ring_get_rsvd_entries(void)
 	return 0;
 }
 
+static inline bool kvm_use_dirty_bitmap(struct kvm *kvm)
+{
+	return true;
+}
+
 static inline int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring,
 				       int index, u32 size)
 {
@@ -66,6 +71,7 @@ static inline void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
 #else /* CONFIG_HAVE_KVM_DIRTY_RING */
 
 int kvm_cpu_dirty_log_size(void);
+bool kvm_use_dirty_bitmap(struct kvm *kvm);
 u32 kvm_dirty_ring_get_rsvd_entries(void);
 int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size);
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 648d663f32c4..db83f63f4e61 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -779,6 +779,7 @@ struct kvm {
 	pid_t userspace_pid;
 	unsigned int max_halt_poll_ns;
 	u32 dirty_ring_size;
+	bool dirty_ring_with_bitmap;
 	bool vm_bugged;
 	bool vm_dead;
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 0d5d4419139a..c87b5882d7ae 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1178,6 +1178,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_S390_ZPCI_OP 221
 #define KVM_CAP_S390_CPU_TOPOLOGY 222
 #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
+#define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 224
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 800f9470e36b..228be1145cf3 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -33,6 +33,14 @@ config HAVE_KVM_DIRTY_RING_ACQ_REL
 	bool
 	select HAVE_KVM_DIRTY_RING
 
+# Only architectures that need to dirty memory outside of a vCPU
+# context should select this, advertising to userspace the
+# requirement to use a dirty bitmap in addition to the vCPU dirty
+# ring.
+config HAVE_KVM_DIRTY_RING_WITH_BITMAP
+	bool
+	depends on HAVE_KVM_DIRTY_RING
+
 config HAVE_KVM_EVENTFD
 	bool
 	select EVENTFD
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index 6091e1403bc8..7ce6a5f81c98 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -21,6 +21,11 @@ u32 kvm_dirty_ring_get_rsvd_entries(void)
 	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
 }
 
+bool kvm_use_dirty_bitmap(struct kvm *kvm)
+{
+	return !kvm->dirty_ring_size || kvm->dirty_ring_with_bitmap;
+}
+
 static u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
 {
 	return READ_ONCE(ring->dirty_index) - READ_ONCE(ring->reset_index);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 91cf51a25394..0351c8fb41b9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1617,7 +1617,7 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
 			new->dirty_bitmap = NULL;
 		else if (old && old->dirty_bitmap)
 			new->dirty_bitmap = old->dirty_bitmap;
-		else if (!kvm->dirty_ring_size) {
+		else if (kvm_use_dirty_bitmap(kvm)) {
 			r = kvm_alloc_dirty_bitmap(new);
 			if (r)
 				return r;
@@ -2060,8 +2060,8 @@ int kvm_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log,
 	unsigned long n;
 	unsigned long any = 0;
 
-	/* Dirty ring tracking is exclusive to dirty log tracking */
-	if (kvm->dirty_ring_size)
+	/* Dirty ring tracking may be exclusive to dirty log tracking */
+	if (!kvm_use_dirty_bitmap(kvm))
 		return -ENXIO;
 
 	*memslot = NULL;
@@ -2125,8 +2125,8 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
 	unsigned long *dirty_bitmap_buffer;
 	bool flush;
 
-	/* Dirty ring tracking is exclusive to dirty log tracking */
-	if (kvm->dirty_ring_size)
+	/* Dirty ring tracking may be exclusive to dirty log tracking */
+	if (!kvm_use_dirty_bitmap(kvm))
 		return -ENXIO;
 
 	as_id = log->slot >> 16;
@@ -2237,8 +2237,8 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
 	unsigned long *dirty_bitmap_buffer;
 	bool flush;
 
-	/* Dirty ring tracking is exclusive to dirty log tracking */
-	if (kvm->dirty_ring_size)
+	/* Dirty ring tracking may be exclusive to dirty log tracking */
+	if (!kvm_use_dirty_bitmap(kvm))
 		return -ENXIO;
 
 	as_id = log->slot >> 16;
@@ -3305,7 +3305,10 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
 	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
 #ifdef CONFIG_HAVE_KVM_DIRTY_RING
-	if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm))
+	if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm))
+		return;
+
+	if (WARN_ON_ONCE(!kvm->dirty_ring_with_bitmap && !vcpu))
 		return;
 #endif
 
@@ -3313,7 +3316,7 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
 		u32 slot = (memslot->as_id << 16) | memslot->id;
 
-		if (kvm->dirty_ring_size)
+		if (kvm->dirty_ring_size && vcpu)
 			kvm_dirty_ring_push(vcpu, slot, rel_gfn);
 		else
 			set_bit_le(rel_gfn, memslot->dirty_bitmap);
@@ -4482,6 +4485,9 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 		return KVM_DIRTY_RING_MAX_ENTRIES * sizeof(struct kvm_dirty_gfn);
 #else
 		return 0;
+#endif
+#ifdef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
+	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
 #endif
 	case KVM_CAP_BINARY_STATS_FD:
 	case KVM_CAP_SYSTEM_EVENT_DATA:
@@ -4588,6 +4594,13 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
 			return -EINVAL;
 
 		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
+	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
+		if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
+		    !kvm->dirty_ring_size)
+			return -EINVAL;
+
+		kvm->dirty_ring_with_bitmap = true;
+		return 0;
 	default:
 		return kvm_vm_ioctl_enable_cap(kvm, cap);
 	}
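To illustrate the flow the documentation hunk above describes, collecting the
backup bitmap in the final phase of migration could look roughly like this from
userspace. This is a sketch rather than code from the series; the slot id,
region size, page size, and send_dirty_pages() are assumptions.

	/* Drain the backup bitmap for one slot at the end of migration,
	 * after all vCPU rings have been harvested and reset. */
	unsigned long npages = memory_size / 4096;	/* page size assumed */
	unsigned long *bitmap = calloc((npages + 63) / 64, sizeof(unsigned long));
	struct kvm_dirty_log log = {
		.slot = slot_id,			/* (as_id << 16) | slot id */
		.dirty_bitmap = bitmap,
	};

	/* Only succeeds once KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP is enabled;
	 * with rings alone, KVM rejects KVM_GET_DIRTY_LOG with ENXIO. */
	if (ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log) == 0)
		send_dirty_pages(bitmap, npages);	/* hypothetical sender */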