| Message ID | 20220915101049.187325-2-shivam.kumar1@nutanix.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | KVM: Dirty quota-based throttling |
On 15.09.22 12:10, Shivam Kumar wrote:
> Define variables to track and throttle memory dirtying for every vcpu.
>
> dirty_count: Number of pages the vcpu has dirtied since its creation,
>              while dirty logging is enabled.
> dirty_quota: Number of pages the vcpu is allowed to dirty. To dirty
>              more, it needs to request more quota by exiting to
>              userspace.
>
> Implement the flow for throttling based on dirty quota.
>
> i) Increment dirty_count for the vcpu whenever it dirties a page.
> ii) Exit to userspace whenever the dirty quota is exhausted (i.e. dirty
>     count equals/exceeds dirty quota) to request more dirty quota.
>
> Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
> Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
> Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
> Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
> Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
[...]

I am wondering if this will work on s390. On s390 we only call
mark_page_dirty_in_slot for the kvm_read/write functions but not
for those done by the guest on fault. We do account those lazily in
kvm_arch_sync_dirty_log (like x86 in the past).

> +
> void mark_page_dirty_in_slot(struct kvm *kvm,
>                              const struct kvm_memory_slot *memslot,
>                              gfn_t gfn)
> {
>         struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
>
> -#ifdef CONFIG_HAVE_KVM_DIRTY_RING
>         if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm))
>                 return;
> -#endif

This will trigger on s390 for the interrupt payload written from vhost, if I
recall correctly. Please do not enable this warning.

>
> -       if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
> +       if (!memslot)
> +               return;
> +
> +       WARN_ON_ONCE(!vcpu->stat.generic.pages_dirtied++);
> +
> +       if (kvm_slot_dirty_track_enabled(memslot)) {
>                 unsigned long rel_gfn = gfn - memslot->base_gfn;
>                 u32 slot = (memslot->as_id << 16) | memslot->id;
>
> @@ -3318,6 +3336,8 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
>                                             slot, rel_gfn);
>                 else
>                         set_bit_le(rel_gfn, memslot->dirty_bitmap);
> +
> +               kvm_vcpu_is_dirty_quota_exhausted(vcpu);
> }
> }
> EXPORT_SYMBOL_GPL(mark_page_dirty_in_slot);
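[Editor's note] Since much of the thread revolves around where the accounting happens, here is a compact user-space model of the flow the patch adds to mark_page_dirty_in_slot(). It is purely illustrative: the struct and function names are invented, and in the kernel the counter lives in vcpu->stat.generic, the quota in vcpu->run, and the "exit" is KVM_REQ_DIRTY_QUOTA_EXIT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* User-space model of the patch's per-vCPU state (names are invented). */
struct vcpu_model {
    uint64_t pages_dirtied;  /* lifetime count of dirtied pages */
    uint64_t dirty_quota;    /* 0 means no quota, i.e. unlimited */
    bool exit_requested;     /* stands in for KVM_REQ_DIRTY_QUOTA_EXIT */
};

/* Quota is exhausted once a non-zero quota is reached or exceeded. */
static bool dirty_quota_exhausted(const struct vcpu_model *v)
{
    return v->dirty_quota && v->pages_dirtied >= v->dirty_quota;
}

/*
 * Models the reworked mark_page_dirty_in_slot(): every dirtied page is
 * counted, but the quota check only fires when dirty tracking is enabled
 * on the memslot.
 */
static void mark_page_dirty_model(struct vcpu_model *v, bool slot_tracked)
{
    v->pages_dirtied++;
    if (slot_tracked && dirty_quota_exhausted(v))
        v->exit_requested = true;
}
```

The model makes the two properties under discussion easy to see: a quota of 0 never throttles, and pages dirtied in untracked slots are counted but never trigger an exit.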
On 15.09.22 15:21, Christian Borntraeger wrote:
>
>
> On 15.09.22 12:10, Shivam Kumar wrote:
>> Define variables to track and throttle memory dirtying for every vcpu.
>>
>> dirty_count: Number of pages the vcpu has dirtied since its creation,
>>              while dirty logging is enabled.
>> dirty_quota: Number of pages the vcpu is allowed to dirty. To dirty
>>              more, it needs to request more quota by exiting to
>>              userspace.
>>
>> Implement the flow for throttling based on dirty quota.
>>
>> i) Increment dirty_count for the vcpu whenever it dirties a page.
>> ii) Exit to userspace whenever the dirty quota is exhausted (i.e. dirty
>>     count equals/exceeds dirty quota) to request more dirty quota.
>>
>> Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
>> Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
>> Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
>> Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
>> Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
> [...]
>
> I am wondering if this will work on s390. On s390 we only call
> mark_page_dirty_in_slot for the kvm_read/write functions but not
> for those done by the guest on fault. We do account those lazily in
> kvm_arch_sync_dirty_log (like x86 in the past).

I think we need to rework the page fault handling on s390 to actually make
use of this. This has to happen anyway at some point (as indicated by the
guest enter/exit rework from Mark). Right now we handle KVM page faults
directly in the normal system fault handler. It seems we need to make a side
turn into KVM for page faults on guests in the long run. Until this is done,
the dirty logging is really not per CPU on s390, so we need to defer this
feature for now.

CC other s390 maintainers.

The other use case for a page fault handler rework was Mark Rutland's
"kvm: fix latent guest entry/exit bugs" series, which did not work on s390.
On Thu, Sep 15, 2022, Christian Borntraeger wrote:
> On 15.09.22 15:21, Christian Borntraeger wrote:
> > I am wondering if this will work on s390. On s390 we only call
> > mark_page_dirty_in_slot for the kvm_read/write functions but not
> > for those done by the guest on fault. We do account those lazily in
> > kvm_arch_sync_dirty_log (like x86 in the past).
> >
>
> I think we need to rework the page fault handling on s390 to actually make
> use of this. This has to happen anyway at some point (as indicated by the
> guest enter/exit rework from Mark). Right now we handle KVM page faults
> directly in the normal system fault handler. It seems we need to make a side
> turn into KVM for page faults on guests in the long run.

s390's page table management came up in conversation at KVM Forum in the context
of a common MMU (or MMUs)[*]. If/when we get an MMU that supports multiple
architectures, that would be a perfect opportunity for s390 to switch. I don't
know that it'll be feasible to make a common MMU support s390 from the get-go,
but at the very least we can keep y'all in the loop and not make it unnecessarily
difficult to support s390.

[*] https://kvmforum2022.sched.com/event/15jJk
On Thu, Sep 15, 2022, Shivam Kumar wrote:
> @@ -542,6 +545,21 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
> 	return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE);
> }
>
> +static inline int kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_run *run = vcpu->run;
> +	u64 dirty_quota = READ_ONCE(run->dirty_quota);
> +	u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;
> +
> +	if (!dirty_quota || (pages_dirtied < dirty_quota))
> +		return 1;
> +
> +	run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
> +	run->dirty_quota_exit.count = pages_dirtied;
> +	run->dirty_quota_exit.quota = dirty_quota;
> +	return 0;

Dead code.

> +}

...

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 584a5bab3af3..f315af50037d 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3298,18 +3298,36 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
> }
> EXPORT_SYMBOL_GPL(kvm_clear_guest);
>
> +static void kvm_vcpu_is_dirty_quota_exhausted(struct kvm_vcpu *vcpu)

Ouch, sorry. I suspect you got this name from me[*]. That was a goof on my end,
I'm 99% certain I copy-pasted stale code, i.e. didn't intend to suggest a
rename.

Let's keep kvm_vcpu_check_dirty_quota(), IMO that's still the least awful name.

[*] https://lore.kernel.org/all/Yo+82LjHSOdyxKzT@google.com

> +{
> +	u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
> +
> +	if (!dirty_quota || (vcpu->stat.generic.pages_dirtied < dirty_quota))
> +		return;
> +
> +	/*
> +	 * Snapshot the quota to report it to userspace. The dirty count will be
> +	 * captured when the request is processed.
> +	 */
> +	vcpu->dirty_quota = dirty_quota;
> +	kvm_make_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu);

Making the request needs to be guarded with an arch opt-in. Pending requests
prevent KVM from entering the guest, and so making a request that an arch isn't
aware of will effectively hang the vCPU.
Obviously userspace would be shooting itself in the foot by setting
run->dirty_quota in this case, but KVM shouldn't hand userspace a loaded gun
and help them aim.

My suggestion from v1[*] about not forcing architectures to opt-in was in the
context of a request-less implementation where dirty_quota was a nop until the
arch took action.

And regardless of arch opt-in, I think this needs a capability so that userspace
can detect KVM support.

I don't see any reason to wrap the request or vcpu_run field, e.g. something like
this should suffice:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ea5847d22aff..93362441215b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3298,8 +3298,9 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
 }
 EXPORT_SYMBOL_GPL(kvm_clear_guest);
 
-static void kvm_vcpu_is_dirty_quota_exhausted(struct kvm_vcpu *vcpu)
+static void kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)
 {
+#ifdef CONFIG_HAVE_KVM_DIRTY_QUOTA
 	u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
 
 	if (!dirty_quota || (vcpu->stat.generic.pages_dirtied < dirty_quota))
@@ -3311,6 +3312,7 @@ static void kvm_vcpu_is_dirty_quota_exhausted(struct kvm_vcpu *vcpu)
 	 */
 	vcpu->dirty_quota = dirty_quota;
 	kvm_make_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu);
+#endif
 }
 
 void mark_page_dirty_in_slot(struct kvm *kvm,
@@ -4507,6 +4509,8 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 	case KVM_CAP_BINARY_STATS_FD:
 	case KVM_CAP_SYSTEM_EVENT_DATA:
 		return 1;
+	case KVM_CAP_DIRTY_QUOTA:
+		return !!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_QUOTA);
 	default:
 		break;
 	}

[*] https://lore.kernel.org/all/YZaUENi0ZyQi%2F9M0@google.com
On Fri, Oct 07, 2022, Sean Christopherson wrote:
> On Thu, Sep 15, 2022, Shivam Kumar wrote:
> Let's keep kvm_vcpu_check_dirty_quota(), IMO that's still the least awful name.
>
> [*] https://lore.kernel.org/all/Yo+82LjHSOdyxKzT@google.com

Actually, I take that back. The code snippet itself is also flawed. If userspace
increases the quota (or disables it entirely) between KVM snapshotting the quota
and making the request, then there's no need for KVM to exit to userspace.

So I think this can be:

  static bool kvm_vcpu_is_dirty_quota_exhausted(struct kvm_vcpu *vcpu)
  {
  #ifdef CONFIG_HAVE_KVM_DIRTY_QUOTA
  	u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);

  	return dirty_quota && (vcpu->stat.generic.pages_dirtied >= dirty_quota);
  #else
  	return false;
  #endif
  }

and the usage becomes:

  	if (kvm_vcpu_is_dirty_quota_exhausted(vcpu))
  		kvm_make_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu);

More thoughts in the x86 patch.
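[Editor's note] The check-then-request shape suggested here, together with the point that a quota userspace raises in the meantime should cancel the exit, can be modeled in plain user-space C. This is a sketch of the intended semantics with invented names, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* User-space stand-ins for vcpu->run->dirty_quota and the exit request. */
struct vcpu_model {
    uint64_t pages_dirtied;
    uint64_t dirty_quota;   /* may be rewritten by "userspace" at any time */
    bool exit_request;      /* stands in for KVM_REQ_DIRTY_QUOTA_EXIT */
};

static bool dirty_quota_exhausted(const struct vcpu_model *v)
{
    return v->dirty_quota && v->pages_dirtied >= v->dirty_quota;
}

/* Dirty path: arm the request when the quota looks exhausted. */
static void on_page_dirtied(struct vcpu_model *v)
{
    v->pages_dirtied++;
    if (dirty_quota_exhausted(v))
        v->exit_request = true;
}

/*
 * Request processing: re-evaluate against the live quota, so a concurrent
 * increase by userspace cancels the exit instead of forcing a spurious one.
 * Returns true if the vCPU should exit to userspace.
 */
static bool process_requests(struct vcpu_model *v)
{
    if (!v->exit_request)
        return false;
    v->exit_request = false;
    return dirty_quota_exhausted(v);
}
```

The key behavior is that the predicate is evaluated twice: once when the request is armed and once when it is processed, which is what makes a raced quota update harmless.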
On 07/10/22 11:48 pm, Sean Christopherson wrote:
> On Thu, Sep 15, 2022, Christian Borntraeger wrote:
>> On 15.09.22 15:21, Christian Borntraeger wrote:
>>> I am wondering if this will work on s390. On s390 we only call
>>> mark_page_dirty_in_slot for the kvm_read/write functions but not
>>> for those done by the guest on fault. We do account those lazily in
>>> kvm_arch_sync_dirty_log (like x86 in the past).
>>>
>>
>> I think we need to rework the page fault handling on s390 to actually make
>> use of this. This has to happen anyway at some point (as indicated by the
>> guest enter/exit rework from Mark). Right now we handle KVM page faults
>> directly in the normal system fault handler. It seems we need to make a side
>> turn into KVM for page faults on guests in the long run.
>
> s390's page table management came up in conversation at KVM Forum in the context
> of a common MMU (or MMUs)[*]. If/when we get an MMU that supports multiple
> architectures, that would be a perfect opportunity for s390 to switch. I don't
> know that it'll be feasible to make a common MMU support s390 from the get-go,
> but at the very least we can keep y'all in the loop and not make it unnecessarily
> difficult to support s390.
>
> [*] https://kvmforum2022.sched.com/event/15jJk

Thank you Sean for the reference to the KVM Forum talk. Should I skip the
support for s390 (patch no. 4) for the next patchset?
On 08/10/22 12:50 am, Sean Christopherson wrote:
> On Fri, Oct 07, 2022, Sean Christopherson wrote:
>> On Thu, Sep 15, 2022, Shivam Kumar wrote:
>> Let's keep kvm_vcpu_check_dirty_quota(), IMO that's still the least awful name.
>>
>> [*] https://lore.kernel.org/all/Yo+82LjHSOdyxKzT@google.com
>
> Actually, I take that back. The code snippet itself is also flawed. If userspace
> increases the quota (or disables it entirely) between KVM snapshotting the quota
> and making the request, then there's no need for KVM to exit to userspace.
>
> So I think this can be:
>
> static bool kvm_vcpu_is_dirty_quota_exhausted(struct kvm_vcpu *vcpu)
> {
> #ifdef CONFIG_HAVE_KVM_DIRTY_QUOTA
> 	u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
>
> 	return dirty_quota && (vcpu->stat.generic.pages_dirtied >= dirty_quota);
> #else
> 	return false;
> #endif
> }
>
> and the usage becomes:
>
> 	if (kvm_vcpu_is_dirty_quota_exhausted(vcpu))
> 		kvm_make_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu);
>
> More thoughts in the x86 patch.

Snapshotting is not a requirement for now anyway. We have plans to lazily
update the quota, i.e. only when it needs to do more dirtying. This helps us
prevent overthrottling of the VM due to skewed cases where some vcpus are
mostly reading and the others are mostly writing. So, this looks good from
every angle.

The name kvm_vcpu_is_dirty_quota_exhausted, though verbose, looks good to me.
On 08/10/22 12:38 am, Sean Christopherson wrote:
> On Thu, Sep 15, 2022, Shivam Kumar wrote:
>> @@ -542,6 +545,21 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
>> 	return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE);
>> }
>>
>> +static inline int kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm_run *run = vcpu->run;
>> +	u64 dirty_quota = READ_ONCE(run->dirty_quota);
>> +	u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;
>> +
>> +	if (!dirty_quota || (pages_dirtied < dirty_quota))
>> +		return 1;
>> +
>> +	run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
>> +	run->dirty_quota_exit.count = pages_dirtied;
>> +	run->dirty_quota_exit.quota = dirty_quota;
>> +	return 0;
>
> Dead code.
>
>> +}
>
> ...
>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 584a5bab3af3..f315af50037d 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -3298,18 +3298,36 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
>> }
>> EXPORT_SYMBOL_GPL(kvm_clear_guest);
>>
>> +static void kvm_vcpu_is_dirty_quota_exhausted(struct kvm_vcpu *vcpu)
>
> Ouch, sorry. I suspect you got this name from me[*]. That was a goof on my end,
> I'm 99% certain I copy-pasted stale code, i.e. didn't intend to suggest a
> rename.
>
> Let's keep kvm_vcpu_check_dirty_quota(), IMO that's still the least awful name.
>
> [*] https://lore.kernel.org/all/Yo+82LjHSOdyxKzT@google.com
>
>> +{
>> +	u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
>> +
>> +	if (!dirty_quota || (vcpu->stat.generic.pages_dirtied < dirty_quota))
>> +		return;
>> +
>> +	/*
>> +	 * Snapshot the quota to report it to userspace. The dirty count will be
>> +	 * captured when the request is processed.
>> +	 */
>> +	vcpu->dirty_quota = dirty_quota;
>> +	kvm_make_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu);
>
> Making the request needs to be guarded with an arch opt-in. Pending requests
> prevent KVM from entering the guest, and so making a request that an arch isn't
> aware of will effectively hang the vCPU. Obviously userspace would be shooting
> itself in the foot by setting run->dirty_quota in this case, but KVM shouldn't
> hand userspace a loaded gun and help them aim.
>
> My suggestion from v1[*] about not forcing architectures to opt-in was in the
> context of a request-less implementation where dirty_quota was a nop until the
> arch took action.
>
> And regardless of arch opt-in, I think this needs a capability so that userspace
> can detect KVM support.
>
> I don't see any reason to wrap the request or vcpu_run field, e.g. something like
> this should suffice:
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ea5847d22aff..93362441215b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3298,8 +3298,9 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
> }
> EXPORT_SYMBOL_GPL(kvm_clear_guest);
>
> -static void kvm_vcpu_is_dirty_quota_exhausted(struct kvm_vcpu *vcpu)
> +static void kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)
> {
> +#ifdef CONFIG_HAVE_KVM_DIRTY_QUOTA
> 	u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
>
> 	if (!dirty_quota || (vcpu->stat.generic.pages_dirtied < dirty_quota))
> @@ -3311,6 +3312,7 @@ static void kvm_vcpu_is_dirty_quota_exhausted(struct kvm_vcpu *vcpu)
> 	 */
> 	vcpu->dirty_quota = dirty_quota;
> 	kvm_make_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu);
> +#endif
> }
>
> void mark_page_dirty_in_slot(struct kvm *kvm,
> @@ -4507,6 +4509,8 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> 	case KVM_CAP_BINARY_STATS_FD:
> 	case KVM_CAP_SYSTEM_EVENT_DATA:
> 		return 1;
> +	case KVM_CAP_DIRTY_QUOTA:
> +		return !!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_QUOTA);
> 	default:
> 		break;
> 	}
>
> [*] https://lore.kernel.org/all/YZaUENi0ZyQi%2F9M0@google.com

Will add. Thank you so much for the comments.
On 15/09/22 3:40 pm, Shivam Kumar wrote: > Define variables to track and throttle memory dirtying for every vcpu. > > dirty_count: Number of pages the vcpu has dirtied since its creation, > while dirty logging is enabled. > dirty_quota: Number of pages the vcpu is allowed to dirty. To dirty > more, it needs to request more quota by exiting to > userspace. > > Implement the flow for throttling based on dirty quota. > > i) Increment dirty_count for the vcpu whenever it dirties a page. > ii) Exit to userspace whenever the dirty quota is exhausted (i.e. dirty > count equals/exceeds dirty quota) to request more dirty quota. > > Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com> > Suggested-by: Manish Mishra <manish.mishra@nutanix.com> > Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com> > Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com> > Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com> > --- > Documentation/virt/kvm/api.rst | 35 ++++++++++++++++++++++++++++++++++ > include/linux/kvm_host.h | 20 ++++++++++++++++++- > include/linux/kvm_types.h | 1 + > include/uapi/linux/kvm.h | 12 ++++++++++++ > virt/kvm/kvm_main.c | 26 ++++++++++++++++++++++--- > 5 files changed, 90 insertions(+), 4 deletions(-) > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst > index abd7c32126ce..97030a6a35b4 100644 > --- a/Documentation/virt/kvm/api.rst > +++ b/Documentation/virt/kvm/api.rst > @@ -6614,6 +6614,26 @@ array field represents return values. The userspace should update the return > values of SBI call before resuming the VCPU. For more details on RISC-V SBI > spec refer, https://github.com/riscv/riscv-sbi-doc. > > +:: > + > + /* KVM_EXIT_DIRTY_QUOTA_EXHAUSTED */ > + struct { > + __u64 count; > + __u64 quota; > + } dirty_quota_exit; > + > +If exit reason is KVM_EXIT_DIRTY_QUOTA_EXHAUSTED, it indicates that the VCPU has > +exhausted its dirty quota. 
The 'dirty_quota_exit' member of kvm_run structure > +makes the following information available to the userspace: > + count: the current count of pages dirtied by the VCPU, can be > + skewed based on the size of the pages accessed by each vCPU. > + quota: the observed dirty quota just before the exit to userspace. > + > +The userspace can design a strategy to allocate the overall scope of dirtying > +for the VM among the vcpus. Based on the strategy and the current state of dirty > +quota throttling, the userspace can make a decision to either update (increase) > +the quota or to put the VCPU to sleep for some time. > + > :: > > /* KVM_EXIT_NOTIFY */ > @@ -6668,6 +6688,21 @@ values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set. > > :: > > + /* > + * Number of pages the vCPU is allowed to have dirtied over its entire > + * lifetime. KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if the quota > + * is reached/exceeded. > + */ > + __u64 dirty_quota; > + > +Please note that enforcing the quota is best effort, as the guest may dirty > +multiple pages before KVM can recheck the quota. However, unless KVM is using > +a hardware-based dirty ring buffer, e.g. Intel's Page Modification Logging, > +KVM will detect quota exhaustion within a handful of dirtied pages. If a > +hardware ring buffer is used, the overrun is bounded by the size of the buffer > +(512 entries for PML). > + > +:: > }; > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index f4519d3689e1..9acb28635d94 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -151,12 +151,13 @@ static inline bool is_error_page(struct page *page) > #define KVM_REQUEST_NO_ACTION BIT(10) > /* > * Architecture-independent vcpu->requests bit members > - * Bits 4-7 are reserved for more arch-independent bits. > + * Bits 5-7 are reserved for more arch-independent bits. 
> */ > #define KVM_REQ_TLB_FLUSH (0 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP) > #define KVM_REQ_VM_DEAD (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP) > #define KVM_REQ_UNBLOCK 2 > #define KVM_REQ_UNHALT 3 > +#define KVM_REQ_DIRTY_QUOTA_EXIT 4 > #define KVM_REQUEST_ARCH_BASE 8 > > /* > @@ -380,6 +381,8 @@ struct kvm_vcpu { > */ > struct kvm_memory_slot *last_used_slot; > u64 last_used_slot_gen; > + > + u64 dirty_quota; > }; > > /* > @@ -542,6 +545,21 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu) > return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE); > } > > +static inline int kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu) > +{ > + struct kvm_run *run = vcpu->run; > + u64 dirty_quota = READ_ONCE(run->dirty_quota); > + u64 pages_dirtied = vcpu->stat.generic.pages_dirtied; > + > + if (!dirty_quota || (pages_dirtied < dirty_quota)) > + return 1; > + > + run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED; > + run->dirty_quota_exit.count = pages_dirtied; > + run->dirty_quota_exit.quota = dirty_quota; > + return 0; > +} > + > /* > * Some of the bitops functions do not support too long bitmaps. > * This number must be determined not to exceed such limits. > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h > index 3ca3db020e0e..263a588f3cd3 100644 > --- a/include/linux/kvm_types.h > +++ b/include/linux/kvm_types.h > @@ -118,6 +118,7 @@ struct kvm_vcpu_stat_generic { > u64 halt_poll_fail_hist[HALT_POLL_HIST_COUNT]; > u64 halt_wait_hist[HALT_POLL_HIST_COUNT]; > u64 blocking; > + u64 pages_dirtied; I am reworking the QEMU patches and I am not sure how I can access the pages_dirtied info from the userspace side when the migration starts, i.e. without a dirty quota exit. I need this info to initialise the dirty quota. This is what I am looking to do on the userspace side at the start of dirty quota migration: dirty_quota = pages_dirtied + some initial quota Hoping if you could help, Sean. Thanks in advance. 
> #define KVM_STATS_NAME_SIZE 48 > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h > index eed0315a77a6..4c4a65b0f0a5 100644 > --- a/include/uapi/linux/kvm.h > +++ b/include/uapi/linux/kvm.h > @@ -272,6 +272,7 @@ struct kvm_xen_exit { > #define KVM_EXIT_RISCV_SBI 35 > #define KVM_EXIT_RISCV_CSR 36 > #define KVM_EXIT_NOTIFY 37 > +#define KVM_EXIT_DIRTY_QUOTA_EXHAUSTED 38 > > /* For KVM_EXIT_INTERNAL_ERROR */ > /* Emulate instruction failed. */ > @@ -510,6 +511,11 @@ struct kvm_run { > #define KVM_NOTIFY_CONTEXT_INVALID (1 << 0) > __u32 flags; > } notify; > + /* KVM_EXIT_DIRTY_QUOTA_EXHAUSTED */ > + struct { > + __u64 count; > + __u64 quota; > + } dirty_quota_exit; > /* Fix the size of the union. */ > char padding[256]; > }; > @@ -531,6 +537,12 @@ struct kvm_run { > struct kvm_sync_regs regs; > char padding[SYNC_REGS_SIZE_BYTES]; > } s; > + /* > + * Number of pages the vCPU is allowed to have dirtied over its entire > + * lifetime. KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if the > + * quota is reached/exceeded. > + */ > + __u64 dirty_quota; > }; > > /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */ > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 584a5bab3af3..f315af50037d 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -3298,18 +3298,36 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len) > } > EXPORT_SYMBOL_GPL(kvm_clear_guest); > > +static void kvm_vcpu_is_dirty_quota_exhausted(struct kvm_vcpu *vcpu) > +{ > + u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota); > + > + if (!dirty_quota || (vcpu->stat.generic.pages_dirtied < dirty_quota)) > + return; > + > + /* > + * Snapshot the quota to report it to userspace. The dirty count will be > + * captured when the request is processed. 
> + */ > + vcpu->dirty_quota = dirty_quota; > + kvm_make_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu); > +} > + > void mark_page_dirty_in_slot(struct kvm *kvm, > const struct kvm_memory_slot *memslot, > gfn_t gfn) > { > struct kvm_vcpu *vcpu = kvm_get_running_vcpu(); > > -#ifdef CONFIG_HAVE_KVM_DIRTY_RING > if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm)) > return; > -#endif > > - if (memslot && kvm_slot_dirty_track_enabled(memslot)) { > + if (!memslot) > + return; > + > + WARN_ON_ONCE(!vcpu->stat.generic.pages_dirtied++); > + > + if (kvm_slot_dirty_track_enabled(memslot)) { > unsigned long rel_gfn = gfn - memslot->base_gfn; > u32 slot = (memslot->as_id << 16) | memslot->id; > > @@ -3318,6 +3336,8 @@ void mark_page_dirty_in_slot(struct kvm *kvm, > slot, rel_gfn); > else > set_bit_le(rel_gfn, memslot->dirty_bitmap); > + > + kvm_vcpu_is_dirty_quota_exhausted(vcpu); > } > } > EXPORT_SYMBOL_GPL(mark_page_dirty_in_slot);
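[Editor's note] The quoted API documentation deliberately leaves the quota-allocation policy to userspace: on a KVM_EXIT_DIRTY_QUOTA_EXHAUSTED exit, the VMM can either grant more quota or put the vCPU to sleep. As a purely hypothetical illustration of that decision (none of these names come from the patch, and a real VMM would track budgets per throttling interval), one possible policy helper:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result of handling a dirty-quota exit (invented names). */
struct quota_decision {
    uint64_t new_quota;   /* value to write back to run->dirty_quota */
    uint64_t sleep_ns;    /* 0 means do not sleep, just resume */
};

/*
 * Given the exit's count/quota pair, a per-grant page budget ("slice"),
 * and the pages left in this interval's overall budget, either grant the
 * vCPU another slice of quota or tell the caller to throttle it.
 */
static struct quota_decision handle_dirty_quota_exit(uint64_t count,
                                                     uint64_t quota,
                                                     uint64_t slice,
                                                     uint64_t budget_left)
{
    struct quota_decision d = { .new_quota = quota, .sleep_ns = 0 };

    if (budget_left >= slice)
        /* The quota is lifetime-based, so grant on top of the count. */
        d.new_quota = count + slice;
    else
        d.sleep_ns = 1000 * 1000;  /* arbitrary 1 ms backoff */
    return d;
}
```

Note the grant is computed as count + slice rather than quota + slice, since pages_dirtied may have overrun the old quota slightly before the exit was delivered.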
On 09.10.22 20:36, Shivam Kumar wrote:
>
>
> On 07/10/22 11:48 pm, Sean Christopherson wrote:
>> On Thu, Sep 15, 2022, Christian Borntraeger wrote:
>>> On 15.09.22 15:21, Christian Borntraeger wrote:
>>>> I am wondering if this will work on s390. On s390 we only call
>>>> mark_page_dirty_in_slot for the kvm_read/write functions but not
>>>> for those done by the guest on fault. We do account those lazily in
>>>> kvm_arch_sync_dirty_log (like x86 in the past).
>>>>
>>>
>>> I think we need to rework the page fault handling on s390 to actually make
>>> use of this. This has to happen anyway at some point (as indicated by the
>>> guest enter/exit rework from Mark). Right now we handle KVM page faults
>>> directly in the normal system fault handler. It seems we need to make a side
>>> turn into KVM for page faults on guests in the long run.
>>
>> s390's page table management came up in conversation at KVM Forum in the context
>> of a common MMU (or MMUs)[*]. If/when we get an MMU that supports multiple
>> architectures, that would be a perfect opportunity for s390 to switch. I don't
>> know that it'll be feasible to make a common MMU support s390 from the get-go,
>> but at the very least we can keep y'all in the loop and not make it unnecessarily
>> difficult to support s390.
>>
>> [*] https://kvmforum2022.sched.com/event/15jJk
> Thank you Sean for the reference to the KVM Forum talk. Should I skip the
> support for s390 (patch no. 4) for the next patchset?

yes please.
On Mon, Oct 10, 2022, Shivam Kumar wrote:
>
> On 08/10/22 12:50 am, Sean Christopherson wrote:
> > On Fri, Oct 07, 2022, Sean Christopherson wrote:
> > > On Thu, Sep 15, 2022, Shivam Kumar wrote:
> > > Let's keep kvm_vcpu_check_dirty_quota(), IMO that's still the least awful name.
> > >
> > > [*] https://lore.kernel.org/all/Yo+82LjHSOdyxKzT@google.com
> >
> > Actually, I take that back. The code snippet itself is also flawed. If userspace
> > increases the quota (or disables it entirely) between KVM snapshotting the quota
> > and making the request, then there's no need for KVM to exit to userspace.
> >
> > So I think this can be:
> >
> > static bool kvm_vcpu_is_dirty_quota_exhausted(struct kvm_vcpu *vcpu)
> > {
> > #ifdef CONFIG_HAVE_KVM_DIRTY_QUOTA
> > 	u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
> >
> > 	return dirty_quota && (vcpu->stat.generic.pages_dirtied >= dirty_quota);
> > #else
> > 	return false;
> > #endif
> > }
> >
> > and the usage becomes:
> >
> > 	if (kvm_vcpu_is_dirty_quota_exhausted(vcpu))
> > 		kvm_make_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu);
> >
> > More thoughts in the x86 patch.
>
> Snapshotting is not a requirement for now anyway. We have plans to lazily
> update the quota, i.e. only when it needs to do more dirtying. This helps us
> prevent overthrottling of the VM due to skewed cases where some vcpus are
> mostly reading and the others are mostly writing.

I don't see how snapshotting can ever be a sane requirement. Requiring KVM to
exit if KVM detects an exhausted quota even if userspace changes the quota is
nonsensical, as the resulting behavior is 100% non-deterministic unless
userspace is spying on the number of dirty pages. And if userspace is
constantly polling the number of dirty pages, what's the point of the exit?
Requiring KVM to exit in this case puts extra burden on KVM without any
meaningful benefit.

In other words, we need to consider how KVM's behavior impacts KVM's uABI,
not just what userspace "needs".
On 10/10/22 11:11 am, Shivam Kumar wrote:
>
>
> On 15/09/22 3:40 pm, Shivam Kumar wrote:
>> Define variables to track and throttle memory dirtying for every vcpu.
>>
>> dirty_count: Number of pages the vcpu has dirtied since its creation,
>>              while dirty logging is enabled.
>> dirty_quota: Number of pages the vcpu is allowed to dirty. To dirty
>>              more, it needs to request more quota by exiting to
>>              userspace.
>>
>> Implement the flow for throttling based on dirty quota.
>>
>> i) Increment dirty_count for the vcpu whenever it dirties a page.
>> ii) Exit to userspace whenever the dirty quota is exhausted (i.e. dirty
>>     count equals/exceeds dirty quota) to request more dirty quota.
>>
>> Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
>> Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
>> Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
>> Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
>> Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
>> ---
>>   Documentation/virt/kvm/api.rst | 35 ++++++++++++++++++++++++++++++++++
>>   include/linux/kvm_host.h       | 20 ++++++++++++++++++-
>>   include/linux/kvm_types.h      |  1 +
>>   include/uapi/linux/kvm.h       | 12 ++++++++++++
>>   virt/kvm/kvm_main.c            | 26 ++++++++++++++++++++++---
>>   5 files changed, 90 insertions(+), 4 deletions(-)
>>
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index abd7c32126ce..97030a6a35b4 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -6614,6 +6614,26 @@ array field represents return values. The userspace should update the return
>>   values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>>   spec refer, https://github.com/riscv/riscv-sbi-doc.
>>
>> +::
>> +
>> +		/* KVM_EXIT_DIRTY_QUOTA_EXHAUSTED */
>> +		struct {
>> +			__u64 count;
>> +			__u64 quota;
>> +		} dirty_quota_exit;
>> +
>> +If exit reason is KVM_EXIT_DIRTY_QUOTA_EXHAUSTED, it indicates that the VCPU has
>> +exhausted its dirty quota. The 'dirty_quota_exit' member of kvm_run structure
>> +makes the following information available to the userspace:
>> +	count: the current count of pages dirtied by the VCPU, can be
>> +	       skewed based on the size of the pages accessed by each vCPU.
>> +	quota: the observed dirty quota just before the exit to userspace.
>> +
>> +The userspace can design a strategy to allocate the overall scope of dirtying
>> +for the VM among the vcpus. Based on the strategy and the current state of dirty
>> +quota throttling, the userspace can make a decision to either update (increase)
>> +the quota or to put the VCPU to sleep for some time.
>> +
>>   ::
>>
>>   		/* KVM_EXIT_NOTIFY */
>> @@ -6668,6 +6688,21 @@ values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
>>
>>   ::
>>
>> +	/*
>> +	 * Number of pages the vCPU is allowed to have dirtied over its entire
>> +	 * lifetime.  KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if the quota
>> +	 * is reached/exceeded.
>> +	 */
>> +	__u64 dirty_quota;
>> +
>> +Please note that enforcing the quota is best effort, as the guest may dirty
>> +multiple pages before KVM can recheck the quota. However, unless KVM is using
>> +a hardware-based dirty ring buffer, e.g. Intel's Page Modification Logging,
>> +KVM will detect quota exhaustion within a handful of dirtied pages. If a
>> +hardware ring buffer is used, the overrun is bounded by the size of the buffer
>> +(512 entries for PML).
>> +
>> +::
>>
>>   };
>>
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index f4519d3689e1..9acb28635d94 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -151,12 +151,13 @@ static inline bool is_error_page(struct page *page)
>>   #define KVM_REQUEST_NO_ACTION      BIT(10)
>>   /*
>>    * Architecture-independent vcpu->requests bit members
>> - * Bits 4-7 are reserved for more arch-independent bits.
>> + * Bits 5-7 are reserved for more arch-independent bits.
>>    */
>>   #define KVM_REQ_TLB_FLUSH         (0 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>>   #define KVM_REQ_VM_DEAD           (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>>   #define KVM_REQ_UNBLOCK           2
>>   #define KVM_REQ_UNHALT            3
>> +#define KVM_REQ_DIRTY_QUOTA_EXIT  4
>>   #define KVM_REQUEST_ARCH_BASE     8
>>
>>   /*
>> @@ -380,6 +381,8 @@ struct kvm_vcpu {
>>   	 */
>>   	struct kvm_memory_slot *last_used_slot;
>>   	u64 last_used_slot_gen;
>> +
>> +	u64 dirty_quota;
>>   };
>>
>>   /*
>> @@ -542,6 +545,21 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
>>   	return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE);
>>   }
>>
>> +static inline int kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm_run *run = vcpu->run;
>> +	u64 dirty_quota = READ_ONCE(run->dirty_quota);
>> +	u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;
>> +
>> +	if (!dirty_quota || (pages_dirtied < dirty_quota))
>> +		return 1;
>> +
>> +	run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
>> +	run->dirty_quota_exit.count = pages_dirtied;
>> +	run->dirty_quota_exit.quota = dirty_quota;
>> +	return 0;
>> +}
>> +
>>   /*
>>    * Some of the bitops functions do not support too long bitmaps.
>>    * This number must be determined not to exceed such limits.
>>
>> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
>> index 3ca3db020e0e..263a588f3cd3 100644
>> --- a/include/linux/kvm_types.h
>> +++ b/include/linux/kvm_types.h
>> @@ -118,6 +118,7 @@ struct kvm_vcpu_stat_generic {
>>   	u64 halt_poll_fail_hist[HALT_POLL_HIST_COUNT];
>>   	u64 halt_wait_hist[HALT_POLL_HIST_COUNT];
>>   	u64 blocking;
>> +	u64 pages_dirtied;
> I am reworking the QEMU patches and I am not sure how I can access the
> pages_dirtied info from the userspace side when the migration starts, i.e.
> without a dirty quota exit.
>
> I need this info to initialise the dirty quota. This is what I am looking
> to do on the userspace side at the start of dirty quota migration:
> dirty_quota = pages_dirtied + some initial quota
>
> Hoping you could help, Sean. Thanks in advance.

I think I can set dirty_quota initially to 1 and let the vcpu exit with exit
reason KVM_EXIT_DIRTY_QUOTA_EXHAUSTED. Then, I can set the quota.
On Mon, Oct 17, 2022, Shivam Kumar wrote:
>
> On 10/10/22 11:11 am, Shivam Kumar wrote:
> >
> > On 15/09/22 3:40 pm, Shivam Kumar wrote:
> > > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> > > index 3ca3db020e0e..263a588f3cd3 100644
> > > --- a/include/linux/kvm_types.h
> > > +++ b/include/linux/kvm_types.h
> > > @@ -118,6 +118,7 @@ struct kvm_vcpu_stat_generic {
> > >   	u64 halt_poll_fail_hist[HALT_POLL_HIST_COUNT];
> > >   	u64 halt_wait_hist[HALT_POLL_HIST_COUNT];
> > >   	u64 blocking;
> > > +	u64 pages_dirtied;
> > I am reworking the QEMU patches and I am not sure how I can access the
> > pages_dirtied info from the userspace side when the migration starts, i.e.
> > without a dirty quota exit.
> >
> > I need this info to initialise the dirty quota. This is what I am looking
> > to do on the userspace side at the start of dirty quota migration:
> > dirty_quota = pages_dirtied + some initial quota
> >
> > Hoping you could help, Sean. Thanks in advance.
> I think I can set dirty_quota initially to 1 and let the vcpu exit with exit
> reason KVM_EXIT_DIRTY_QUOTA_EXHAUSTED. Then, I can set the quota.

The vCPU doesn't need to be paused to read stats; pages_dirtied can be read
while the vCPU is running via KVM_GET_STATS_FD.

Though at a glance, QEMU doesn't yet utilize KVM_GET_STATS_FD, i.e. QEMU would
need a decent chunk of infrastructure improvements to avoid the unnecessary
exit.
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index abd7c32126ce..97030a6a35b4 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6614,6 +6614,26 @@ array field represents return values. The userspace should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.
 
+::
+
+		/* KVM_EXIT_DIRTY_QUOTA_EXHAUSTED */
+		struct {
+			__u64 count;
+			__u64 quota;
+		} dirty_quota_exit;
+
+If exit reason is KVM_EXIT_DIRTY_QUOTA_EXHAUSTED, it indicates that the VCPU has
+exhausted its dirty quota. The 'dirty_quota_exit' member of kvm_run structure
+makes the following information available to the userspace:
+	count: the current count of pages dirtied by the VCPU, can be
+	       skewed based on the size of the pages accessed by each vCPU.
+	quota: the observed dirty quota just before the exit to userspace.
+
+The userspace can design a strategy to allocate the overall scope of dirtying
+for the VM among the vcpus. Based on the strategy and the current state of dirty
+quota throttling, the userspace can make a decision to either update (increase)
+the quota or to put the VCPU to sleep for some time.
+
 ::
 
 		/* KVM_EXIT_NOTIFY */
@@ -6668,6 +6688,21 @@ values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
 
 ::
 
+	/*
+	 * Number of pages the vCPU is allowed to have dirtied over its entire
+	 * lifetime.  KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if the quota
+	 * is reached/exceeded.
+	 */
+	__u64 dirty_quota;
+
+Please note that enforcing the quota is best effort, as the guest may dirty
+multiple pages before KVM can recheck the quota. However, unless KVM is using
+a hardware-based dirty ring buffer, e.g. Intel's Page Modification Logging,
+KVM will detect quota exhaustion within a handful of dirtied pages. If a
+hardware ring buffer is used, the overrun is bounded by the size of the buffer
+(512 entries for PML).
+
+::
 
 };
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f4519d3689e1..9acb28635d94 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -151,12 +151,13 @@ static inline bool is_error_page(struct page *page)
 #define KVM_REQUEST_NO_ACTION      BIT(10)
 /*
  * Architecture-independent vcpu->requests bit members
- * Bits 4-7 are reserved for more arch-independent bits.
+ * Bits 5-7 are reserved for more arch-independent bits.
  */
 #define KVM_REQ_TLB_FLUSH         (0 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_VM_DEAD           (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_UNBLOCK           2
 #define KVM_REQ_UNHALT            3
+#define KVM_REQ_DIRTY_QUOTA_EXIT  4
 #define KVM_REQUEST_ARCH_BASE     8
 
 /*
@@ -380,6 +381,8 @@ struct kvm_vcpu {
 	 */
 	struct kvm_memory_slot *last_used_slot;
 	u64 last_used_slot_gen;
+
+	u64 dirty_quota;
 };
 
 /*
@@ -542,6 +545,21 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
 	return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE);
 }
 
+static inline int kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)
+{
+	struct kvm_run *run = vcpu->run;
+	u64 dirty_quota = READ_ONCE(run->dirty_quota);
+	u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;
+
+	if (!dirty_quota || (pages_dirtied < dirty_quota))
+		return 1;
+
+	run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
+	run->dirty_quota_exit.count = pages_dirtied;
+	run->dirty_quota_exit.quota = dirty_quota;
+	return 0;
+}
+
 /*
  * Some of the bitops functions do not support too long bitmaps.
  * This number must be determined not to exceed such limits.
 
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 3ca3db020e0e..263a588f3cd3 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -118,6 +118,7 @@ struct kvm_vcpu_stat_generic {
 	u64 halt_poll_fail_hist[HALT_POLL_HIST_COUNT];
 	u64 halt_wait_hist[HALT_POLL_HIST_COUNT];
 	u64 blocking;
+	u64 pages_dirtied;
 };
 
 #define KVM_STATS_NAME_SIZE	48
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index eed0315a77a6..4c4a65b0f0a5 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -272,6 +272,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_SBI        35
 #define KVM_EXIT_RISCV_CSR        36
 #define KVM_EXIT_NOTIFY           37
+#define KVM_EXIT_DIRTY_QUOTA_EXHAUSTED 38
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -510,6 +511,11 @@ struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
 			__u32 flags;
 		} notify;
+		/* KVM_EXIT_DIRTY_QUOTA_EXHAUSTED */
+		struct {
+			__u64 count;
+			__u64 quota;
+		} dirty_quota_exit;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -531,6 +537,12 @@ struct kvm_run {
 		struct kvm_sync_regs regs;
 		char padding[SYNC_REGS_SIZE_BYTES];
 	} s;
+	/*
+	 * Number of pages the vCPU is allowed to have dirtied over its entire
+	 * lifetime.  KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if the
+	 * quota is reached/exceeded.
+	 */
+	__u64 dirty_quota;
 };
 
 /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 584a5bab3af3..f315af50037d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3298,18 +3298,36 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
 }
 EXPORT_SYMBOL_GPL(kvm_clear_guest);
 
+static void kvm_vcpu_is_dirty_quota_exhausted(struct kvm_vcpu *vcpu)
+{
+	u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
+
+	if (!dirty_quota || (vcpu->stat.generic.pages_dirtied < dirty_quota))
+		return;
+
+	/*
+	 * Snapshot the quota to report it to userspace.  The dirty count will be
+	 * captured when the request is processed.
+	 */
+	vcpu->dirty_quota = dirty_quota;
+	kvm_make_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu);
+}
+
 void mark_page_dirty_in_slot(struct kvm *kvm,
 			     const struct kvm_memory_slot *memslot,
 			     gfn_t gfn)
 {
 	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
-#ifdef CONFIG_HAVE_KVM_DIRTY_RING
 	if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm))
 		return;
-#endif
 
-	if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
+	if (!memslot)
+		return;
+
+	WARN_ON_ONCE(!vcpu->stat.generic.pages_dirtied++);
+
+	if (kvm_slot_dirty_track_enabled(memslot)) {
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
 		u32 slot = (memslot->as_id << 16) | memslot->id;
 
@@ -3318,6 +3336,8 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
 							    slot, rel_gfn);
 		else
 			set_bit_le(rel_gfn, memslot->dirty_bitmap);
+
+		kvm_vcpu_is_dirty_quota_exhausted(vcpu);
 	}
 }
 EXPORT_SYMBOL_GPL(mark_page_dirty_in_slot);