
[v3,2/3] KVM: Documentation: Update kvm_run structure for dirty quota

Message ID 20220306220849.215358-3-shivam.kumar1@nutanix.com (mailing list archive)
State New, archived
Series KVM: Dirty quota-based throttling

Commit Message

Shivam Kumar March 6, 2022, 10:08 p.m. UTC
Update the kvm_run structure with a brief description of dirty
quota members and how dirty quota throttling works.

Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
---
 Documentation/virt/kvm/api.rst | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

Comments

Sean Christopherson March 31, 2022, 12:40 a.m. UTC | #1
On Sun, Mar 06, 2022, Shivam Kumar wrote:
> Update the kvm_run structure with a brief description of dirty
> quota members and how dirty quota throttling works.

This should be squashed with patch 1.  I actually had to look ahead to this patch
because I forgot the details since I last reviewed this :-)

> Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
> Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
> Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
> Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
> Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
> ---
>  Documentation/virt/kvm/api.rst | 28 ++++++++++++++++++++++++++++
>  1 file changed, 28 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 9f3172376ec3..50e001473b1f 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6125,6 +6125,23 @@ array field represents return values. The userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
>  
> +::
> +
> +		/* KVM_EXIT_DIRTY_QUOTA_EXHAUSTED */
> +		struct {
> +			__u64 count;
> +			__u64 quota;
> +		} dirty_quota_exit;
> +If exit reason is KVM_EXIT_DIRTY_QUOTA_EXHAUSTED, it indicates that the VCPU has
> +exhausted its dirty quota. The 'dirty_quota_exit' member of kvm_run structure
> +makes the following information available to the userspace:
> +	'count' field: the current count of pages dirtied by the VCPU,
> +	'quota' field: the observed dirty quota just before the exit to userspace.
> +The userspace can design a strategy to allocate the overall scope of dirtying
> +for the VM among the vcpus. Based on the strategy and the current state of dirty
> +quota throttling, the userspace can make a decision to either update (increase)
> +the quota or to put the VCPU to sleep for some time.
> +
>  ::
>  
>  		/* Fix the size of the union. */
> @@ -6159,6 +6176,17 @@ values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
>  
>  ::
>  
> +	/*
> +	 * Number of pages the vCPU is allowed to have dirtied over its entire
> +	 * lifetime.  KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if the quota
> +	 * is reached/exceeded.
> +	 */
> +	__u64 dirty_quota;
> +Please note that this quota cannot be strictly enforced if PML is enabled, and
> +the VCPU may end up dirtying pages more than its quota. The difference however
> +is bounded by the PML buffer size.

If you want to be pedantic, I doubt KVM can strictly enforce the quota even if PML
is disabled.  E.g. I can all but guarantee that it's possible to dirty multiple
pages during a single exit.  Probably also worth spelling out PML and genericizing
things.  Maybe

  Please note that enforcing the quota is best effort, as the guest may dirty
  multiple pages before KVM can recheck the quota.  However, unless KVM is using
  a hardware-based dirty ring buffer, e.g. Intel's Page Modification Logging,
  KVM will detect quota exhaustion within a handful of dirtied pages.  If a
  hardware ring buffer is used, the overrun is bounded by the size of the buffer
  (512 entries for PML).
Shivam Kumar March 31, 2022, 7:30 a.m. UTC | #2
On 31/03/22 6:10 am, Sean Christopherson wrote:
> On Sun, Mar 06, 2022, Shivam Kumar wrote:
>> Update the kvm_run structure with a brief description of dirty
>> quota members and how dirty quota throttling works.
> This should be squashed with patch 1.  I actually had to look ahead to this patch
> because I forgot the details since I last reviewed this :-)
Ack. Thanks.
>> +	__u64 dirty_quota;
>> +Please note that this quota cannot be strictly enforced if PML is enabled, and
>> +the VCPU may end up dirtying pages more than its quota. The difference however
>> +is bounded by the PML buffer size.
> If you want to be pedantic, I doubt KVM can strictly enforce the quota even if PML
> is disabled.  E.g. I can all but guarantee that it's possible to dirty multiple
> pages during a single exit.  Probably also worth spelling out PML and genericizing
> things.  Maybe
>
>    Please note that enforcing the quota is best effort, as the guest may dirty
>    multiple pages before KVM can recheck the quota.  However, unless KVM is using
>    a hardware-based dirty ring buffer, e.g. Intel's Page Modification Logging,
>    KVM will detect quota exhaustion within a handful of dirtied pages.  If a
>    hardware ring buffer is used, the overrun is bounded by the size of the buffer
>    (512 entries for PML).
Thank you for the blurb. Looks good to me, though I'm curious about the
exits that can dirty multiple pages.

Thank you so much for the review, Sean. I'm hoping I could get a one-pass
review from you on the third patch in this series (selftests for dirty quota).
Sean Christopherson March 31, 2022, 3:24 p.m. UTC | #3
On Thu, Mar 31, 2022, Shivam Kumar wrote:
> 
> On 31/03/22 6:10 am, Sean Christopherson wrote:
> > On Sun, Mar 06, 2022, Shivam Kumar wrote:
> > > Update the kvm_run structure with a brief description of dirty
> > > quota members and how dirty quota throttling works.
> > This should be squashed with patch 1.  I actually had to look ahead to this patch
> > because I forgot the details since I last reviewed this :-)
> Ack. Thanks.
> > > +	__u64 dirty_quota;
> > > +Please note that this quota cannot be strictly enforced if PML is enabled, and
> > > +the VCPU may end up dirtying pages more than its quota. The difference however
> > > +is bounded by the PML buffer size.
> > If you want to be pedantic, I doubt KVM can strictly enforce the quota even if PML
> > is disabled.  E.g. I can all but guarantee that it's possible to dirty multiple
> > pages during a single exit.  Probably also worth spelling out PML and genericizing
> > things.  Maybe
> > 
> >    Please note that enforcing the quota is best effort, as the guest may dirty
> >    multiple pages before KVM can recheck the quota.  However, unless KVM is using
> >    a hardware-based dirty ring buffer, e.g. Intel's Page Modification Logging,
> >    KVM will detect quota exhaustion within a handful of dirtied pages.  If a
> >    hardware ring buffer is used, the overrun is bounded by the size of the buffer
> >    (512 entries for PML).
> Thank you for the blurb. Looks good to me, though I'm curious about the exits
> that can dirty multiple pages.

Anything that touches multiple pages.  nested_mark_vmcs12_pages_dirty() is an
easy example.  Emulating L2 with nested TDP.  An emulated instruction that splits
a page.  I'm pretty sure FNAME(sync_page) could dirty an entire page's worth of
SPTEs, and that's waaay too deep to bail from.
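
As a purely hypothetical guest-side sketch of the page-split case (not from
the series), a single store that straddles a 4KiB boundary dirties two pages
within one emulated instruction, before KVM can recheck the quota:

	#include <stdint.h>
	#include <string.h>

	/* Two guest pages, back to back. */
	static char buf[2 * 4096] __attribute__((aligned(4096)));

	void dirty_two_pages_with_one_store(void)
	{
		uint64_t v = 0xdeadbeefcafef00dULL;

		/*
		 * An 8-byte store starting 4 bytes before the page boundary
		 * touches both pages, so both are dirtied in a single exit.
		 */
		memcpy(&buf[4096 - 4], &v, sizeof(v));
	}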

Oof, looking at sync_page(), that's a bug in patch 1.  make_spte() guards the call
to mark_page_dirty_in_slot() with kvm_slot_dirty_track_enabled(), which means it
won't honor the dirty quota unless dirty logging is enabled.  Probably not an issue
for the intended use case, but it'll result in wrong stats, and technically the
dirty quota can be enabled without dirty logging being enabled.

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 4739b53c9734..df0349be388b 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -182,7 +182,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
                  "spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
                  get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));

-       if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
+       if (spte & PT_WRITABLE_MASK) {
                /* Enforced by kvm_mmu_hugepage_adjust. */
                WARN_ON(level > PG_LEVEL_4K);
                mark_page_dirty_in_slot(vcpu->kvm, slot, gfn);


And thinking more about silly edge cases, VMX's big emulation loop for invalid
guest state when unrestricted guest is disabled should probably explicitly check
the dirty quota.  Again, I doubt it matters to anyone's use case, but since it is
treated as a full run loop for things like pending signals, it'd be good to be
consistent.

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 84a7500cd80c..5e1ae373634c 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5511,6 +5511,9 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
                 */
                if (__xfer_to_guest_mode_work_pending())
                        return 1;
+
+               if (!kvm_vcpu_check_dirty_quota(vcpu))
+                       return 0;
        }

        return 1;
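
For reference, a minimal sketch of what kvm_vcpu_check_dirty_quota() is
assumed to do here, inferred from the exit description earlier in the thread
(the real helper lives in patch 1 and may differ, e.g. in which stat it reads):

	/* Sketch only -- inferred from this thread, not copied from patch 1. */
	static inline bool kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)
	{
		struct kvm_run *run = vcpu->run;
		u64 dirty_quota = READ_ONCE(run->dirty_quota);
		u64 pages_dirtied = vcpu->stat.generic.pages_dirtied; /* assumed name */

		/* A quota of zero is assumed to mean throttling is disabled. */
		if (!dirty_quota || pages_dirtied < dirty_quota)
			return true;

		/* Quota reached/exceeded: fill in the exit info, stop the vCPU. */
		run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
		run->dirty_quota_exit.count = pages_dirtied;
		run->dirty_quota_exit.quota = dirty_quota;
		return false;
	}
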
Sean Christopherson April 1, 2022, 1:49 p.m. UTC | #4
On Thu, Mar 31, 2022, Sean Christopherson wrote:
> Oof, looking at sync_page(), that's a bug in patch 1.  make_spte() guards the call
> to mark_page_dirty_in_slot() with kvm_slot_dirty_track_enabled(), which means it
> won't honor the dirty quota unless dirty logging is enabled.  Probably not an issue
> for the intended use case, but it'll result in wrong stats, and technically the
> dirty quota can be enabled without dirty logging being enabled.
> 
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 4739b53c9734..df0349be388b 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -182,7 +182,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>                   "spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
>                   get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));
> 
> -       if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
> +       if (spte & PT_WRITABLE_MASK) {
>                 /* Enforced by kvm_mmu_hugepage_adjust. */
>                 WARN_ON(level > PG_LEVEL_4K);
>                 mark_page_dirty_in_slot(vcpu->kvm, slot, gfn);

This pseudopatch is buggy, the WARN_ON() will obviously fire.  Easiest thing would
be to move the condition into the WARN_ON.

		WARN_ON(level > PG_LEVEL_4K && kvm_slot_dirty_track_enabled(slot));

That brings up another thing that's worth documenting: the dirty_count will be
skewed based on the size of the pages accessed by each vCPU.  I still think having
the stat always count will be useful though.
Shivam Kumar April 6, 2022, 12:39 p.m. UTC | #5
On 01/04/22 7:19 pm, Sean Christopherson wrote:
> On Thu, Mar 31, 2022, Sean Christopherson wrote:
>> Oof, looking at sync_page(), that's a bug in patch 1.  make_spte() guards the call
>> to mark_page_dirty_in_slot() with kvm_slot_dirty_track_enabled(), which means it
>> won't honor the dirty quota unless dirty logging is enabled.  Probably not an issue
>> for the intended use case, but it'll result in wrong stats, and technically the
>> dirty quota can be enabled without dirty logging being enabled.
>>
>> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
>> index 4739b53c9734..df0349be388b 100644
>> --- a/arch/x86/kvm/mmu/spte.c
>> +++ b/arch/x86/kvm/mmu/spte.c
>> @@ -182,7 +182,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>>                    "spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
>>                    get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));
>>
>> -       if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
>> +       if (spte & PT_WRITABLE_MASK) {
>>                  /* Enforced by kvm_mmu_hugepage_adjust. */
>>                  WARN_ON(level > PG_LEVEL_4K);
>>                  mark_page_dirty_in_slot(vcpu->kvm, slot, gfn);
> This pseudopatch is buggy, the WARN_ON() will obviously fire.  Easiest thing would
> be to move the condition into the WARN_ON.
>
> 		WARN_ON(level > PG_LEVEL_4K && kvm_slot_dirty_track_enabled(slot));

Thank you. Will add this.

>
> That brings up another thing that's worth documenting: the dirty_count will be
> skewed based on the size of the pages accessed by each vCPU.  I still think having
> the stat always count will be useful though.
I thought it was obvious :)
Shivam Kumar April 6, 2022, 12:44 p.m. UTC | #6
On 06/04/22 6:09 pm, Shivam Kumar wrote:
>
> On 01/04/22 7:19 pm, Sean Christopherson wrote:
>> On Thu, Mar 31, 2022, Sean Christopherson wrote:
>>> Oof, looking at sync_page(), that's a bug in patch 1.  make_spte()
>>> guards the call [...]
>>
>> That brings up another thing that's worth documenting: the dirty_count
>> will be skewed based on the size of the pages accessed by each vCPU.  I
>> still think having the stat always count will be useful though.
> I thought it was obvious :)
Hi Sean, I'm grateful for your feedback. I wish you could post some feedback
on the selftests too (third patch in this patchset).

Patch

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 9f3172376ec3..50e001473b1f 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6125,6 +6125,23 @@  array field represents return values. The userspace should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.
 
+::
+
+		/* KVM_EXIT_DIRTY_QUOTA_EXHAUSTED */
+		struct {
+			__u64 count;
+			__u64 quota;
+		} dirty_quota_exit;
+If exit reason is KVM_EXIT_DIRTY_QUOTA_EXHAUSTED, it indicates that the VCPU has
+exhausted its dirty quota. The 'dirty_quota_exit' member of kvm_run structure
+makes the following information available to the userspace:
+	'count' field: the current count of pages dirtied by the VCPU,
+	'quota' field: the observed dirty quota just before the exit to userspace.
+The userspace can design a strategy to allocate the overall scope of dirtying
+for the VM among the vcpus. Based on the strategy and the current state of dirty
+quota throttling, the userspace can make a decision to either update (increase)
+the quota or to put the VCPU to sleep for some time.
+
 ::
 
 		/* Fix the size of the union. */
@@ -6159,6 +6176,17 @@  values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
 
 ::
 
+	/*
+	 * Number of pages the vCPU is allowed to have dirtied over its entire
+	 * lifetime.  KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if the quota
+	 * is reached/exceeded.
+	 */
+	__u64 dirty_quota;
+Please note that this quota cannot be strictly enforced if PML is enabled, and
+the VCPU may end up dirtying pages more than its quota. The difference however
+is bounded by the PML buffer size.
+
+::
   };
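
As a closing illustration, a hypothetical userspace sketch of the throttling
loop the new text describes, assuming this series is applied
(KVM_EXIT_DIRTY_QUOTA_EXHAUSTED and the dirty_quota fields are not in mainline
headers); next_quota() is a placeholder for the VMM's allocation strategy:

	#include <stdint.h>
	#include <time.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Placeholder: the VMM's policy for a vCPU's next quota. */
	extern uint64_t next_quota(uint64_t count, uint64_t quota);

	/* 'run' is the vCPU's mmap'd kvm_run structure. */
	void vcpu_run_loop(int vcpu_fd, struct kvm_run *run)
	{
		for (;;) {
			if (ioctl(vcpu_fd, KVM_RUN, 0) < 0)
				return;

			switch (run->exit_reason) {
			case KVM_EXIT_DIRTY_QUOTA_EXHAUSTED: {
				uint64_t count = run->dirty_quota_exit.count;
				uint64_t quota = run->dirty_quota_exit.quota;
				uint64_t next = next_quota(count, quota);

				if (next <= count) {
					/* No budget left: put the vCPU to sleep. */
					struct timespec ts = { 0, 10 * 1000 * 1000 };
					nanosleep(&ts, NULL);
					next = count + 1; /* small grant on wakeup */
				}
				run->dirty_quota = next;
				break;
			}
			default:
				/* ... handle the other exit reasons ... */
				break;
			}
		}
	}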