
[v10,2/3] KVM: x86: Dirty quota-based throttling of vcpus

Message ID 20240221195125.102479-3-shivam.kumar1@nutanix.com (mailing list archive)
State New, archived
Series Per-vCPU dirty quota-based throttling

Commit Message

Shivam Kumar Feb. 21, 2024, 7:51 p.m. UTC
Call update_dirty_quota whenever a page is marked dirty, charging the
appropriate arch-specific page size. Process the KVM request
KVM_REQ_DIRTY_QUOTA_EXIT (raised by update_dirty_quota) to exit to
userspace with exit reason KVM_EXIT_DIRTY_QUOTA_EXHAUSTED.
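For readers not following the whole series, the charging mechanism described
above can be modeled roughly as follows. This is an illustrative, self-contained
sketch, not kernel code: the struct, field names, and the boolean flag standing
in for KVM_REQ_DIRTY_QUOTA_EXIT are all simplified inventions, not the actual
uAPI or internal state of the patches.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the per-vCPU state in the series; names are
 * illustrative only. */
struct vcpu_model {
	uint64_t dirty_quota_bytes;  /* budget granted by userspace */
	uint64_t bytes_dirtied;      /* total charged so far */
	bool     quota_exit_pending; /* models KVM_REQ_DIRTY_QUOTA_EXIT */
};

/* Charge 'page_size' bytes against the quota each time a page is marked
 * dirty; once the budget is exhausted, flag a pending exit so the run
 * loop can bounce to userspace with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED. */
static void update_dirty_quota(struct vcpu_model *vcpu, uint64_t page_size)
{
	vcpu->bytes_dirtied += page_size;
	if (vcpu->bytes_dirtied >= vcpu->dirty_quota_bytes)
		vcpu->quota_exit_pending = true;
}
```

The key property, and the source of the review discussion below, is that the
charge is the mapping's page size, not the number of bytes the guest actually
wrote.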

Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
---
 arch/x86/kvm/Kconfig    | 1 +
 arch/x86/kvm/mmu/mmu.c  | 6 +++++-
 arch/x86/kvm/mmu/spte.c | 1 +
 arch/x86/kvm/vmx/vmx.c  | 3 +++
 arch/x86/kvm/x86.c      | 6 +++++-
 5 files changed, 15 insertions(+), 2 deletions(-)

Comments

Sean Christopherson April 16, 2024, 5:44 p.m. UTC | #1
On Wed, Feb 21, 2024, Shivam Kumar wrote:
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 2d6cdeab1f8a..fa0b3853ee31 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3397,8 +3397,12 @@ static bool fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu,
>  	if (!try_cmpxchg64(sptep, &old_spte, new_spte))
>  		return false;
>  
> -	if (is_writable_pte(new_spte) && !is_writable_pte(old_spte))
> +	if (is_writable_pte(new_spte) && !is_writable_pte(old_spte)) {
> +		struct kvm_mmu_page *sp = sptep_to_sp(sptep);
> +
> +		update_dirty_quota(vcpu->kvm, (1L << SPTE_LEVEL_SHIFT(sp->role.level)));
>  		mark_page_dirty_in_slot(vcpu->kvm, fault->slot, fault->gfn);

Forcing KVM to manually call update_dirty_quota() whenever mark_page_dirty_in_slot()
is invoked is not maintainable, as we inevitably will forget to update the quota
and probably not notice.  We've already had bugs escape where KVM fails to mark
gfns dirty, and those flows are much more testable.

Stepping back, I feel like this series has gone off the rails a bit.
 
I understand Marc's objections to the uAPI not differentiating between page sizes,
but simply updating the quota based on KVM's page size is also flawed.  E.g. if
the guest is backed with 1GiB pages, odds are very good that the dirty quotas are
going to be completely out of whack due to the first vCPU that writes a given 1GiB
region being charged with the entire 1GiB page.
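The skew is easy to quantify. The patch charges `1L << SPTE_LEVEL_SHIFT(level)`
bytes per dirtying write; the macro below reimplements that shift for
illustration (standard x86 paging: level 1 = 4KiB, level 2 = 2MiB,
level 3 = 1GiB), it is not the kernel's definition.

```c
#include <stdint.h>

/* Illustrative reimplementation of the per-level charge: each paging
 * level adds 9 bits of index on top of the 4KiB base page. */
#define MODEL_PAGE_SHIFT 12
#define MODEL_LEVEL_BITS 9
#define MODEL_LEVEL_SHIFT(level) \
	(MODEL_PAGE_SHIFT + ((level) - 1) * MODEL_LEVEL_BITS)

/* Bytes charged against the dirty quota for one write at 'level'. */
static uint64_t quota_charge(int level)
{
	return 1ULL << MODEL_LEVEL_SHIFT(level);
}
```

A single write through a 1GiB mapping (level 3) is charged 1073741824 bytes,
262144 times the 4096-byte charge for a 4KiB mapping, so one unlucky vCPU can
burn its entire quota on one write.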

And without a way to trigger detection of writes, e.g. by enabling PML or write-
protecting memory, I don't see how userspace can build anything on the "bytes
dirtied" information.

From v7[*], Marc was specifically objecting to the proposed API effectively being
presented as a general purpose API, but in reality the API was heavily reliant
on dirty logging being enabled.

 : My earlier comments still stand: the proposed API is not usable as a
 : general purpose memory-tracking API because it counts faults instead
 : of memory, making it inadequate except for the most trivial cases.
 : And I cannot believe you were serious when you mentioned that you were
 : happy to make that the API.

To avoid going in circles, I think we need to first agree on the scope of the uAPI.
Specifically, do we want to shoot for a generic write-tracking API, or do we want
something that is explicitly tied to dirty logging?


Marc,

If we figured out a clean-ish way to tie the "gfns dirtied" information to
dirty logging, i.e. didn't misconstrue the counts as generally useful data, would
that be acceptable?  While I like the idea of a generic solution, I don't see a
path to an implementation that isn't deeply flawed without basically doing dirty
logging, i.e. without forcing the use of non-huge pages and write-protecting memory
to intercept "new" writes based on input from userspace.

[*] https://lore.kernel.org/all/20221113170507.208810-2-shivam.kumar1@nutanix.com

Patch

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 87e3da7b0439..791456233f28 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -44,6 +44,7 @@ config KVM
 	select KVM_XFER_TO_GUEST_WORK
 	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
 	select KVM_VFIO
+	select HAVE_KVM_DIRTY_QUOTA
 	select HAVE_KVM_PM_NOTIFIER if PM
 	select KVM_GENERIC_HARDWARE_ENABLING
 	help
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2d6cdeab1f8a..fa0b3853ee31 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3397,8 +3397,12 @@ static bool fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu,
 	if (!try_cmpxchg64(sptep, &old_spte, new_spte))
 		return false;
 
-	if (is_writable_pte(new_spte) && !is_writable_pte(old_spte))
+	if (is_writable_pte(new_spte) && !is_writable_pte(old_spte)) {
+		struct kvm_mmu_page *sp = sptep_to_sp(sptep);
+
+		update_dirty_quota(vcpu->kvm, (1L << SPTE_LEVEL_SHIFT(sp->role.level)));
 		mark_page_dirty_in_slot(vcpu->kvm, fault->slot, fault->gfn);
+	}
 
 	return true;
 }
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 4a599130e9c9..550f9c1d03af 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -241,6 +241,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
 		/* Enforced by kvm_mmu_hugepage_adjust. */
 		WARN_ON_ONCE(level > PG_LEVEL_4K);
+		update_dirty_quota(vcpu->kvm, (1L << SPTE_LEVEL_SHIFT(level)));
 		mark_page_dirty_in_slot(vcpu->kvm, slot, gfn);
 	}
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 1111d9d08903..e2f8764c16ff 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5864,6 +5864,9 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
 		 */
 		if (__xfer_to_guest_mode_work_pending())
 			return 1;
+
+		if (kvm_test_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu))
+			return 1;
 	}
 
 	return 1;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 48a61d283406..4f36c0efb542 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10829,7 +10829,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			r = 0;
 			goto out;
 		}
-
+		if (kvm_check_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu)) {
+			vcpu->run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
+			r = 0;
+			goto out;
+		}
 		/*
 		 * KVM_REQ_HV_STIMER has to be processed after
 		 * KVM_REQ_CLOCK_UPDATE, because Hyper-V SynIC timers