Message ID | 20240221195125.102479-3-shivam.kumar1@nutanix.com
---|---
State | New, archived
Series | Per-vCPU dirty quota-based throttling
On Wed, Feb 21, 2024, Shivam Kumar wrote:
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 2d6cdeab1f8a..fa0b3853ee31 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3397,8 +3397,12 @@ static bool fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu,
>  	if (!try_cmpxchg64(sptep, &old_spte, new_spte))
>  		return false;
>  
> -	if (is_writable_pte(new_spte) && !is_writable_pte(old_spte))
> +	if (is_writable_pte(new_spte) && !is_writable_pte(old_spte)) {
> +		struct kvm_mmu_page *sp = sptep_to_sp(sptep);
> +
> +		update_dirty_quota(vcpu->kvm, (1L << SPTE_LEVEL_SHIFT(sp->role.level)));
>  		mark_page_dirty_in_slot(vcpu->kvm, fault->slot, fault->gfn);

Forcing KVM to manually call update_dirty_quota() whenever
mark_page_dirty_in_slot() is invoked is not maintainable, as we will
inevitably forget to update the quota and probably not notice.  We've
already had bugs escape where KVM fails to mark gfns dirty, and those
flows are much more testable.

Stepping back, I feel like this series has gone off the rails a bit.

I understand Marc's objections to the uAPI not differentiating between
page sizes, but simply updating the quota based on KVM's page size is
also flawed.  E.g. if the guest is backed with 1GiB pages, odds are very
good that the dirty quotas are going to be completely out of whack due
to the first vCPU that writes a given 1GiB region being charged with the
entire 1GiB page.  And without a way to trigger detection of writes,
e.g. by enabling PML or write-protecting memory, I don't see how
userspace can build anything on the "bytes dirtied" information.

From v7[*], Marc was specifically objecting to the proposed API
effectively being presented as a general purpose API, when in reality
the API was heavily reliant on dirty logging being enabled.

 : My earlier comments still stand: the proposed API is not usable as a
 : general purpose memory-tracking API because it counts faults instead
 : of memory, making it inadequate except for the most trivial cases.
 : And I cannot believe you were serious when you mentioned that you were
 : happy to make that the API.

To avoid going in circles, I think we need to first agree on the scope
of the uAPI.  Specifically, do we want to shoot for a generic
write-tracking API, or do we want something that is explicitly tied to
dirty logging?

Marc,

If we figured out a clean-ish way to tie the "gfns dirtied" information
to dirty logging, i.e. didn't misconstrue the counts as generally useful
data, would that be acceptable?  While I like the idea of a generic
solution, I don't see a path to an implementation that isn't deeply
flawed without basically doing dirty logging, i.e. without forcing the
use of non-huge pages and write-protecting memory to intercept "new"
writes based on input from userspace.

[*] https://lore.kernel.org/all/20221113170507.208810-2-shivam.kumar1@nutanix.com
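To make the maintainability concern above concrete: if the quota charge
and the dirty-slot update were funneled through a single helper, a new
call site could not add one without the other.  The sketch below is only
an illustration of that idea and is not something proposed in this
thread; the wrapper name and signature are invented, and it assumes the
caller knows the mapping level (which is exactly what
mark_page_dirty_in_slot() does not take today).

	/*
	 * Hypothetical wrapper, for illustration only: charge the vCPU's
	 * dirty quota and mark the gfn dirty in one place, so callers
	 * cannot do one without the other.  update_dirty_quota() and
	 * mark_page_dirty_in_slot() are the functions used by the patch;
	 * the wrapper itself is not part of the series.
	 */
	static void kvm_mark_page_dirty_quota(struct kvm_vcpu *vcpu,
					      const struct kvm_memory_slot *slot,
					      gfn_t gfn, int level)
	{
		/* Charge the full bytes covered by the mapping level. */
		update_dirty_quota(vcpu->kvm, 1UL << SPTE_LEVEL_SHIFT(level));
		mark_page_dirty_in_slot(vcpu->kvm, slot, gfn);
	}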
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 87e3da7b0439..791456233f28 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -44,6 +44,7 @@ config KVM
 	select KVM_XFER_TO_GUEST_WORK
 	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
 	select KVM_VFIO
+	select HAVE_KVM_DIRTY_QUOTA
 	select HAVE_KVM_PM_NOTIFIER if PM
 	select KVM_GENERIC_HARDWARE_ENABLING
 	help
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2d6cdeab1f8a..fa0b3853ee31 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3397,8 +3397,12 @@ static bool fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu,
 	if (!try_cmpxchg64(sptep, &old_spte, new_spte))
 		return false;
 
-	if (is_writable_pte(new_spte) && !is_writable_pte(old_spte))
+	if (is_writable_pte(new_spte) && !is_writable_pte(old_spte)) {
+		struct kvm_mmu_page *sp = sptep_to_sp(sptep);
+
+		update_dirty_quota(vcpu->kvm, (1L << SPTE_LEVEL_SHIFT(sp->role.level)));
 		mark_page_dirty_in_slot(vcpu->kvm, fault->slot, fault->gfn);
+	}
 
 	return true;
 }
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 4a599130e9c9..550f9c1d03af 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -241,6 +241,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
 		/* Enforced by kvm_mmu_hugepage_adjust. */
 		WARN_ON_ONCE(level > PG_LEVEL_4K);
+		update_dirty_quota(vcpu->kvm, (1L << SPTE_LEVEL_SHIFT(level)));
 		mark_page_dirty_in_slot(vcpu->kvm, slot, gfn);
 	}
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 1111d9d08903..e2f8764c16ff 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5864,6 +5864,9 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
 		 */
 		if (__xfer_to_guest_mode_work_pending())
 			return 1;
+
+		if (kvm_test_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu))
+			return 1;
 	}
 
 	return 1;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 48a61d283406..4f36c0efb542 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10829,7 +10829,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			r = 0;
 			goto out;
 		}
-
+		if (kvm_check_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu)) {
+			vcpu->run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
+			r = 0;
+			goto out;
+		}
 		/*
 		 * KVM_REQ_HV_STIMER has to be processed after
 		 * KVM_REQ_CLOCK_UPDATE, because Hyper-V SynIC timers
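For context on how the KVM_EXIT_DIRTY_QUOTA_EXHAUSTED exit added above
would be consumed, a rough userspace sketch follows.  It assumes headers
built from a kernel carrying this series; the run->dirty_quota field and
the back-off policy are assumptions for illustration and are not shown
in this excerpt.

	#include <linux/kvm.h>
	#include <sys/ioctl.h>
	#include <unistd.h>

	/*
	 * Rough vCPU run loop: when the quota is exhausted, throttle the
	 * vCPU briefly and then grant a fresh quota (policy invented).
	 */
	static void run_vcpu(int vcpu_fd, struct kvm_run *run)
	{
		while (ioctl(vcpu_fd, KVM_RUN, 0) == 0) {
			switch (run->exit_reason) {
			case KVM_EXIT_DIRTY_QUOTA_EXHAUSTED:
				usleep(1000);			/* simple back-off */
				run->dirty_quota += 512;	/* assumed kvm_run field */
				break;
			default:
				/* other exit reasons handled as usual */
				break;
			}
		}
	}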