
[RFC,0/9] KVM: x86/mmu: Preserve Accessed bits on PROT changes

Message ID: 20240801183453.57199-1-seanjc@google.com

Message

Sean Christopherson Aug. 1, 2024, 6:34 p.m. UTC
This applies on top of the massive "follow pfn" rework[*].  The gist is to
avoid losing accessed information, e.g. because NUMA balancing mucks with
PTEs, by preserving accessed state when KVM zaps SPTEs in response to
mmu_notifier invalidations that are for protection changes, e.g. PROT_NUMA.
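
As a very rough sketch of the idea (this is not code from the series; the
wrapper name is invented for illustration, and kvm_set_pfn_accessed() may
look different on top of the follow-pfn rework this sits on), the zap path
could feed the SPTE's Accessed bit back to the primary MMU when the
invalidation is protection-only:

  /*
   * Illustrative only: if a present SPTE being zapped for a protection-only
   * invalidation (e.g. NUMA balancing's PROT_NONE conversion) was Accessed,
   * propagate that to the backing pfn instead of silently dropping it.
   */
  static void tdp_mmu_zap_preserve_accessed(u64 old_spte, bool is_prot_change)
  {
          if (!is_prot_change || !is_shadow_present_pte(old_spte))
                  return;

          if (is_accessed_spte(old_spte))
                  kvm_set_pfn_accessed(spte_to_pfn(old_spte));
  }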

RFC as I haven't done any testing to verify whether or not this has any
impact on page aging, let alone a _positive_ impact.  Personally, I'm not
at all convinced that this is necessary outside of tests that care about
exact counts, e.g. KVM selftests.

That said, I do think patches 1-7 would be worth merging on their own.
Using A/D bits to track state even when A/D bits are disabled in hardware
is a nice cleanup.

[*] https://lore.kernel.org/all/20240726235234.228822-1-seanjc@google.com

Sean Christopherson (9):
  KVM: x86/mmu: Add a dedicated flag to track if A/D bits are globally
    enabled
  KVM: x86/mmu: Set shadow_accessed_mask for EPT even if A/D bits
    disabled
  KVM: x86/mmu: Set shadow_dirty_mask for EPT even if A/D bits disabled
  KVM: x86/mmu: Use Accessed bit even when _hardware_ A/D bits are
    disabled
  KVM: x86/mmu: Free up A/D bits in FROZEN_SPTE
  KVM: x86/mmu: Process only valid TDP MMU roots when aging a gfn range
  KVM: x86/mmu: Stop processing TDP MMU roots for test_age if young SPTE
    found
  KVM: Plumb mmu_notifier invalidation event type into arch code
  KVM: x86/mmu: Track SPTE accessed info across mmu_notifier PROT
    changes

 arch/x86/kvm/mmu/mmu.c     |  10 ++--
 arch/x86/kvm/mmu/spte.c    |  16 ++++--
 arch/x86/kvm/mmu/spte.h    |  39 +++++--------
 arch/x86/kvm/mmu/tdp_mmu.c | 113 +++++++++++++++++++++----------------
 include/linux/kvm_host.h   |   1 +
 virt/kvm/kvm_main.c        |   1 +
 6 files changed, 99 insertions(+), 81 deletions(-)


base-commit: 93a198738e0aeb3193ca39c9f01f66060b3c4910

Comments

David Matlack Aug. 5, 2024, 4:45 p.m. UTC | #1
On Thu, Aug 1, 2024 at 11:35 AM Sean Christopherson <seanjc@google.com> wrote:
>
> This applies on top of the massive "follow pfn" rework[*].  The gist is to
> avoid losing accessed information, e.g. because NUMA balancing mucks with
> PTEs,

What do you mean by "NUMA balancing mucks with PTEs"?

Sean Christopherson Aug. 5, 2024, 8:11 p.m. UTC | #2
On Mon, Aug 05, 2024, David Matlack wrote:
> On Thu, Aug 1, 2024 at 11:35 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > This applies on top of the massive "follow pfn" rework[*].  The gist is to
> > avoid losing accessed information, e.g. because NUMA balancing mucks with
> > PTEs,
> 
> What do you mean by "NUMA balancing mucks with PTEs"?

When NUMA auto-balancing is enabled, for VMAs the current task has been accessing,
the kernel will periodically change PTEs (in the primary MMU) to PROT_NONE, i.e.
make them !PRESENT.  That in turn results in mmu_notifier invalidations (usually
for the entire VMA, eventually) that cause KVM to unmap SPTEs.  If KVM doesn't
mark folios accessed when SPTEs are zapped, the NUMA-induced zapping effectively
loses the accessed information.
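
Purely as an illustration (not a diff from this series; the helper name
below is made up, though the MMU_NOTIFY_* values are the existing
enum mmu_notifier_event ones), patch 8's plumbing of the event type is
what would let the arch code tell these protection-only invalidations
apart:

  /*
   * Hypothetical helper: true if the invalidation only changes protections
   * (NUMA balancing, mprotect(), etc.), i.e. the underlying pages remain
   * the same and their accessed state is still worth preserving.
   */
  static bool mmu_range_is_prot_change(const struct mmu_notifier_range *range)
  {
          return range->event == MMU_NOTIFY_PROTECTION_VMA ||
                 range->event == MMU_NOTIFY_PROTECTION_PAGE;
  }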

For non-KVM setups, NUMA balancing works quite well because the cost of the #PF
to "fix" the NUMA-induced PROT_NONE is relatively low, especially compared to
the long-term costs of accessing remote memory.

For KVM, the cost vs. benefit is very different, as each mmu_notifier invalidation
forces KVM to emit a remote TLB flush, i.e. the cost is much higher.  And it's
also much more feasible (in practice) to affine vCPUs to single NUMA nodes, even
if vCPUs are pinned 1:1 with pCPUs, than it is to affine a random userspace task
to a NUMA node.

Which is why I'm not terribly concerned about optimizing NUMA auto-balancing; it's
already sub-optimal for KVM.