
[RFC,0/9] KVM: x86/mmu: Preserve Accessed bits on PROT changes

Message ID: 20240801183453.57199-1-seanjc@google.com

Message

Sean Christopherson Aug. 1, 2024, 6:34 p.m. UTC
This applies on top of the massive "follow pfn" rework[*].  The gist is to
avoid losing accessed information, e.g. because NUMA balancing mucks with
PTEs, by preserving accessed state when KVM zaps SPTEs in response to
mmu_notifier invalidations that are for protection changes, e.g. PROT_NUMA.
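
As a very rough sketch of the idea (this is not code from the series; the
wrapper name is invented for illustration, and kvm_set_pfn_accessed() may
look different on top of the follow-pfn rework this sits on), the zap path
could feed the SPTE's Accessed bit back to the primary MMU when the
invalidation is protection-only:

  /*
   * Illustrative only: if a present SPTE being zapped for a protection-only
   * invalidation (e.g. NUMA balancing's PROT_NONE conversion) was Accessed,
   * propagate that to the backing pfn instead of silently dropping it.
   */
  static void tdp_mmu_zap_preserve_accessed(u64 old_spte, bool is_prot_change)
  {
          if (!is_prot_change || !is_shadow_present_pte(old_spte))
                  return;

          if (is_accessed_spte(old_spte))
                  kvm_set_pfn_accessed(spte_to_pfn(old_spte));
  }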

RFC as I haven't done any testing to verify whether or not this has any
impact on page aging, let alone a _positive_ impact.  Personally, I'm not
at all convinced that this is necessary outside of tests that care about
exact counts, e.g. KVM selftests.

That said, I do think patches 1-7 would be worth merging on their own.
Using A/D bits to track state even when A/D bits are disabled in hardware
is a nice cleanup.

[*] https://lore.kernel.org/all/20240726235234.228822-1-seanjc@google.com

Sean Christopherson (9):
  KVM: x86/mmu: Add a dedicated flag to track if A/D bits are globally
    enabled
  KVM: x86/mmu: Set shadow_accessed_mask for EPT even if A/D bits
    disabled
  KVM: x86/mmu: Set shadow_dirty_mask for EPT even if A/D bits disabled
  KVM: x86/mmu: Use Accessed bit even when _hardware_ A/D bits are
    disabled
  KVM: x86/mmu: Free up A/D bits in FROZEN_SPTE
  KVM: x86/mmu: Process only valid TDP MMU roots when aging a gfn range
  KVM: x86/mmu: Stop processing TDP MMU roots for test_age if young SPTE
    found
  KVM: Plumb mmu_notifier invalidation event type into arch code
  KVM: x86/mmu: Track SPTE accessed info across mmu_notifier PROT
    changes

 arch/x86/kvm/mmu/mmu.c     |  10 ++--
 arch/x86/kvm/mmu/spte.c    |  16 ++++--
 arch/x86/kvm/mmu/spte.h    |  39 +++++--------
 arch/x86/kvm/mmu/tdp_mmu.c | 113 +++++++++++++++++++++----------------
 include/linux/kvm_host.h   |   1 +
 virt/kvm/kvm_main.c        |   1 +
 6 files changed, 99 insertions(+), 81 deletions(-)


base-commit: 93a198738e0aeb3193ca39c9f01f66060b3c4910

Comments

David Matlack Aug. 5, 2024, 4:45 p.m. UTC | #1
On Thu, Aug 1, 2024 at 11:35 AM Sean Christopherson <seanjc@google.com> wrote:
>
> This applies on top of the massive "follow pfn" rework[*].  The gist is to
> avoid losing accessed information, e.g. because NUMA balancing mucks with
> PTEs,

What do you mean by "NUMA balancing mucks with PTEs"?

Sean Christopherson Aug. 5, 2024, 8:11 p.m. UTC | #2
On Mon, Aug 05, 2024, David Matlack wrote:
> On Thu, Aug 1, 2024 at 11:35 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > This applies on top of the massive "follow pfn" rework[*].  The gist is to
> > avoid losing accessed information, e.g. because NUMA balancing mucks with
> > PTEs,
> 
> What do you mean by "NUMA balancing mucks with PTEs"?

When NUMA auto-balancing is enabled, for VMAs the current task has been accessing,
the kernel will periodically change PTEs (in the primary MMU) to PROT_NONE, i.e.
make them !PRESENT.  That in turn results in mmu_notifier invalidations (usually
for the entire VMA, eventually) that cause KVM to unmap SPTEs.  If KVM doesn't
mark folios accessed when SPTEs are zapped, the NUMA-induced zapping effectively
loses the accessed information.
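
Purely as an illustration (not a diff from this series; the helper name
below is made up, though the MMU_NOTIFY_* values are the existing
enum mmu_notifier_event ones), patch 8's plumbing of the event type is
what would let the arch code tell these protection-only invalidations
apart:

  /*
   * Hypothetical helper: true if the invalidation only changes protections
   * (NUMA balancing, mprotect(), etc.), i.e. the underlying pages remain
   * the same and their accessed state is still worth preserving.
   */
  static bool mmu_range_is_prot_change(const struct mmu_notifier_range *range)
  {
          return range->event == MMU_NOTIFY_PROTECTION_VMA ||
                 range->event == MMU_NOTIFY_PROTECTION_PAGE;
  }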

For non-KVM setups, NUMA balancing works quite well because the cost of the #PF
to "fix" the NUMA-induced PROT_NONE is relatively low, especially compared to
the long-term costs of accessing remote memory.

For KVM, the cost vs. benefit is very different, as each mmu_notifier invalidation
forces KVM to emit a remote TLB flush, i.e. the cost is much higher.  And it's
also much more feasible (in practice) to affine vCPUs to single NUMA nodes, even
if vCPUs are pinned 1:1 with pCPUs, than it is to affine a random userspace task
to a NUMA node.

Which is why I'm not terribly concerned about optimizing NUMA auto-balancing; it's
already sub-optimal for KVM.