[v2,00/20] Introduce the TDP MMU

Message ID 20201014182700.2888246-1-bgardon@google.com (mailing list archive)

Message

Ben Gardon Oct. 14, 2020, 6:26 p.m. UTC
Over the years, the needs for KVM's x86 MMU have grown from running small
guests to live migrating multi-terabyte VMs with hundreds of vCPUs. Where
we previously depended on shadow paging to run all guests, we now have
two dimensional paging (TDP). This patch set introduces a new
implementation of much of the KVM MMU, optimized for running guests with
TDP. We have re-implemented many of the MMU functions to take advantage of
the relative simplicity of TDP and eliminate the need for an rmap.
Building on this simplified implementation, a future patch set will change
the synchronization model for this "TDP MMU" to enable more parallelism
than the monolithic MMU lock. A TDP MMU is currently in use at Google
and has given us the performance necessary to live migrate our 416 vCPU,
12TiB m2-ultramem-416 VMs.

This work was motivated by the need to handle page faults in parallel for
very large VMs. When VMs have hundreds of vCPUs and terabytes of memory,
KVM's MMU lock suffers extreme contention, resulting in soft-lockups and
long latency on guest page faults. This contention can be easily seen
running the KVM selftests demand_paging_test with a couple hundred vCPUs.
Over a 1 second profile of the demand_paging_test, with 416 vCPUs and 4G
per vCPU, 98% of the time was spent waiting for the MMU lock. At Google,
the TDP MMU reduced the test duration by 89% and the execution was
dominated by get_user_pages and the userfaultfd ioctl instead of the
MMU lock.

This series is the first of two. In this series we add a basic
implementation of the TDP MMU. In the next series we will improve the
performance of the TDP MMU and allow it to execute MMU operations
in parallel.

The overall purpose of the KVM MMU is to program paging structures
(CR3/EPT/NPT) to encode the mapping of guest addresses to host physical
addresses (HPA), and to provide utilities for other KVM features, for
example dirty logging. The definition of the L1 guest physical address
(GPA) to HPA mapping comes in two parts: KVM's memslots map GPA to HVA,
and the kernel MM/x86 host page tables map HVA to HPA. Without TDP, the
MMU must program the x86 page tables to encode the full translation of
guest virtual addresses (GVA) to HPA. This requires "shadowing" the
guest's page tables to create a composite x86 paging structure. This
solution is complicated, requires separate paging structures for each
guest CR3, and requires emulating guest page table changes. The TDP case
is much simpler. In this case, KVM lets the guest control CR3 and programs
the EPT/NPT paging structures with the GPA -> HPA mapping. The guest has
no way to change this mapping and only one version of the paging structure
is needed per L1 paging mode. In this case the paging mode is some
combination of the number of levels in the paging structure, the address
space (normal execution or system management mode, on x86), and other
attributes. Most VMs only ever use one paging mode and so only ever need one
TDP structure.
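
For readers less familiar with the memslot half of that mapping, here is a
minimal sketch (the struct fields are the real ones from struct
kvm_memory_slot, but the helper itself is illustrative rather than an exact
KVM function):

/*
 * First half of the GPA -> HPA definition: a memslot maps a contiguous
 * range of guest frame numbers to a contiguous range of host virtual
 * addresses in the VMM's address space. The second half, HVA -> HPA,
 * is resolved by the ordinary host page tables (e.g. via
 * get_user_pages()) when KVM handles a fault.
 */
static unsigned long gfn_to_hva_sketch(const struct kvm_memory_slot *slot,
				       gfn_t gfn)
{
	return slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;
}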

This series implements a "TDP MMU" through alternative implementations of
MMU functions for running L1 guests with TDP. The TDP MMU falls back to
the existing shadow paging implementation when TDP is not available, and
interoperates with the existing shadow paging implementation for nesting.
The use of the TDP MMU is controlled by a module parameter whose value is
snapshotted at VM creation and follows the life of the VM. This snapshot
is used in many functions to decide whether or not to use TDP MMU handlers
for a given operation.
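
A minimal sketch of that snapshot pattern (identifier names here are
illustrative assumptions, not necessarily the ones used in the series):

/* Writable module parameter; its value may change at any time. */
static bool __read_mostly tdp_mmu_enabled;
module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0644);

/* Hypothetical per-VM snapshot, taken once at VM creation. */
void kvm_mmu_init_tdp_mmu_sketch(struct kvm *kvm)
{
	kvm->arch.tdp_mmu_enabled = tdp_enabled && READ_ONCE(tdp_mmu_enabled);
}

/*
 * Later checks consult the snapshot rather than the parameter, so a
 * VM's choice of MMU never changes mid-life even if the parameter is
 * toggled after the VM has been created.
 */
static bool is_tdp_mmu_sketch(struct kvm *kvm)
{
	return kvm->arch.tdp_mmu_enabled;
}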

This series can also be viewed in Gerrit here:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538
(Thanks to Dmitry Vyukov <dvyukov@google.com> for setting up the
Gerrit instance)

Changes v1 -> v2:
  Big thanks to Paolo and Sean for your thorough reviews!
  - Moved some function definitions to mmu_internal.h instead of just
    declaring them there.
  - Dropped the commit to add an as_id field to memslots in favor of
    Peter Xu's which is part of the dirty ring patch set. I've included a
    copy of that patch from v13 of the patch set in this series.
  - Fixed comment style on SPDX license headers
  - Added a min_level to the tdp_iter and removed the tdp_iter_no_step_down
    function
  - Removed the tdp_mmu module parameter and defaulted the global to false
  - Unified more NX reclaim code
  - Added helper functions for setting SPTEs in the TDP MMU
  - Renamed tdp_iter macros for clarity
  - Renamed kvm_tdp_mmu_page_fault to kvm_tdp_mmu_map and gave it the same
    signature as __direct_map
  - Converted some WARN_ONs to BUG_ONs or removed them
  - Changed dlog to dirty_log to match convention
  - Switched make_spte to return a return code and use a return parameter
    for the new SPTE
  - Refactored TDP MMU root allocation
  - Other misc cleanups and bug fixes

Ben Gardon (19):
  kvm: x86/mmu: Separate making SPTEs from set_spte
  kvm: x86/mmu: Introduce tdp_iter
  kvm: x86/mmu: Init / Uninit the TDP MMU
  kvm: x86/mmu: Allocate and free TDP MMU roots
  kvm: x86/mmu: Add functions to handle changed TDP SPTEs
  kvm: x86/mmu: Support zapping SPTEs in the TDP MMU
  kvm: x86/mmu: Separate making non-leaf sptes from link_shadow_page
  kvm: x86/mmu: Remove disallowed_hugepage_adjust shadow_walk_iterator
    arg
  kvm: x86/mmu: Add TDP MMU PF handler
  kvm: x86/mmu: Allocate struct kvm_mmu_pages for all pages in TDP MMU
  kvm: x86/mmu: Support invalidate range MMU notifier for TDP MMU
  kvm: x86/mmu: Add access tracking for tdp_mmu
  kvm: x86/mmu: Support changed pte notifier in tdp MMU
  kvm: x86/mmu: Support dirty logging for the TDP MMU
  kvm: x86/mmu: Support disabling dirty logging for the tdp MMU
  kvm: x86/mmu: Support write protection for nesting in tdp MMU
  kvm: x86/mmu: Support MMIO in the TDP MMU
  kvm: x86/mmu: Don't clear write flooding count for direct roots
  kvm: x86/mmu: NX largepage recovery for TDP MMU

Peter Xu (1):
  KVM: Cache as_id in kvm_memory_slot

 arch/x86/include/asm/kvm_host.h |   14 +
 arch/x86/kvm/Makefile           |    3 +-
 arch/x86/kvm/mmu/mmu.c          |  487 +++++++------
 arch/x86/kvm/mmu/mmu_internal.h |  242 +++++++
 arch/x86/kvm/mmu/paging_tmpl.h  |    3 +-
 arch/x86/kvm/mmu/tdp_iter.c     |  181 +++++
 arch/x86/kvm/mmu/tdp_iter.h     |   60 ++
 arch/x86/kvm/mmu/tdp_mmu.c      | 1154 +++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.h      |   48 ++
 include/linux/kvm_host.h        |    2 +
 virt/kvm/kvm_main.c             |   12 +-
 11 files changed, 1944 insertions(+), 262 deletions(-)
 create mode 100644 arch/x86/kvm/mmu/tdp_iter.c
 create mode 100644 arch/x86/kvm/mmu/tdp_iter.h
 create mode 100644 arch/x86/kvm/mmu/tdp_mmu.c
 create mode 100644 arch/x86/kvm/mmu/tdp_mmu.h

Comments

Paolo Bonzini Oct. 16, 2020, 4:50 p.m. UTC | #1
On 14/10/20 20:26, Ben Gardon wrote:
>  arch/x86/include/asm/kvm_host.h |   14 +
>  arch/x86/kvm/Makefile           |    3 +-
>  arch/x86/kvm/mmu/mmu.c          |  487 +++++++------
>  arch/x86/kvm/mmu/mmu_internal.h |  242 +++++++
>  arch/x86/kvm/mmu/paging_tmpl.h  |    3 +-
>  arch/x86/kvm/mmu/tdp_iter.c     |  181 +++++
>  arch/x86/kvm/mmu/tdp_iter.h     |   60 ++
>  arch/x86/kvm/mmu/tdp_mmu.c      | 1154 +++++++++++++++++++++++++++++++
>  arch/x86/kvm/mmu/tdp_mmu.h      |   48 ++
>  include/linux/kvm_host.h        |    2 +
>  virt/kvm/kvm_main.c             |   12 +-
>  11 files changed, 1944 insertions(+), 262 deletions(-)
>  create mode 100644 arch/x86/kvm/mmu/tdp_iter.c
>  create mode 100644 arch/x86/kvm/mmu/tdp_iter.h
>  create mode 100644 arch/x86/kvm/mmu/tdp_mmu.c
>  create mode 100644 arch/x86/kvm/mmu/tdp_mmu.h
> 

My implementation of tdp_iter_set_spte was completely different, but
of course that's not an issue; I would still like to understand and
comment on why the bool arguments to __tdp_mmu_set_spte are needed.

Apart from splitting tdp_mmu_iter_flush_cond_resched from
tdp_mmu_iter_cond_resched, my remaining changes on top are pretty
small and mostly cosmetic.  I'll give it another go next week
and send it Linus's way if everything's all right.

Paolo

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index f8525c89fc95..baf260421a56 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -7,20 +7,15 @@
 #include "tdp_mmu.h"
 #include "spte.h"
 
+#ifdef CONFIG_X86_64
 static bool __read_mostly tdp_mmu_enabled = false;
+module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0644);
+#endif
 
 static bool is_tdp_mmu_enabled(void)
 {
 #ifdef CONFIG_X86_64
-	if (!READ_ONCE(tdp_mmu_enabled))
-		return false;
-
-	if (WARN_ONCE(!tdp_enabled,
-		      "Creating a VM with TDP MMU enabled requires TDP."))
-		return false;
-
-	return true;
-
+	return tdp_enabled && READ_ONCE(tdp_mmu_enabled);
 #else
 	return false;
 #endif /* CONFIG_X86_64 */
@@ -277,8 +277,8 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 			unaccount_huge_nx_page(kvm, sp);
 
 		for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
-			old_child_spte = *(pt + i);
-			*(pt + i) = 0;
+			old_child_spte = READ_ONCE(*(pt + i));
+			WRITE_ONCE(*(pt + i), 0);
 			handle_changed_spte(kvm, as_id,
 				gfn + (i * KVM_PAGES_PER_HPAGE(level - 1)),
 				old_child_spte, 0, level - 1);
@@ -309,7 +309,7 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
 	struct kvm_mmu_page *root = sptep_to_sp(root_pt);
 	int as_id = kvm_mmu_page_as_id(root);
 
-	*iter->sptep = new_spte;
+	WRITE_ONCE(*iter->sptep, new_spte);
 
 	__handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte,
 			      iter->level);
@@ -361,16 +361,28 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
 	for_each_tdp_pte(_iter, __va(_mmu->root_hpa),		\
 			 _mmu->shadow_root_level, _start, _end)
 
-static bool tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
+/*
+ * Flush the TLB if the process should drop kvm->mmu_lock.
+ * Return whether the caller still needs to flush the tlb.
+ */
+static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
 {
 	if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
 		kvm_flush_remote_tlbs(kvm);
 		cond_resched_lock(&kvm->mmu_lock);
 		tdp_iter_refresh_walk(iter);
+		return false;
+	} else {
 		return true;
 	}
+}
 
-	return false;
+static void tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
+{
+	if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+		cond_resched_lock(&kvm->mmu_lock);
+		tdp_iter_refresh_walk(iter);
+	}
 }
 
 /*
@@ -407,7 +419,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 		tdp_mmu_set_spte(kvm, &iter, 0);
 
 		if (can_yield)
-			flush_needed = !tdp_mmu_iter_cond_resched(kvm, &iter);
+			flush_needed = tdp_mmu_iter_flush_cond_resched(kvm, &iter);
 		else
 			flush_needed = true;
 	}
@@ -479,7 +479,10 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu, int write,
 					 map_writable, !shadow_accessed_mask,
 					 &new_spte);
 
-	tdp_mmu_set_spte(vcpu->kvm, iter, new_spte);
+	if (new_spte == iter->old_spte)
+		ret = RET_PF_SPURIOUS;
+	else
+		tdp_mmu_set_spte(vcpu->kvm, iter, new_spte);
 
 	/*
 	 * If the page fault was caused by a write but the page is write
@@ -496,7 +496,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu, int write,
 	}
 
 	/* If a MMIO SPTE is installed, the MMIO will need to be emulated. */
-	if (unlikely(is_mmio_spte(new_spte)))
+	else if (unlikely(is_mmio_spte(new_spte)))
 		ret = RET_PF_EMULATE;
 
 	trace_kvm_mmu_set_spte(iter->level, iter->gfn, iter->sptep);
@@ -528,8 +528,10 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	int level;
 	int req_level;
 
-	BUG_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa));
-	BUG_ON(!is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa));
+	if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
+		return RET_PF_ENTRY;
+	if (WARN_ON(!is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa)))
+		return RET_PF_ENTRY;
 
 	level = kvm_mmu_hugepage_adjust(vcpu, gfn, max_level, &pfn,
 					huge_page_disallowed, &req_level);
@@ -579,7 +581,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 		}
 	}
 
-	BUG_ON(iter.level != level);
+	if (WARN_ON(iter.level != level))
+		return RET_PF_RETRY;
 
 	ret = tdp_mmu_map_handle_target_level(vcpu, write, map_writable, &iter,
 					      pfn, prefault);
@@ -817,9 +829,8 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm, struct kvm_memory_slot *slot,
 		 */
 		kvm_mmu_get_root(kvm, root);
 
-		spte_set = wrprot_gfn_range(kvm, root, slot->base_gfn,
-				slot->base_gfn + slot->npages, min_level) ||
-			   spte_set;
+		spte_set |= wrprot_gfn_range(kvm, root, slot->base_gfn,
+			     slot->base_gfn + slot->npages, min_level);
 
 		kvm_mmu_put_root(kvm, root);
 	}
@@ -886,8 +897,8 @@ bool kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm, struct kvm_memory_slot *slot)
 		 */
 		kvm_mmu_get_root(kvm, root);
 
-		spte_set = clear_dirty_gfn_range(kvm, root, slot->base_gfn,
-				slot->base_gfn + slot->npages) || spte_set;
+		spte_set |= clear_dirty_gfn_range(kvm, root, slot->base_gfn,
+				slot->base_gfn + slot->npages);
 
 		kvm_mmu_put_root(kvm, root);
 	}
@@ -1009,8 +1020,8 @@ bool kvm_tdp_mmu_slot_set_dirty(struct kvm *kvm, struct kvm_memory_slot *slot)
 		 */
 		kvm_mmu_get_root(kvm, root);
 
-		spte_set = set_dirty_gfn_range(kvm, root, slot->base_gfn,
-				slot->base_gfn + slot->npages) || spte_set;
+		spte_set |= set_dirty_gfn_range(kvm, root, slot->base_gfn,
+				slot->base_gfn + slot->npages);
 
 		kvm_mmu_put_root(kvm, root);
 	}
@@ -1042,9 +1053,9 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 			continue;
 
 		tdp_mmu_set_spte(kvm, &iter, 0);
-		spte_set = true;
 
-		spte_set = !tdp_mmu_iter_cond_resched(kvm, &iter);
+		spte_set = tdp_mmu_iter_flush_cond_resched(kvm, &iter);
 	}
 
 	if (spte_set)
Ben Gardon Oct. 19, 2020, 6:15 p.m. UTC | #2
On Fri, Oct 16, 2020 at 9:50 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 14/10/20 20:26, Ben Gardon wrote:
> >  arch/x86/include/asm/kvm_host.h |   14 +
> >  arch/x86/kvm/Makefile           |    3 +-
> >  arch/x86/kvm/mmu/mmu.c          |  487 +++++++------
> >  arch/x86/kvm/mmu/mmu_internal.h |  242 +++++++
> >  arch/x86/kvm/mmu/paging_tmpl.h  |    3 +-
> >  arch/x86/kvm/mmu/tdp_iter.c     |  181 +++++
> >  arch/x86/kvm/mmu/tdp_iter.h     |   60 ++
> >  arch/x86/kvm/mmu/tdp_mmu.c      | 1154 +++++++++++++++++++++++++++++++
> >  arch/x86/kvm/mmu/tdp_mmu.h      |   48 ++
> >  include/linux/kvm_host.h        |    2 +
> >  virt/kvm/kvm_main.c             |   12 +-
> >  11 files changed, 1944 insertions(+), 262 deletions(-)
> >  create mode 100644 arch/x86/kvm/mmu/tdp_iter.c
> >  create mode 100644 arch/x86/kvm/mmu/tdp_iter.h
> >  create mode 100644 arch/x86/kvm/mmu/tdp_mmu.c
> >  create mode 100644 arch/x86/kvm/mmu/tdp_mmu.h
> >
>
> My implementation of tdp_iter_set_spte was completely different, but
> of course that's not an issue; I would still like to understand and
> comment on why the bool arguments to __tdp_mmu_set_spte are needed.

The simplest explanation for those options (to not mark the page as
dirty in the dirty bitmap, or to not mark the page accessed) is simply
that the legacy MMU doesn't do it either, but I will outline more
specifically why it doesn't.

Let's consider dirty logging first. When getting the dirty log, we
follow the following steps:
1. Atomically get and clear an unsigned long of the dirty bitmap
2. For each GFN in the range of pages covered by the unsigned long mask:
    3. Clear the dirty or writable bit on the SPTE
4. Copy the mask of dirty pages to be returned to userspace

If we mark the page as dirty in the dirty bitmap in step 3, we'll
report the page as dirty twice - once in this dirty log call, and
again in the next one. This can lead to unexpected behavior:
1. Pause all vCPUs
2. Get the dirty log <--- Returns all pages dirtied before the vCPUs were paused
3. Get the dirty log again <--- Unexpectedly returns a non-zero number
of dirty pages even though no pages were actually dirtied

I believe a similar process happens for access tracking through the
MMU notifiers, which would lead to incorrect behavior if we called
kvm_set_pfn_accessed during the handler for notifier_clear_young or
notifier_clear_flush_young.
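
In other words, the bools let those specific callers opt out of exactly
that bookkeeping. A rough sketch of the wrapper pattern (signatures
paraphrased from the series, e.g. tdp_mmu_set_spte_no_dirty_log in the
diff above; not guaranteed to match the code line for line):

/*
 * Normal SPTE updates record both access and dirty state; the
 * dirty-log clearing and page-aging paths suppress exactly the
 * bookkeeping that would turn clearing a bit into a new event.
 */
static inline void tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
				    u64 new_spte)
{
	__tdp_mmu_set_spte(kvm, iter, new_spte, true, true);
}

/* Dirty-log clearing: rewrite the SPTE without re-marking the gfn dirty. */
static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
						 struct tdp_iter *iter,
						 u64 new_spte)
{
	__tdp_mmu_set_spte(kvm, iter, new_spte, true, false);
}

/* Aging via clear_young/clear_flush_young: rewrite the SPTE without
 * calling kvm_set_pfn_accessed() on the page being aged. */
static inline void tdp_mmu_set_spte_no_acc_track(struct kvm *kvm,
						 struct tdp_iter *iter,
						 u64 new_spte)
{
	__tdp_mmu_set_spte(kvm, iter, new_spte, false, true);
}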

>
> Apart from splitting tdp_mmu_iter_flush_cond_resched from
> tdp_mmu_iter_cond_resched, my remaining changes on top are pretty
> small and mostly cosmetic.  I'll give it another go next week
> and send it Linus's way if everything's all right.

Fantastic, thank you!

>
> Paolo
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index f8525c89fc95..baf260421a56 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -7,20 +7,15 @@
>  #include "tdp_mmu.h"
>  #include "spte.h"
>
> +#ifdef CONFIG_X86_64
>  static bool __read_mostly tdp_mmu_enabled = false;
> +module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0644);
> +#endif
>
>  static bool is_tdp_mmu_enabled(void)
>  {
>  #ifdef CONFIG_X86_64
> -       if (!READ_ONCE(tdp_mmu_enabled))
> -               return false;
> -
> -       if (WARN_ONCE(!tdp_enabled,
> -                     "Creating a VM with TDP MMU enabled requires TDP."))
> -               return false;
> -
> -       return true;
> -
> +       return tdp_enabled && READ_ONCE(tdp_mmu_enabled);
>  #else
>         return false;
>  #endif /* CONFIG_X86_64 */
> @@ -277,8 +277,8 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>                         unaccount_huge_nx_page(kvm, sp);
>
>                 for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
> -                       old_child_spte = *(pt + i);
> -                       *(pt + i) = 0;
> +                       old_child_spte = READ_ONCE(*(pt + i));
> +                       WRITE_ONCE(*(pt + i), 0);
>                         handle_changed_spte(kvm, as_id,
>                                 gfn + (i * KVM_PAGES_PER_HPAGE(level - 1)),
>                                 old_child_spte, 0, level - 1);
> @@ -309,7 +309,7 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
>         struct kvm_mmu_page *root = sptep_to_sp(root_pt);
>         int as_id = kvm_mmu_page_as_id(root);
>
> -       *iter->sptep = new_spte;
> +       WRITE_ONCE(*iter->sptep, new_spte);
>
>         __handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte,
>                               iter->level);
> @@ -361,16 +361,28 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
>         for_each_tdp_pte(_iter, __va(_mmu->root_hpa),           \
>                          _mmu->shadow_root_level, _start, _end)
>
> -static bool tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
> +/*
> + * Flush the TLB if the process should drop kvm->mmu_lock.
> + * Return whether the caller still needs to flush the tlb.
> + */
> +static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
>  {
>         if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
>                 kvm_flush_remote_tlbs(kvm);
>                 cond_resched_lock(&kvm->mmu_lock);
>                 tdp_iter_refresh_walk(iter);
> +               return false;
> +       } else {
>                 return true;
>         }
> +}
>
> -       return false;
> +static void tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
> +{
> +       if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
> +               cond_resched_lock(&kvm->mmu_lock);
> +               tdp_iter_refresh_walk(iter);
> +       }
>  }
>
>  /*
> @@ -407,7 +419,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>                 tdp_mmu_set_spte(kvm, &iter, 0);
>
>                 if (can_yield)
> -                       flush_needed = !tdp_mmu_iter_cond_resched(kvm, &iter);
> +                       flush_needed = tdp_mmu_iter_flush_cond_resched(kvm, &iter);
>                 else
>                         flush_needed = true;
>         }
> @@ -479,7 +479,10 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu, int write,
>                                          map_writable, !shadow_accessed_mask,
>                                          &new_spte);
>
> -       tdp_mmu_set_spte(vcpu->kvm, iter, new_spte);
> +       if (new_spte == iter->old_spte)
> +               ret = RET_PF_SPURIOUS;
> +       else
> +               tdp_mmu_set_spte(vcpu->kvm, iter, new_spte);
>
>         /*
>          * If the page fault was caused by a write but the page is write
> @@ -496,7 +496,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu, int write,
>         }
>
>         /* If a MMIO SPTE is installed, the MMIO will need to be emulated. */
> -       if (unlikely(is_mmio_spte(new_spte)))
> +       else if (unlikely(is_mmio_spte(new_spte)))
>                 ret = RET_PF_EMULATE;
>
>         trace_kvm_mmu_set_spte(iter->level, iter->gfn, iter->sptep);
> @@ -528,8 +528,10 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>         int level;
>         int req_level;
>
> -       BUG_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa));
> -       BUG_ON(!is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa));
> +       if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
> +               return RET_PF_ENTRY;
> +       if (WARN_ON(!is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa)))
> +               return RET_PF_ENTRY;
>
>         level = kvm_mmu_hugepage_adjust(vcpu, gfn, max_level, &pfn,
>                                         huge_page_disallowed, &req_level);
> @@ -579,7 +581,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>                 }
>         }
>
> -       BUG_ON(iter.level != level);
> +       if (WARN_ON(iter.level != level))
> +               return RET_PF_RETRY;
>
>         ret = tdp_mmu_map_handle_target_level(vcpu, write, map_writable, &iter,
>                                               pfn, prefault);
> @@ -817,9 +829,8 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm, struct kvm_memory_slot *slot,
>                  */
>                 kvm_mmu_get_root(kvm, root);
>
> -               spte_set = wrprot_gfn_range(kvm, root, slot->base_gfn,
> -                               slot->base_gfn + slot->npages, min_level) ||
> -                          spte_set;
> +               spte_set |= wrprot_gfn_range(kvm, root, slot->base_gfn,
> +                            slot->base_gfn + slot->npages, min_level);
>
>                 kvm_mmu_put_root(kvm, root);
>         }
> @@ -886,8 +897,8 @@ bool kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm, struct kvm_memory_slot *slot)
>                  */
>                 kvm_mmu_get_root(kvm, root);
>
> -               spte_set = clear_dirty_gfn_range(kvm, root, slot->base_gfn,
> -                               slot->base_gfn + slot->npages) || spte_set;
> +               spte_set |= clear_dirty_gfn_range(kvm, root, slot->base_gfn,
> +                               slot->base_gfn + slot->npages);
>
>                 kvm_mmu_put_root(kvm, root);
>         }
> @@ -1009,8 +1020,8 @@ bool kvm_tdp_mmu_slot_set_dirty(struct kvm *kvm, struct kvm_memory_slot *slot)
>                  */
>                 kvm_mmu_get_root(kvm, root);
>
> -               spte_set = set_dirty_gfn_range(kvm, root, slot->base_gfn,
> -                               slot->base_gfn + slot->npages) || spte_set;
> +               spte_set |= set_dirty_gfn_range(kvm, root, slot->base_gfn,
> +                               slot->base_gfn + slot->npages);
>
>                 kvm_mmu_put_root(kvm, root);
>         }
> @@ -1042,9 +1053,9 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
>                         continue;
>
>                 tdp_mmu_set_spte(kvm, &iter, 0);
> -               spte_set = true;
>
> -               spte_set = !tdp_mmu_iter_cond_resched(kvm, &iter);
> +               spte_set = tdp_mmu_iter_flush_cond_resched(kvm, &iter);
>         }
>
>         if (spte_set)
>
Paolo Bonzini Oct. 20, 2020, 8:07 a.m. UTC | #3
On 19/10/20 20:15, Ben Gardon wrote:
> When getting the dirty log, we
> follow the following steps:
> 1. Atomically get and clear an unsigned long of the dirty bitmap
> 2. For each GFN in the range of pages covered by the unsigned long mask:
>     3. Clear the dirty or writable bit on the SPTE
> 4. Copy the mask of dirty pages to be returned to userspace
> 
> If we mark the page as dirty in the dirty bitmap in step 3, we'll
> report the page as dirty twice - once in this dirty log call, and
> again in the next one. This can lead to unexpected behavior:
> 1. Pause all vCPUs
> 2. Get the dirty log <--- Returns all pages dirtied before the vCPUs were paused
> 3. Get the dirty log again <--- Unexpectedly returns a non-zero number
> of dirty pages even though no pages were actually dirtied

Got it, that might also fail the dirty_log_test.  Thanks!

Paolo