[15/22] kvm: mmu: Support changed pte notifier in tdp MMU

Message ID: 20200925212302.3979661-16-bgardon@google.com
Series: Introduce the TDP MMU

Commit Message

Ben Gardon Sept. 25, 2020, 9:22 p.m. UTC
In order to interoperate correctly with the rest of KVM and other Linux
subsystems, the TDP MMU must handle the various MMU notifiers. Add a
hook to handle the change_pte MMU notifier.

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/mmu.c          | 46 +++++++++++++------------
 arch/x86/kvm/mmu/mmu_internal.h | 13 +++++++
 arch/x86/kvm/mmu/tdp_mmu.c      | 61 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.h      |  3 ++
 4 files changed, 102 insertions(+), 21 deletions(-)
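
For context, the resulting change_pte notifier path is roughly the
following (a simplified call-chain sketch, not verbatim code; the generic
notifier side lives in virt/kvm/kvm_main.c):

	kvm_mmu_notifier_change_pte()                      /* generic MMU notifier */
	    kvm_set_spte_hva(kvm, hva, pte)                /* x86 arch hook */
	        kvm_handle_hva(..., kvm_set_pte_rmapp)     /* legacy shadow MMU */
	        kvm_tdp_mmu_set_spte_hva(kvm, hva, &pte)   /* new TDP MMU hook */
	            kvm_tdp_mmu_handle_hva_range(..., set_tdp_spte)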

Comments

Paolo Bonzini Sept. 26, 2020, 12:33 a.m. UTC | #1
On 25/09/20 23:22, Ben Gardon wrote:
> @@ -1708,6 +1695,21 @@ static int kvm_unmap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
>  	return kvm_zap_rmapp(kvm, rmap_head);
>  }
>  
> +u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn)
> +{
> +	u64 new_spte;
> +
> +	new_spte = old_spte & ~PT64_BASE_ADDR_MASK;
> +	new_spte |= (u64)new_pfn << PAGE_SHIFT;
> +
> +	new_spte &= ~PT_WRITABLE_MASK;
> +	new_spte &= ~SPTE_HOST_WRITEABLE;
> +
> +	new_spte = mark_spte_for_access_track(new_spte);
> +
> +	return new_spte;
> +}
> +

And another. :)
Paolo Bonzini Sept. 28, 2020, 3:11 p.m. UTC | #2
On 25/09/20 23:22, Ben Gardon wrote:
> +		*iter.sptep = 0;
> +		handle_changed_spte(kvm, as_id, iter.gfn, iter.old_spte,
> +				    new_spte, iter.level);
> +

Can you explain why new_spte is passed here instead of 0?

All calls to handle_changed_spte are preceded by "*something = 
new_spte" except this one, so I'm thinking of having a change_spte 
function like

static void change_spte(struct kvm *kvm, int as_id, gfn_t gfn,
                        u64 *sptep, u64 new_spte, int level)
{
        u64 old_spte = *sptep;
        *sptep = new_spte;

        __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level);
        handle_changed_spte_acc_track(old_spte, new_spte, level);
        handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte, new_spte, level);
}

in addition to the previously-mentioned cleanup of always calling
handle_changed_spte instead of special-casing calls to two of the
three functions.  It would be a nice place to add the
trace_kvm_mmu_set_spte tracepoint, too.
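
With such a helper, the two call sites in set_tdp_spte would collapse to
something like this (a sketch using the names above, not tested):

        /* Zap: record the old SPTE and report 0 as the new value. */
        change_spte(kvm, as_id, iter.gfn, iter.sptep, 0, iter.level);

        /* Remap at the new pfn, read-only: */
        change_spte(kvm, as_id, iter.gfn, iter.sptep, new_spte, iter.level);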

Paolo
Ben Gardon Oct. 7, 2020, 4:53 p.m. UTC | #3
On Mon, Sep 28, 2020 at 8:11 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 25/09/20 23:22, Ben Gardon wrote:
> > +             *iter.sptep = 0;
> > +             handle_changed_spte(kvm, as_id, iter.gfn, iter.old_spte,
> > +                                 new_spte, iter.level);
> > +
>
> Can you explain why new_spte is passed here instead of 0?

That's just a bug. Thank you for catching it.
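The zapped value (0) should be passed there instead, i.e. something like:

	*iter.sptep = 0;
	handle_changed_spte(kvm, as_id, iter.gfn, iter.old_spte,
			    0, iter.level);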

>
> All calls to handle_changed_spte are preceded by "*something =
> new_spte" except this one, so I'm thinking of having a change_spte
> function like
>
> static void change_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>                         u64 *sptep, u64 new_spte, int level)
> {
>         u64 old_spte = *sptep;
>         *sptep = new_spte;
>
>         __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level);
>         handle_changed_spte_acc_track(old_spte, new_spte, level);
>         handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte, new_spte, level);
> }
>
> in addition to the previously-mentioned cleanup of always calling
> handle_changed_spte instead of special-casing calls to two of the
> three functions.  It would be a nice place to add the
> trace_kvm_mmu_set_spte tracepoint, too.

I'm not sure we can avoid special casing calls to the access tracking
and dirty logging handler functions. At least in the past that's
created bugs with things being marked dirty or accessed when they
shouldn't be. I'll revisit those assumptions. It would certainly be
nice to get rid of that complexity.
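
If the special cases do turn out to be necessary, one option might be to
give the helper explicit opt-outs instead of open-coding the call sites,
e.g. (purely a sketch on top of your change_spte above):

	static void change_spte(struct kvm *kvm, int as_id, gfn_t gfn,
				u64 *sptep, u64 new_spte, int level,
				bool record_acc_track, bool record_dirty_log)
	{
		u64 old_spte = *sptep;

		*sptep = new_spte;

		__handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level);
		if (record_acc_track)
			handle_changed_spte_acc_track(old_spte, new_spte, level);
		if (record_dirty_log)
			handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
						      new_spte, level);
	}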

I agree that putting the SPTE assignment and handler functions in a
helper function would clean up the code. I'll do that. I got some
feedback on the RFC I sent last year which led me to open-code a lot
more, but I think this is still a good cleanup.

Re tracepoints, I was planning to just insert them all once this code
is stabilized, if that's alright.

>
> Paolo
>
Paolo Bonzini Oct. 7, 2020, 5:18 p.m. UTC | #4
On 07/10/20 18:53, Ben Gardon wrote:
>> in addition to the previously-mentioned cleanup of always calling
>> handle_changed_spte instead of special-casing calls to two of the
>> three functions.  It would be a nice place to add the
>> trace_kvm_mmu_set_spte tracepoint, too.
> I'm not sure we can avoid special casing calls to the access tracking
> and dirty logging handler functions. At least in the past that's
> created bugs with things being marked dirty or accessed when they
> shouldn't be. I'll revisit those assumptions. It would certainly be
> nice to get rid of that complexity.
> 
> I agree that putting the SPTE assignment and handler functions in a
> helper function would clean up the code. I'll do that.

Well that's not easy if you have to think of which functions have to be
called.

I'll take a closer look at the access tracking and dirty logging cases
to try and understand what those bugs can be.  Apart from that I have my
suggested changes and I can probably finish testing them and send them
out tomorrow.

Paolo

> I got some
> feedback on the RFC I sent last year which led me to open-code a lot
> more, but I think this is still a good cleanup.
Ben Gardon Oct. 7, 2020, 5:30 p.m. UTC | #5
On Wed, Oct 7, 2020 at 10:18 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 07/10/20 18:53, Ben Gardon wrote:
> >> in addition to the previously-mentioned cleanup of always calling
> >> handle_changed_spte instead of special-casing calls to two of the
> >> three functions.  It would be a nice place to add the
> >> trace_kvm_mmu_set_spte tracepoint, too.
> > I'm not sure we can avoid special casing calls to the access tracking
> > and dirty logging handler functions. At least in the past that's
> > created bugs with things being marked dirty or accessed when they
> > shouldn't be. I'll revisit those assumptions. It would certainly be
> > nice to get rid of that complexity.
> >
> > I agree that putting the SPTE assignment and handler functions in a
> > helper function would clean up the code. I'll do that.
>
> Well that's not easy if you have to think of which functions have to be
> called.
>
> I'll take a closer look at the access tracking and dirty logging cases
> to try and understand what those bugs can be.  Apart from that I have my
> suggested changes and I can probably finish testing them and send them
> out tomorrow.

Awesome, thank you. I'll look forward to seeing them. Will you be
applying those changes to the tdp_mmu branch you created as well?

>
> Paolo
>
> > I got some
> > feedback on the RFC I sent last year which led me to open-code a lot
> > more, but I think this is still a good cleanup.
>
Paolo Bonzini Oct. 7, 2020, 5:54 p.m. UTC | #6
On 07/10/20 19:30, Ben Gardon wrote:
>> Well that's not easy if you have to think of which functions have to be
>> called.
>>
>> I'll take a closer look at the access tracking and dirty logging cases
>> to try and understand what those bugs can be.  Apart from that I have my
>> suggested changes and I can probably finish testing them and send them
>> out tomorrow.
> Awesome, thank you. I'll look forward to seeing them. Will you be
> applying those changes to the tdp_mmu branch you created as well?
> 

No, only to the rebased version.

Paolo

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8c1e806b3d53f..0d80abe82ca93 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -122,9 +122,6 @@  module_param(dbg, bool, 0644);
 
 #define PTE_PREFETCH_NUM		8
 
-#define PT_FIRST_AVAIL_BITS_SHIFT 10
-#define PT64_SECOND_AVAIL_BITS_SHIFT 54
-
 /*
  * The mask used to denote special SPTEs, which can be either MMIO SPTEs or
  * Access Tracking SPTEs.
@@ -147,13 +144,6 @@  module_param(dbg, bool, 0644);
 #define PT32_INDEX(address, level)\
 	(((address) >> PT32_LEVEL_SHIFT(level)) & ((1 << PT32_LEVEL_BITS) - 1))
 
-
-#ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
-#define PT64_BASE_ADDR_MASK (physical_mask & ~(u64)(PAGE_SIZE-1))
-#else
-#define PT64_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
-#endif
-
 #define PT32_BASE_ADDR_MASK PAGE_MASK
 #define PT32_DIR_BASE_ADDR_MASK \
 	(PAGE_MASK & ~((1ULL << (PAGE_SHIFT + PT32_LEVEL_BITS)) - 1))
@@ -170,9 +160,6 @@  module_param(dbg, bool, 0644);
 
 #include <trace/events/kvm.h>
 
-#define SPTE_HOST_WRITEABLE	(1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
-#define SPTE_MMU_WRITEABLE	(1ULL << (PT_FIRST_AVAIL_BITS_SHIFT + 1))
-
 /* make pte_list_desc fit well in cache line */
 #define PTE_LIST_EXT 3
 
@@ -1708,6 +1695,21 @@  static int kvm_unmap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 	return kvm_zap_rmapp(kvm, rmap_head);
 }
 
+u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn)
+{
+	u64 new_spte;
+
+	new_spte = old_spte & ~PT64_BASE_ADDR_MASK;
+	new_spte |= (u64)new_pfn << PAGE_SHIFT;
+
+	new_spte &= ~PT_WRITABLE_MASK;
+	new_spte &= ~SPTE_HOST_WRITEABLE;
+
+	new_spte = mark_spte_for_access_track(new_spte);
+
+	return new_spte;
+}
+
 static int kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 			     struct kvm_memory_slot *slot, gfn_t gfn, int level,
 			     unsigned long data)
@@ -1733,13 +1735,8 @@  static int kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 			pte_list_remove(rmap_head, sptep);
 			goto restart;
 		} else {
-			new_spte = *sptep & ~PT64_BASE_ADDR_MASK;
-			new_spte |= (u64)new_pfn << PAGE_SHIFT;
-
-			new_spte &= ~PT_WRITABLE_MASK;
-			new_spte &= ~SPTE_HOST_WRITEABLE;
-
-			new_spte = mark_spte_for_access_track(new_spte);
+			new_spte = kvm_mmu_changed_pte_notifier_make_spte(
+					*sptep, new_pfn);
 
 			mmu_spte_clear_track_bits(sptep);
 			mmu_spte_set(sptep, new_spte);
@@ -1895,7 +1892,14 @@  int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end,
 
 int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
 {
-	return kvm_handle_hva(kvm, hva, (unsigned long)&pte, kvm_set_pte_rmapp);
+	int r;
+
+	r = kvm_handle_hva(kvm, hva, (unsigned long)&pte, kvm_set_pte_rmapp);
+
+	if (kvm->arch.tdp_mmu_enabled)
+		r |= kvm_tdp_mmu_set_spte_hva(kvm, hva, &pte);
+
+	return r;
 }
 
 static int kvm_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 228bda0885552..8eaa6e4764bce 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -80,6 +80,12 @@  bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 	(PT64_BASE_ADDR_MASK & ((1ULL << (PAGE_SHIFT + (((level) - 1) \
 						* PT64_LEVEL_BITS))) - 1))
 
+#ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
+#define PT64_BASE_ADDR_MASK (physical_mask & ~(u64)(PAGE_SIZE-1))
+#else
+#define PT64_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
+#endif
+
 extern u64 shadow_user_mask;
 extern u64 shadow_accessed_mask;
 extern u64 shadow_present_mask;
@@ -89,6 +95,12 @@  extern u64 shadow_present_mask;
 #define ACC_USER_MASK    PT_USER_MASK
 #define ACC_ALL          (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)
 
+#define PT_FIRST_AVAIL_BITS_SHIFT 10
+#define PT64_SECOND_AVAIL_BITS_SHIFT 54
+
+#define SPTE_HOST_WRITEABLE	(1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
+#define SPTE_MMU_WRITEABLE	(1ULL << (PT_FIRST_AVAIL_BITS_SHIFT + 1))
+
 /* Functions for interpreting SPTEs */
 kvm_pfn_t spte_to_pfn(u64 pte);
 bool is_mmio_spte(u64 spte);
@@ -138,5 +150,6 @@  bool is_nx_huge_page_enabled(void);
 void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 
 u64 mark_spte_for_access_track(u64 spte);
+u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn);
 
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 0a4b98669b3ef..3119583409131 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -722,3 +722,64 @@  int kvm_tdp_mmu_test_age_hva(struct kvm *kvm, unsigned long hva)
 	return kvm_tdp_mmu_handle_hva_range(kvm, hva, hva + 1, 0,
 					    test_age_gfn);
 }
+
+/*
+ * Handle the changed_pte MMU notifier for the TDP MMU.
+ * data is a pointer to the new pte_t mapping the HVA specified by the MMU
+ * notifier.
+ * Returns non-zero if a flush is needed before releasing the MMU lock.
+ */
+static int set_tdp_spte(struct kvm *kvm, struct kvm_memory_slot *slot,
+			struct kvm_mmu_page *root, gfn_t gfn, gfn_t unused,
+			unsigned long data)
+{
+	struct tdp_iter iter;
+	pte_t *ptep = (pte_t *)data;
+	kvm_pfn_t new_pfn;
+	u64 new_spte;
+	int need_flush = 0;
+	int as_id = kvm_mmu_page_as_id(root);
+
+	WARN_ON(pte_huge(*ptep));
+
+	new_pfn = pte_pfn(*ptep);
+
+	for_each_tdp_pte_root(iter, root, gfn, gfn + 1) {
+		if (iter.level != PG_LEVEL_4K)
+			continue;
+
+		if (!is_shadow_present_pte(iter.old_spte))
+			break;
+
+		*iter.sptep = 0;
+		handle_changed_spte(kvm, as_id, iter.gfn, iter.old_spte,
+				    new_spte, iter.level);
+
+		kvm_flush_remote_tlbs_with_address(kvm, iter.gfn, 1);
+
+		if (!pte_write(*ptep)) {
+			new_spte = kvm_mmu_changed_pte_notifier_make_spte(
+					iter.old_spte, new_pfn);
+
+			*iter.sptep = new_spte;
+			handle_changed_spte(kvm, as_id, iter.gfn, iter.old_spte,
+					    new_spte, iter.level);
+		}
+
+		need_flush = 1;
+	}
+
+	if (need_flush)
+		kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
+
+	return 0;
+}
+
+int kvm_tdp_mmu_set_spte_hva(struct kvm *kvm, unsigned long address,
+			     pte_t *host_ptep)
+{
+	return kvm_tdp_mmu_handle_hva_range(kvm, address, address + 1,
+					    (unsigned long)host_ptep,
+					    set_tdp_spte);
+}
+
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index f316773b7b5a8..5a399aa60b8d8 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -25,4 +25,7 @@  int kvm_tdp_mmu_zap_hva_range(struct kvm *kvm, unsigned long start,
 int kvm_tdp_mmu_age_hva_range(struct kvm *kvm, unsigned long start,
 			      unsigned long end);
 int kvm_tdp_mmu_test_age_hva(struct kvm *kvm, unsigned long hva);
+
+int kvm_tdp_mmu_set_spte_hva(struct kvm *kvm, unsigned long address,
+			     pte_t *host_ptep);
 #endif /* __KVM_X86_MMU_TDP_MMU_H */