diff mbox series

[RFC,v8,48/56] KVM: SVM: Add SNP-specific handling for memory attribute updates

Message ID 20230220183847.59159-49-michael.roth@amd.com (mailing list archive)
State New
Headers show
Series Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support | expand

Commit Message

Michael Roth Feb. 20, 2023, 6:38 p.m. UTC
This will handle RMP table updates and direct map changes needed for
page state conversions requested by userspace.

Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 arch/x86/kvm/svm/sev.c | 126 +++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.c |   1 +
 arch/x86/kvm/svm/svm.h |   2 +
 3 files changed, 129 insertions(+)

Comments

Dave Hansen March 1, 2023, 11:37 p.m. UTC | #1
On 2/20/23 10:38, Michael Roth wrote:
> +		/*
> +		 * TODO: The RMP entry's hugepage bit is ignored for
> +		 * shared/unassigned pages. Either handle looping through each
> +		 * sub-page as part of snp_make_page_shared(), or remove the
> +		 * level argument.
> +		 */
> +		if (op == SNP_PAGE_STATE_PRIVATE && order &&
> +		    IS_ALIGNED(gfn, 1 << order) && (gfn + (1 << order)) <= end) {
> +			level = order_to_level(order);
> +			npages = 1 << order;
> +		}

That's a wee bit obtuse.

First of all, I assume that the 'RFC' is because of these TODOs and they
won't survive to the point when you ask for this to be merged.

BTW, what keeps the restrictedmem_get_page() offset and the gfn aligned?

Let's start with this:

> +static inline u8 order_to_level(int order)
> +{
> +	BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> +
> +	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
> +		return PG_LEVEL_1G;
> +
> +	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
> +		return PG_LEVEL_2M;
> +
> +	return PG_LEVEL_4K;
> +}

Right now, 'order' comes only from restrictedmem_get_page(), which I dug
out of:

> https://github.com/mdroth/linux/commit/c6792672cd11737fd255dff10b2d5b6bccc626a8

That order is *only* filled in by THPs.  That makes the PG_LEVEL_1G
stuff here kinda silly.  I guess it might be seen as thorough, but it's
dead code.  I'd probably just make this work on order==9 || order==0 and
warn on anything else.

I'd also highly recommend some comments about how racy this all is.  I
guess it probably works, but it would be good to add some comments about
page splits and collapsing.

It's also not obvious why this only cares about private pages.

Anyway, this is the exact kind of thing where I really like a
well-commented helper:

bool can_install_large_rmp_entry(gfn, order)
{
	// small pages, blah blah
	if (!order)
		return false;

	// The region being updated must be aligned
	if (!IS_ALIGNED(gfn, 1 << order))
		return false;
	// ... and fit
	if (gfn + (1 << order)) > end)
		return false;

	return true;	
}

Which gets used like this:

	if (op == SNP_PAGE_STATE_PRIVATE &&
	    can_install_large_rmp_entry(gfn, order)) {
		level = ...
	}
Michael Roth April 5, 2023, 11:48 p.m. UTC | #2
On Wed, Mar 01, 2023 at 03:37:00PM -0800, Dave Hansen wrote:
> On 2/20/23 10:38, Michael Roth wrote:
> > +		/*
> > +		 * TODO: The RMP entry's hugepage bit is ignored for
> > +		 * shared/unassigned pages. Either handle looping through each
> > +		 * sub-page as part of snp_make_page_shared(), or remove the
> > +		 * level argument.
> > +		 */
> > +		if (op == SNP_PAGE_STATE_PRIVATE && order &&
> > +		    IS_ALIGNED(gfn, 1 << order) && (gfn + (1 << order)) <= end) {
> > +			level = order_to_level(order);
> > +			npages = 1 << order;
> > +		}
> 
> That's a wee bit obtuse.
> 
> First of all, I assume that the 'RFC' is because of these TODOs and they
> won't survive to the point when you ask for this to be merged.

Yes, will make sure to have all the TODOs in the tree addressed before
dropping the RFC tag.

> 
> BTW, what keeps the restrictedmem_get_page() offset and the gfn aligned?

I don't think anything enforces that currently, but there is a TODO in the
UPM v10 patchset to enforce that:

  https://github.com/AMDESE/linux/commit/5c86db7f98701f614c48946b733f2542c962f139#diff-e7514a224c92c2e47224f99919405a37ee7edc4612953135229cfb6e07a680d8R2131

So currently this patch relies on the following:

 - the fact that the memslot alignment/sizes for a standard x86 guest's
   associated memory regions are 2M-aligned, so when they are bound to a
   restrictedmem FD they are naturally packed in at restrictedmem offsets
   that are also 2M-aligned. But of course we can't assume userspace will
   live up to this assumption and need the above TODO in KVM to enforce
   this when registering new memslots.

 - that restrictedmem/shmem will ensure that THPs are only allocated for
   restrictedmem offsets that are 2M-aligned. I think this enforcement
   happens in shmem_alloc_hugefolio().

which both seem to hold in testing. But it's probably a good idea to add
an explicit check for this, at least until KVM implements something to
enforce this earlier in guest life-cycle.

> 
> Let's start with this:
> 
> > +static inline u8 order_to_level(int order)
> > +{
> > +	BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> > +
> > +	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
> > +		return PG_LEVEL_1G;
> > +
> > +	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
> > +		return PG_LEVEL_2M;
> > +
> > +	return PG_LEVEL_4K;
> > +}
> 
> Right now, 'order' comes only from restrictedmem_get_page(), which I dug
> out of:
> 
> > https://github.com/mdroth/linux/commit/c6792672cd11737fd255dff10b2d5b6bccc626a8
> 
> That order is *only* filled in by THPs.  That makes the PG_LEVEL_1G
> stuff here kinda silly.  I guess it might be seen as thorough, but it's
> dead code.  I'd probably just make this work on order==9 || order==0 and
> warn on anything else.

Ok, makes sense.

> 
> I'd also highly recommend some comments about how racy this all is.  I
> guess it probably works, but it would be good to add some comments about
> page splits and collapsing.

Collapsing while in this code path should be ok since the 4K sub-pages will
just end up getting mapped as 4K in RMP table. KVM MMU will then map
them into nested page table as 4K as well and we'll get non-optimal
performance, but things should otherwise work.

Splitting is a similar story: if we map as 2M in RMP table, and then
afterward the page gets split, then KVM MMU during fault time would map
the pages in the NPT as 4K, and when the guest attempts to access
private pages of this sort they'll generate a nested page fault with
PFERR_GUEST_RMP_BIT and PFERR_GUEST_SIZEM_BIT set, and the code in
handle_rmp_page_fault() will issue a PSMASH instruction to split the
2M RMP entry into 512 4K RMP entries.

Will add some comments around this.

> 
> It's also not obvious why this only cares about private pages.

Mainly because the shared memory that actually get mapped into the guest
is always shared in the RMP table. It is normal VMA memory that is not
allocated by UPM/restrictedmem. We will never attempt to make them
private, so there is never a need to bother with switching them back to
shared.

So we only need to handle RMP updates for the UPM/restrictedmem PFNs.
Obviously for shared->private conversion before mapping it into the
guest, but also for private->shared conversion since we will still
get RMP check failures if we try to leave the PFNs as private in the
RMP table and map the above-mentioned VMA memory into the guest
instead.

Will add some more comments around this.

> 
> Anyway, this is the exact kind of thing where I really like a
> well-commented helper:
> 
> bool can_install_large_rmp_entry(gfn, order)
> {
> 	// small pages, blah blah
> 	if (!order)
> 		return false;
> 
> 	// The region being updated must be aligned
> 	if (!IS_ALIGNED(gfn, 1 << order))
> 		return false;
> 	// ... and fit
> 	if (gfn + (1 << order)) > end)
> 		return false;
> 
> 	return true;	
> }
> 
> Which gets used like this:
> 
> 	if (op == SNP_PAGE_STATE_PRIVATE &&
> 	    can_install_large_rmp_entry(gfn, order)) {
> 		level = ...
> 	}

Makes sense, will implement something along these lines.

Thanks!

-Mike
diff mbox series

Patch

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index b2f1a12685ed..73d614c538da 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3381,6 +3381,31 @@  static int snp_rmptable_psmash(struct kvm *kvm, kvm_pfn_t pfn)
 	return psmash(pfn);
 }
 
+static int snp_make_page_shared(struct kvm *kvm, gpa_t gpa, kvm_pfn_t pfn, int level)
+{
+	int rc, rmp_level;
+
+	rc = snp_lookup_rmpentry(pfn, &rmp_level);
+	if (rc < 0)
+		return -EINVAL;
+
+	/* If page is not assigned then do nothing */
+	if (!rc)
+		return 0;
+
+	/*
+	 * Is the page part of an existing 2MB RMP entry ? Split the 2MB into
+	 * multiple of 4K-page before making the memory shared.
+	 */
+	if (level == PG_LEVEL_4K && rmp_level == PG_LEVEL_2M) {
+		rc = snp_rmptable_psmash(kvm, pfn);
+		if (rc)
+			return rc;
+	}
+
+	return rmp_make_shared(pfn, level);
+}
+
 /*
  * TODO: need to get the value set by userspace in vcpu->run->vmgexit.ghcb_msr
  * and process that here accordingly.
@@ -4373,3 +4398,104 @@  void handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
 	kvm_zap_gfn_range(kvm, gfn, gfn + PTRS_PER_PMD);
 	put_page(pfn_to_page(pfn));
 }
+
+static inline u8 order_to_level(int order)
+{
+	BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
+
+	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
+		return PG_LEVEL_1G;
+
+	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
+		return PG_LEVEL_2M;
+
+	return PG_LEVEL_4K;
+}
+
+int sev_update_mem_attr(struct kvm_memory_slot *slot, unsigned int attr,
+			gfn_t start, gfn_t end)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(slot->kvm)->sev_info;
+	enum psc_op op = (attr & KVM_MEMORY_ATTRIBUTE_PRIVATE) ? SNP_PAGE_STATE_PRIVATE
+							       : SNP_PAGE_STATE_SHARED;
+	gfn_t gfn = start;
+
+	pr_debug("%s: GFN 0x%llx - 0x%llx, op: %d\n", __func__, start, end, op);
+
+	if (!sev_snp_guest(slot->kvm))
+		return 0;
+
+	if (!kvm_slot_can_be_private(slot)) {
+		pr_err_ratelimited("%s: memslot for gfn: 0x%llx is not private.\n",
+				   __func__, gfn);
+		return -EPERM;
+	}
+
+	while (gfn < end) {
+		kvm_pfn_t pfn;
+		int level = PG_LEVEL_4K; /* TODO: take actual order into account */
+		gpa_t gpa = gfn_to_gpa(gfn);
+		int npages = 1;
+		int order;
+		int rc;
+
+		/*
+		 * No work to do if there was never a page allocated from private
+		 * memory. If there was a page that was deallocated previously,
+		 * the invalidation notifier should have restored the page to
+		 * shared.
+		 */
+		rc = kvm_restrictedmem_get_pfn(slot, gfn, &pfn, &order);
+		if (rc) {
+			pr_warn_ratelimited("%s: failed to retrieve gfn 0x%llx from private FD\n",
+					    __func__, gfn);
+			gfn++;
+			continue;
+		}
+
+		/*
+		 * TODO: The RMP entry's hugepage bit is ignored for
+		 * shared/unassigned pages. Either handle looping through each
+		 * sub-page as part of snp_make_page_shared(), or remove the
+		 * level argument.
+		 */
+		if (op == SNP_PAGE_STATE_PRIVATE && order &&
+		    IS_ALIGNED(gfn, 1 << order) && (gfn + (1 << order)) <= end) {
+			level = order_to_level(order);
+			npages = 1 << order;
+		}
+
+		/*
+		 * Grab the PFN from private memslot and update the RMP entry.
+		 * It may be worthwhile to go ahead and map it into the TDP at
+		 * this point if the guest is doing lazy acceptance, but for
+		 * up-front bulk shared->private conversions it's not likely
+		 * the guest will try to access the PFN any time soon, so for
+		 * now just take the let KVM MMU handle faulting it on the next
+		 * access.
+		 */
+		switch (op) {
+		case SNP_PAGE_STATE_SHARED:
+			rc = snp_make_page_shared(slot->kvm, gpa, pfn, level);
+			break;
+		case SNP_PAGE_STATE_PRIVATE:
+			rc = rmp_make_private(pfn, gpa, level, sev->asid, false);
+			break;
+		default:
+			rc = PSC_INVALID_ENTRY;
+			break;
+		}
+
+		put_page(pfn_to_page(pfn));
+
+		if (rc) {
+			pr_err_ratelimited("%s: failed op %d gpa %llx pfn %llx level %d rc %d\n",
+					   __func__, op, gpa, pfn, level, rc);
+			return -EINVAL;
+		}
+
+		gfn += npages;
+	}
+
+	return 0;
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 539926b07ee5..e2edc4700e55 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4860,6 +4860,7 @@  static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.alloc_apic_backing_page = svm_alloc_apic_backing_page,
 
 	.adjust_mapping_level = sev_adjust_mapping_level,
+	.update_mem_attr = sev_update_mem_attr,
 };
 
 /*
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 37bd7b728d52..50a2bcaf3fd7 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -725,6 +725,8 @@  struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu);
 void sev_adjust_mapping_level(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int *level);
 void handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code);
 void sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu);
+int sev_update_mem_attr(struct kvm_memory_slot *slot, unsigned int attr,
+			gfn_t start, gfn_t end);
 
 /* vmenter.S */