[v11,18/35] KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command

Message ID	20231230172351.574091-19-michael.roth@amd.com (mailing list archive)
State	New, archived
Headers	show Received: from NAM12-MW2-obe.outbound.protection.outlook.com (mail-mw2nam12on2049.outbound.protection.outlook.com [40.107.244.49]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4426A14A8A; Sat, 30 Dec 2023 17:27:54 +0000 (UTC) Received-SPF: Pass (protection.outlook.com: domain of amd.com designates 165.204.84.17 as permitted sender) receiver=protection.outlook.com; client-ip=165.204.84.17; helo=SATLEXMB04.amd.com; pr=C From: Michael Roth <michael.roth@amd.com> To: <kvm@vger.kernel.org> CC: <linux-coco@lists.linux.dev>, <linux-mm@kvack.org>, <linux-crypto@vger.kernel.org>, <x86@kernel.org>, <linux-kernel@vger.kernel.org>, <tglx@linutronix.de>, <mingo@redhat.com>, <jroedel@suse.de>, <thomas.lendacky@amd.com>, <hpa@zytor.com>, <ardb@kernel.org>, <pbonzini@redhat.com>, <seanjc@google.com>, <vkuznets@redhat.com>, <jmattson@google.com>, <luto@kernel.org>, <dave.hansen@linux.intel.com>, <slp@redhat.com>, <pgonda@google.com>, <peterz@infradead.org>, <srinivas.pandruvada@linux.intel.com>, <rientjes@google.com>, <dovmurik@linux.ibm.com>, <tobin@ibm.com>, <bp@alien8.de>, <vbabka@suse.cz>, <kirill@shutemov.name>, <ak@linux.intel.com>, <tony.luck@intel.com>, <sathyanarayanan.kuppuswamy@linux.intel.com>, <alpergun@google.com>, <jarkko@kernel.org>, <ashish.kalra@amd.com>, <nikunj.dadhania@amd.com>, <pankaj.gupta@amd.com>, <liam.merwick@oracle.com>, <zhi.a.wang@intel.com>, Brijesh Singh <brijesh.singh@amd.com> Subject: [PATCH v11 18/35] KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command Date: Sat, 30 Dec 2023 11:23:34 -0600 Message-ID: <20231230172351.574091-19-michael.roth@amd.com> In-Reply-To: <20231230172351.574091-1-michael.roth@amd.com> References: <20231230172351.574091-1-michael.roth@amd.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain
Series	Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support \| expand [v11,00/35] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support [v11,01/35] KVM: Add hugepage support for dedicated guest memory [v11,02/35] mm: Introduce AS_INACCESSIBLE for encrypted/confidential memory [v11,03/35] KVM: Use AS_INACCESSIBLE when creating guest_memfd inode [v11,04/35] KVM: x86: Add gmem hook for initializing memory [v11,05/35] KVM: x86: Add gmem hook for invalidating memory [v11,06/35] KVM: x86/mmu: Pass around full 64-bit error code for KVM page faults [v11,07/35] KVM: x86: Add KVM_X86_SNP_VM vm_type [v11,08/35] KVM: x86: Define RMP page fault error bits for #NPF [v11,09/35] KVM: x86: Determine shared/private faults based on vm_type [v11,10/35] KVM: SEV: Do not intercept accesses to MSR_IA32_XSS for SEV-ES guests [v11,11/35] KVM: SEV: Select KVM_GENERIC_PRIVATE_MEM when CONFIG_KVM_AMD_SEV=y [v11,12/35] KVM: SEV: Add support to handle AP reset MSR protocol [v11,13/35] KVM: SEV: Add GHCB handling for Hypervisor Feature Support requests [v11,14/35] KVM: SEV: Add initial SEV-SNP support [v11,15/35] KVM: SEV: Add KVM_SNP_INIT command [v11,16/35] KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command [v11,17/35] KVM: Add HVA range operator [v11,18/35] KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command [v11,19/35] KVM: SEV: Add KVM_SEV_SNP_LAUNCH_FINISH command [v11,20/35] KVM: SEV: Add support to handle GHCB GPA register VMGEXIT [v11,21/35] KVM: SEV: Add support to handle MSR based Page State Change VMGEXIT [v11,22/35] KVM: SEV: Add support to handle Page State Change VMGEXIT [v11,23/35] KVM: x86: Export the kvm_zap_gfn_range() for the SNP use [v11,24/35] KVM: SEV: Add support to handle RMP nested page faults [v11,25/35] KVM: SEV: Use a VMSA physical address variable for populating VMCB [v11,26/35] KVM: SEV: Support SEV-SNP AP Creation NAE event [v11,27/35] KVM: SEV: Add support for GHCB-based termination requests [v11,28/35] KVM: SEV: Implement gmem hook for initializing private pages [v11,29/35] KVM: SEV: Implement gmem hook for invalidating private pages [v11,30/35] KVM: x86: Add gmem hook for determining max NPT mapping level [v11,31/35] KVM: SEV: Avoid WBINVD for HVA-based MMU notifications for SNP [v11,32/35] KVM: SVM: Add module parameter to enable the SEV-SNP [v11,33/35] KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event [v11,34/35] crypto: ccp: Add the SNP_SET_CONFIG_{START,END} commands [v11,35/35] KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event

Michael Roth Dec. 30, 2023, 5:23 p.m. UTC

From: Brijesh Singh <brijesh.singh@amd.com>

The KVM_SEV_SNP_LAUNCH_UPDATE command can be used to insert data into
the guest's memory. The data is encrypted with the cryptographic context
created with the KVM_SEV_SNP_LAUNCH_START.

In addition to the inserting data, it can insert a two special pages
into the guests memory: the secrets page and the CPUID page.

While terminating the guest, reclaim the guest pages added in the RMP
table. If the reclaim fails, then the page is no longer safe to be
released back to the system and leak them.

For more information see the SEV-SNP specification.

Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 .../virt/kvm/x86/amd-memory-encryption.rst    |  28 +++
 arch/x86/kvm/svm/sev.c                        | 181 ++++++++++++++++++
 include/uapi/linux/kvm.h                      |  19 ++
 3 files changed, 228 insertions(+)

Sean Christopherson Jan. 10, 2024, 3:45 p.m. UTC | #1

On Sat, Dec 30, 2023, Michael Roth wrote:
> From: Brijesh Singh <brijesh.singh@amd.com>
> 
> The KVM_SEV_SNP_LAUNCH_UPDATE command can be used to insert data into
> the guest's memory. The data is encrypted with the cryptographic context
> created with the KVM_SEV_SNP_LAUNCH_START.
> 
> In addition to the inserting data, it can insert a two special pages
> into the guests memory: the secrets page and the CPUID page.
> 
> While terminating the guest, reclaim the guest pages added in the RMP
> table. If the reclaim fails, then the page is no longer safe to be
> released back to the system and leak them.
> 
> For more information see the SEV-SNP specification.

Please rewrite all changelogs to explain what *KVM* support is being added, why
the proposed uAPI looks like it does, and how the new uAPI is intended be used.

Porividing a crash course on the relevant hardware behavior is definitely helpful,
but the changelog absolutely needs to explain/justify the patch.

> Co-developed-by: Michael Roth <michael.roth@amd.com>
> Signed-off-by: Michael Roth <michael.roth@amd.com>
> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
> ---
>  .../virt/kvm/x86/amd-memory-encryption.rst    |  28 +++
>  arch/x86/kvm/svm/sev.c                        | 181 ++++++++++++++++++
>  include/uapi/linux/kvm.h                      |  19 ++
>  3 files changed, 228 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> index b1beb2fe8766..d4325b26724c 100644
> --- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> +++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> @@ -485,6 +485,34 @@ Returns: 0 on success, -negative on error
>  
>  See the SEV-SNP specification for further detail on the launch input.
>  
> +20. KVM_SNP_LAUNCH_UPDATE
> +-------------------------
> +
> +The KVM_SNP_LAUNCH_UPDATE is used for encrypting a memory region. It also
> +calculates a measurement of the memory contents. The measurement is a signature
> +of the memory contents that can be sent to the guest owner as an attestation
> +that the memory was encrypted correctly by the firmware.
> +
> +Parameters (in): struct  kvm_snp_launch_update
> +
> +Returns: 0 on success, -negative on error
> +
> +::
> +
> +        struct kvm_sev_snp_launch_update {
> +                __u64 start_gfn;        /* Guest page number to start from. */
> +                __u64 uaddr;            /* userspace address need to be encrypted */

Huh?  Why is KVM taking a userspace address?  IIUC, the address unconditionally
gets translated into a gfn, so why not pass a gfn?

And speaking of gfns, AFAICT start_gfn is never used.

Oof, reading more of the code, this *requires* an effective in-place copy-and-convert
of guest memory.

> +                __u32 len;              /* length of memory region */

Bytes?  Pages?  One field above operates on frame numbers, one apparently operates
on a byte-granularity address.

> +                __u8 imi_page;          /* 1 if memory is part of the IMI */

What's "the IMI"?  Initial Measurement Image?  I assume this is essentially just
a flag that communicates whether or not the page should be measured?

> +                __u8 page_type;         /* page type */
> +                __u8 vmpl3_perms;       /* VMPL3 permission mask */
> +                __u8 vmpl2_perms;       /* VMPL2 permission mask */
> +                __u8 vmpl1_perms;       /* VMPL1 permission mask */

Why?  KVM doesn't support VMPLs.

> +static int snp_page_reclaim(u64 pfn)
> +{
> +	struct sev_data_snp_page_reclaim data = {0};
> +	int err, rc;
> +
> +	data.paddr = __sme_set(pfn << PAGE_SHIFT);
> +	rc = sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, &data, &err);
> +	if (rc) {
> +		/*
> +		 * If the reclaim failed, then page is no longer safe
> +		 * to use.

Uh, why can reclaim fail, and why does the kernel apparently not care about
leaking pages?  AFAICT, nothing ever complains beyond a pr_debug.  That sounds
bonkers to me, i.e. at the very minimum, why doesn't this warrant a WARN_ON_ONCE?

> +		 */
> +		snp_leak_pages(pfn, 1);
> +	}
> +
> +	return rc;
> +}
> +
> +static int host_rmp_make_shared(u64 pfn, enum pg_level level, bool leak)
> +{
> +	int rc;
> +
> +	rc = rmp_make_shared(pfn, level);
> +	if (rc && leak)
> +		snp_leak_pages(pfn,
> +			       page_level_size(level) >> PAGE_SHIFT);

Completely unnecessary wrap.

> +
> +	return rc;
> +}
> +
>  static void sev_unbind_asid(struct kvm *kvm, unsigned int handle)
>  {
>  	struct sev_data_deactivate deactivate;
> @@ -1990,6 +2020,154 @@ static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
>  	return rc;
>  }
>  
> +static int snp_launch_update_gfn_handler(struct kvm *kvm,
> +					 struct kvm_gfn_range *range,
> +					 void *opaque)
> +{
> +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> +	struct kvm_memory_slot *memslot = range->slot;
> +	struct sev_data_snp_launch_update data = {0};
> +	struct kvm_sev_snp_launch_update params;
> +	struct kvm_sev_cmd *argp = opaque;
> +	int *error = &argp->error;
> +	int i, n = 0, ret = 0;
> +	unsigned long npages;
> +	kvm_pfn_t *pfns;
> +	gfn_t gfn;
> +
> +	if (!kvm_slot_can_be_private(memslot)) {
> +		pr_err("SEV-SNP requires private memory support via guest_memfd.\n");

Yeah, no.  Sprinkling pr_err() all over the place in user-triggerable error paths
is not acceptable.  I get that it's often hard to extract "what went wrong" out
of the kernel, but adding pr_err() everywhere is not a viable solution.

> +		return -EINVAL;
> +	}
> +
> +	if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params))) {
> +		pr_err("Failed to copy user parameters for SEV-SNP launch.\n");
> +		return -EFAULT;
> +	}
> +
> +	data.gctx_paddr = __psp_pa(sev->snp_context);
> +
> +	npages = range->end - range->start;
> +	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL_ACCOUNT);
> +	if (!pfns)
> +		return -ENOMEM;
> +
> +	pr_debug("%s: GFN range 0x%llx-0x%llx, type %d\n", __func__,
> +		 range->start, range->end, params.page_type);
> +
> +	for (gfn = range->start, i = 0; gfn < range->end; gfn++, i++) {
> +		int order, level;
> +		bool assigned;
> +		void *kvaddr;
> +
> +		ret = __kvm_gmem_get_pfn(kvm, memslot, gfn, &pfns[i], &order, false);
> +		if (ret)
> +			goto e_release;
> +
> +		n++;
> +		ret = snp_lookup_rmpentry((u64)pfns[i], &assigned, &level);
> +		if (ret || assigned) {
> +			pr_err("Failed to ensure GFN 0x%llx is in initial shared state, ret: %d, assigned: %d\n",
> +			       gfn, ret, assigned);
> +			return -EFAULT;
> +		}
> +
> +		kvaddr = pfn_to_kaddr(pfns[i]);
> +		if (!virt_addr_valid(kvaddr)) {

I really, really don't like that this assume guest_memfd is backed by struct page.

> +			pr_err("Invalid HVA 0x%llx for GFN 0x%llx\n", (uint64_t)kvaddr, gfn);
> +			ret = -EINVAL;
> +			goto e_release;
> +		}
> +
> +		ret = kvm_read_guest_page(kvm, gfn, kvaddr, 0, PAGE_SIZE);

Good gravy.  If I'm reading this correctly, KVM:

  1. Translates an HVA into a GFN.
  2. Gets the PFN for that GFN from guest_memfd
  3. Verifies the PFN is not assigned to the guest
  4. Copies memory from the shared memslot page to the guest_memfd page
  5. Converts the page to private and asks the PSP to encrypt it

(a) As above, why is #1 a thing?
(b) Why are KVM's memory attributes never consulted?
(c) What prevents TOCTOU issues with respect to the RMP?
(d) Why is *KVM* copying memory into guest_memfd?
(e) What guarantees the direct map is valid for guest_memfd?
(f) Why does KVM's uAPI *require* the source page to come from the same memslot?

> +		if (ret) {
> +			pr_err("Guest read failed, ret: 0x%x\n", ret);
> +			goto e_release;
> +		}
> +
> +		ret = rmp_make_private(pfns[i], gfn << PAGE_SHIFT, PG_LEVEL_4K,
> +				       sev_get_asid(kvm), true);
> +		if (ret) {
> +			ret = -EFAULT;
> +			goto e_release;
> +		}
> +
> +		data.address = __sme_set(pfns[i] << PAGE_SHIFT);
> +		data.page_size = PG_LEVEL_TO_RMP(PG_LEVEL_4K);
> +		data.page_type = params.page_type;
> +		data.vmpl3_perms = params.vmpl3_perms;
> +		data.vmpl2_perms = params.vmpl2_perms;
> +		data.vmpl1_perms = params.vmpl1_perms;
> +		ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE,
> +				      &data, error);
> +		if (ret) {
> +			pr_err("SEV-SNP launch update failed, ret: 0x%x, fw_error: 0x%x\n",
> +			       ret, *error);
> +			snp_page_reclaim(pfns[i]);
> +
> +			/*
> +			 * When invalid CPUID function entries are detected, the firmware
> +			 * corrects these entries for debugging purpose and leaves the
> +			 * page unencrypted so it can be provided users for debugging
> +			 * and error-reporting.
> +			 *
> +			 * Copy the corrected CPUID page back to shared memory so
> +			 * userpsace can retrieve this information.

Why?  IIUC, this is basically backdooring reads/writes into guest_memfd to avoid
having to add proper mmap() support.
 
> +			 */
> +			if (params.page_type == SNP_PAGE_TYPE_CPUID &&
> +			    *error == SEV_RET_INVALID_PARAM) {
> +				int ret;

Ugh, do not shadow variables.

> +
> +				host_rmp_make_shared(pfns[i], PG_LEVEL_4K, true);
> +
> +				ret = kvm_write_guest_page(kvm, gfn, kvaddr, 0, PAGE_SIZE);
> +				if (ret)
> +					pr_err("Failed to write CPUID page back to userspace, ret: 0x%x\n",
> +					       ret);
> +			}
> +
> +			goto e_release;
> +		}
> +	}
> +
> +e_release:
> +	/* Content of memory is updated, mark pages dirty */
> +	for (i = 0; i < n; i++) {
> +		set_page_dirty(pfn_to_page(pfns[i]));
> +		mark_page_accessed(pfn_to_page(pfns[i]));
> +
> +		/*
> +		 * If its an error, then update RMP entry to change page ownership
> +		 * to the hypervisor.
> +		 */
> +		if (ret)
> +			host_rmp_make_shared(pfns[i], PG_LEVEL_4K, true);
> +
> +		put_page(pfn_to_page(pfns[i]));
> +	}
> +
> +	kvfree(pfns);

Saving PFNs from guest_memfd, which is fully owned by KVM, is so unnecessarily
complex.  Add a guest_memfd API (or three) to do this safely, e.g. to lock the
pages, do (and track) the RMP conversion, etc...

Michael Roth Jan. 16, 2024, 4:14 a.m. UTC | #2

On Wed, Jan 10, 2024 at 07:45:36AM -0800, Sean Christopherson wrote:
> On Sat, Dec 30, 2023, Michael Roth wrote:
> > diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> > index b1beb2fe8766..d4325b26724c 100644
> > --- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> > +++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> > @@ -485,6 +485,34 @@ Returns: 0 on success, -negative on error
> >  
> >  See the SEV-SNP specification for further detail on the launch input.
> >  
> > +20. KVM_SNP_LAUNCH_UPDATE
> > +-------------------------
> > +
> > +The KVM_SNP_LAUNCH_UPDATE is used for encrypting a memory region. It also
> > +calculates a measurement of the memory contents. The measurement is a signature
> > +of the memory contents that can be sent to the guest owner as an attestation
> > +that the memory was encrypted correctly by the firmware.
> > +
> > +Parameters (in): struct  kvm_snp_launch_update
> > +
> > +Returns: 0 on success, -negative on error
> > +
> > +::
> > +
> > +        struct kvm_sev_snp_launch_update {
> > +                __u64 start_gfn;        /* Guest page number to start from. */
> > +                __u64 uaddr;            /* userspace address need to be encrypted */
> 
> Huh?  Why is KVM taking a userspace address?  IIUC, the address unconditionally
> gets translated into a gfn, so why not pass a gfn?
> 
> And speaking of gfns, AFAICT start_gfn is never used.

I think having both the uaddr and start_gfn parameters makes sense. I
think it's only awkward because how I'm using the memslot to translate
the uaddr to a GFN in the current implementation, but:

 a) It's actually not a requirement that uaddr be associated with a
    memslot. It could just as easily be any random userspace address
    containing the payload that we want to copy into the actual gmem
    pages associated with start_gfn. I think TDX does something similar
    in that regard, and it makes sense to give VMMs the option of
    handling things that way.

 b) If we switch to just having start_gfn, and no uaddr, then things get
    awkward because then you really do need to have a memslot set up to
    get at the payload, and have some way of pre-populating the gmem pages
    prior to conversion, either the way the current code does it (via
    copying shared memory prior to conversion), or by having some way to
    populate the gmem pages directly, which is even more painful.

> 
> Oof, reading more of the code, this *requires* an effective in-place copy-and-convert
> of guest memory.

Yes, I'm having some trouble locating the various threads, but initially
there were some discussions about having a userspace API that allows for
UPM/gmem pages to be pre-populated before they are in-place encrypted, but
we'd all eventually decided that having KVM handle this internally was
the simplest approach.

So that's how it's done here, KVM_SNP_LAUNCH_UPDATE copies the pages into
gmem, then passes those pages on to firmware for encryption. Then the
VMM is expected to mark the GFN range as private via
KVM_SET_MEMORY_ATTRIBUTES, since the VMM understands what constitutes
initial private/encrypted payload. I should document that better in
KVM_SNP_LAUNCH_UPDATE docs however.

> 
> > +                __u32 len;              /* length of memory region */
> 
> Bytes?  Pages?  One field above operates on frame numbers, one apparently operates
> on a byte-granularity address.

If we implement things as mentioned above, it makes sense to decouple
uaddr from any page alignment/size restrictions since it would always be
copied into the target gmem page starting at byte 0. This sort of
assumes that the gmem page will initially be zero'd however, which is
the case currently, but there's a TODO in kvm_gmem_get_folio() about
potentially off-loading that to firmware. I'm not sure it would ever
be applicable for these pages though. Worst case, KVM_SNP_LAUNCH_UPDATE
can pad with 0's.

> 
> > +                __u8 imi_page;          /* 1 if memory is part of the IMI */
> 
> What's "the IMI"?  Initial Measurement Image?

Yes, though the SNP Firmware ABI also references it as "Incoming Migration
Image", which I think is a little clearer about its purpose and so that's
the terminology I've been using in the kernel.

> What's "the IMI"?  Initial Measurement Image?  I assume this is essentially just
> a flag that communicates whether or not the page should be measured?

This is actually for loading a measured migration agent into the target
system so that it can handle receiving the encrypted guest data from the
source. There's still a good deal of planning around how live migration
will be handled however so if you think it's premature to expose this
via KVM I can remove the related fields.

> 
> > +                __u8 page_type;         /* page type */
> > +                __u8 vmpl3_perms;       /* VMPL3 permission mask */
> > +                __u8 vmpl2_perms;       /* VMPL2 permission mask */
> > +                __u8 vmpl1_perms;       /* VMPL1 permission mask */
> 
> Why?  KVM doesn't support VMPLs.

It does actually get used by the SVSM. I can remove these but then we'd
need some capability bit or something to know when they are available if
they get re-introduced. But that may be needed anyway since KVM needs
some additional changes to handle scheduling threads running at
different VMPL levels.

> 
> > +static int snp_page_reclaim(u64 pfn)
> > +{
> > +	struct sev_data_snp_page_reclaim data = {0};
> > +	int err, rc;
> > +
> > +	data.paddr = __sme_set(pfn << PAGE_SHIFT);
> > +	rc = sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, &data, &err);
> > +	if (rc) {
> > +		/*
> > +		 * If the reclaim failed, then page is no longer safe
> > +		 * to use.
> 
> Uh, why can reclaim fail, and why does the kernel apparently not care about
> leaking pages?  AFAICT, nothing ever complains beyond a pr_debug.  That sounds
> bonkers to me, i.e. at the very minimum, why doesn't this warrant a WARN_ON_ONCE?

PAGE_RECLAIM shouldn't happen in practice, so yes, it makes sense to warn
about this when it does. snp_leak_pages() is probably the most
consistent/user-friendly place to convey these failures so I'll add a
pr_warn() there.

> 
> > +		 */
> > +		snp_leak_pages(pfn, 1);
> > +	}
> > +
> > +	return rc;
> > +}
> > +
> > +static int host_rmp_make_shared(u64 pfn, enum pg_level level, bool leak)
> > +{
> > +	int rc;
> > +
> > +	rc = rmp_make_shared(pfn, level);
> > +	if (rc && leak)
> > +		snp_leak_pages(pfn,
> > +			       page_level_size(level) >> PAGE_SHIFT);
> 
> Completely unnecessary wrap.
> 
> > +
> > +	return rc;
> > +}
> > +
> >  static void sev_unbind_asid(struct kvm *kvm, unsigned int handle)
> >  {
> >  	struct sev_data_deactivate deactivate;
> > @@ -1990,6 +2020,154 @@ static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
> >  	return rc;
> >  }
> >  
> > +static int snp_launch_update_gfn_handler(struct kvm *kvm,
> > +					 struct kvm_gfn_range *range,
> > +					 void *opaque)
> > +{
> > +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> > +	struct kvm_memory_slot *memslot = range->slot;
> > +	struct sev_data_snp_launch_update data = {0};
> > +	struct kvm_sev_snp_launch_update params;
> > +	struct kvm_sev_cmd *argp = opaque;
> > +	int *error = &argp->error;
> > +	int i, n = 0, ret = 0;
> > +	unsigned long npages;
> > +	kvm_pfn_t *pfns;
> > +	gfn_t gfn;
> > +
> > +	if (!kvm_slot_can_be_private(memslot)) {
> > +		pr_err("SEV-SNP requires private memory support via guest_memfd.\n");
> 
> Yeah, no.  Sprinkling pr_err() all over the place in user-triggerable error paths
> is not acceptable.  I get that it's often hard to extract "what went wrong" out
> of the kernel, but adding pr_err() everywhere is not a viable solution.

Makes sense, I'll drop this.

> 
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params))) {
> > +		pr_err("Failed to copy user parameters for SEV-SNP launch.\n");
> > +		return -EFAULT;
> > +	}
> > +
> > +	data.gctx_paddr = __psp_pa(sev->snp_context);
> > +
> > +	npages = range->end - range->start;
> > +	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL_ACCOUNT);
> > +	if (!pfns)
> > +		return -ENOMEM;
> > +
> > +	pr_debug("%s: GFN range 0x%llx-0x%llx, type %d\n", __func__,
> > +		 range->start, range->end, params.page_type);
> > +
> > +	for (gfn = range->start, i = 0; gfn < range->end; gfn++, i++) {
> > +		int order, level;
> > +		bool assigned;
> > +		void *kvaddr;
> > +
> > +		ret = __kvm_gmem_get_pfn(kvm, memslot, gfn, &pfns[i], &order, false);
> > +		if (ret)
> > +			goto e_release;
> > +
> > +		n++;
> > +		ret = snp_lookup_rmpentry((u64)pfns[i], &assigned, &level);
> > +		if (ret || assigned) {
> > +			pr_err("Failed to ensure GFN 0x%llx is in initial shared state, ret: %d, assigned: %d\n",
> > +			       gfn, ret, assigned);
> > +			return -EFAULT;
> > +		}
> > +
> > +		kvaddr = pfn_to_kaddr(pfns[i]);
> > +		if (!virt_addr_valid(kvaddr)) {
> 
> I really, really don't like that this assume guest_memfd is backed by struct page.

There are similar enforcements in the SEV/SEV-ES code. There was some
initial discussion about relaxing this for SNP so we could support
things like /dev/mem-mapped guest memory, but then guest_memfd came
along and made that to be an unlikely use-case in the near-term given
that it relies on alloc_pages() currently and explicitly guards against
mmap()'ing pages in userspace.

I think it makes to keep the current tightened restrictions in-place
until such a use-case comes along, since otherwise we are relaxing a
bunch of currently-useful sanity checks that span all throughout the code
to support some nebulous potential use-case that might never come along.
I think it makes more sense to cross that bridge when we get there.

> 
> > +			pr_err("Invalid HVA 0x%llx for GFN 0x%llx\n", (uint64_t)kvaddr, gfn);
> > +			ret = -EINVAL;
> > +			goto e_release;
> > +		}
> > +
> > +		ret = kvm_read_guest_page(kvm, gfn, kvaddr, 0, PAGE_SIZE);
> 
> Good gravy.  If I'm reading this correctly, KVM:
> 
>   1. Translates an HVA into a GFN.
>   2. Gets the PFN for that GFN from guest_memfd
>   3. Verifies the PFN is not assigned to the guest
>   4. Copies memory from the shared memslot page to the guest_memfd page
>   5. Converts the page to private and asks the PSP to encrypt it
> 
> (a) As above, why is #1 a thing?

Yah, it's probably best to avoid this, as proposed above.

> (b) Why are KVM's memory attributes never consulted?

It doesn't really matter if the attributes are set before or after
KVM_SNP_LAUNCH_UPDATE, only that by the time the guest actually launches
they pages get set to private so they get faulted in from gmem. We could
document our expectations and enforce them here if that's preferable
however. Maybe requiring KVM_SET_MEMORY_ATTRIBUTES(private) in advance
would make it easier to enforce that userspace does the right thing.
I'll see how that looks if there are no objections.

> (c) What prevents TOCTOU issues with respect to the RMP?

Time-of-use will be when the guest faults the gmem page in with C-bit
set. If it is not in the expected Guest-owned/pre-validated state that
SEV_CMD_SNP_LAUNCH_UPDATE expected/set, then the guest will get an RMP
fault or #VC exception for unvalidated page access. It will also fail
attestation. Not sure if that covers the scenarios you had in mind.

> (d) Why is *KVM* copying memory into guest_memfd?

As mentioned above, there were various discussions of ways of allowing
userspace to pwrite() to the guest_memfd in advance of
"sealing"/"binding" it and then encrypting it in place. I think this was
one of the related threads:

  https://lore.kernel.org/linux-mm/YkyKywkQYbr9U0CA@google.com/

My read at the time was that the requirements between pKVM/TDX/SNP were all
so unique that we'd spin forever trying to come up with a userspace ABI
that worked for everyone. At the time you'd suggested for pKVM to handle
their specific requirements internally in pKVM to avoid unecessary churn on
TDX/SNP side, and I took the same approach with SNP in implementing it
internally in SNP's KVM interfaces since it seemed unlikely there would
be much common ground with how TDX handles it via KVM_TDX_INIT_MEM_REGION.

> (e) What guarantees the direct map is valid for guest_memfd?

Are you suggesting this may change in the near-term? If so, we can
re-work the code to write to guest_memfd via a temporary mapping or
something, but otherwise it seems awkward to account for that scenario
in current code given that SNP specifically has hooks to remove/re-add
directmap entries based on RMPUPDATEs to avoid host breakage, so we
would necessarily need to implement changes if guest_memfd ever made
any changes in this regard.

And we had prior discussions about handling directmap invalidation in
guest_memfd, but Kirill mentioned here[1] that special handling didn't
actually seem to be a requirement of TDX private memory, and so it
didn't seem likely that pushing that into gmem would be a welcome
change.

All that said, TDX does still seem to invalidate directmap entries as
part of tdh_mem_page_add(), so maybe there is a requirement there and
this is worth revisiting?

If so though, it's worth mentioning that cpa_lock contention on
directmap updates is actually a significant contributor to some
scalability issues we've noticed with lots of guests/vCPUs doing lazy
acceptance and needing to frequently invalidate directmap entries as part
of rmpupdate() during gmem allocations, so we're considering just forcing
a 4K directmap for SNP until directmap updates can scale better, so that
might be another reason to not have guest_memfd in the business of
managing directmap updates until there's some concrete use-case in
sight, like being able to rebuild 2MB/1GB directmap entries in a
scalable way during run-time.

[1] https://lore.kernel.org/linux-mm/20221102212656.6giugw542jdxsvhh@amd.com/

> (f) Why does KVM's uAPI *require* the source page to come from the same memslot?

As mentioned above, I think it makes sense to do away with this
requirement and just treat source page as any other user-provided blob.

> 
> > +		if (ret) {
> > +			pr_err("Guest read failed, ret: 0x%x\n", ret);
> > +			goto e_release;
> > +		}
> > +
> > +		ret = rmp_make_private(pfns[i], gfn << PAGE_SHIFT, PG_LEVEL_4K,
> > +				       sev_get_asid(kvm), true);
> > +		if (ret) {
> > +			ret = -EFAULT;
> > +			goto e_release;
> > +		}
> > +
> > +		data.address = __sme_set(pfns[i] << PAGE_SHIFT);
> > +		data.page_size = PG_LEVEL_TO_RMP(PG_LEVEL_4K);
> > +		data.page_type = params.page_type;
> > +		data.vmpl3_perms = params.vmpl3_perms;
> > +		data.vmpl2_perms = params.vmpl2_perms;
> > +		data.vmpl1_perms = params.vmpl1_perms;
> > +		ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE,
> > +				      &data, error);
> > +		if (ret) {
> > +			pr_err("SEV-SNP launch update failed, ret: 0x%x, fw_error: 0x%x\n",
> > +			       ret, *error);
> > +			snp_page_reclaim(pfns[i]);
> > +
> > +			/*
> > +			 * When invalid CPUID function entries are detected, the firmware
> > +			 * corrects these entries for debugging purpose and leaves the
> > +			 * page unencrypted so it can be provided users for debugging
> > +			 * and error-reporting.
> > +			 *
> > +			 * Copy the corrected CPUID page back to shared memory so
> > +			 * userpsace can retrieve this information.
> 
> Why?  IIUC, this is basically backdooring reads/writes into guest_memfd to avoid
> having to add proper mmap() support.

The CPUID page is private/encrypted, so it needs to be a gmem page.
SNP firmware is doing the backdooring when it writes the unencrypted,
expected contents into the page during failure. What's wrong with copying
the contents back into the source page so userspace can be use of it?
If we implement the above-mentioned changes then the source page is just
a userspace buffer that isn't necessarily associated with a memslot, so
it becomes even more straightforward.

Would that be acceptable? I'm not sure what you're proposing with
mmap().

>  
> > +			 */
> > +			if (params.page_type == SNP_PAGE_TYPE_CPUID &&
> > +			    *error == SEV_RET_INVALID_PARAM) {
> > +				int ret;
> 
> Ugh, do not shadow variables.

Will fix.

> 
> > +
> > +				host_rmp_make_shared(pfns[i], PG_LEVEL_4K, true);
> > +
> > +				ret = kvm_write_guest_page(kvm, gfn, kvaddr, 0, PAGE_SIZE);
> > +				if (ret)
> > +					pr_err("Failed to write CPUID page back to userspace, ret: 0x%x\n",
> > +					       ret);
> > +			}
> > +
> > +			goto e_release;
> > +		}
> > +	}
> > +
> > +e_release:
> > +	/* Content of memory is updated, mark pages dirty */
> > +	for (i = 0; i < n; i++) {
> > +		set_page_dirty(pfn_to_page(pfns[i]));
> > +		mark_page_accessed(pfn_to_page(pfns[i]));
> > +
> > +		/*
> > +		 * If its an error, then update RMP entry to change page ownership
> > +		 * to the hypervisor.
> > +		 */
> > +		if (ret)
> > +			host_rmp_make_shared(pfns[i], PG_LEVEL_4K, true);
> > +
> > +		put_page(pfn_to_page(pfns[i]));
> > +	}
> > +
> > +	kvfree(pfns);
> 
> Saving PFNs from guest_memfd, which is fully owned by KVM, is so unnecessarily
> complex.  Add a guest_memfd API (or three) to do this safely, e.g. to lock the
> pages, do (and track) the RMP conversion, etc...

Is adding 3 gmem APIs and tracking RMP states inside gmem really less
complex than what's going on here? The PFNs are only held on to for the
duration of this single function so they can be cleanly rolled back, and
we're using blessed interfaces like kvm_gmem_get_pfn() to get at them.

There's some nuances here that I'm not sure will map to a re-usable gmem
API that would benefit other users. For instance, we need to:

  1) grab the gmem PFN
  2) initialize it in some platform-specific way (copy from source buffer in this case)
  3) switch it private in RMP table
  4) execute SEV_CMD_SNP_LAUNCH_UPDATE firmware cmd

If 2-4 can all be done with self-contained platform-specific callback,
then I could add a gmem API like:

  gmem_initialize_gfn_range(start, end, func, opaque)

  where:
    func: does roughly what snp_launch_update_gfn_handler currently does for
          each PFN it is handed
    opaque: some data structure that would provide the source buffer to
            initialize the gmem pages from

Is that along the lines of what you're suggesting? It wouldn't involve
"tracking" RMP conversions, 'func' would be aware of that for each
PFN it is handed, but it's simple enough that it is easily re-usable for
other platforms without too much fuss.

If you really want to build some deeper tracking of RMP table states
into gmem internals, then I could really use your feedback on the
gmem_prepare() hook I added in this RFC[2] and included as part of this
series, because I ended up not implementing tracking for number of
reason detailed under "Hooks for preparing gmem pages" in the RFC cover
letter and would likely need to revisit that aspect first before
building out this interface.

[2] https://lore.kernel.org/kvm/20231016115028.996656-1-michael.roth@amd.com/

Thanks,

Mike

Sean Christopherson Feb. 2, 2024, 10:54 p.m. UTC | #3

On Mon, Jan 15, 2024, Michael Roth wrote:
> On Wed, Jan 10, 2024 at 07:45:36AM -0800, Sean Christopherson wrote:
> > On Sat, Dec 30, 2023, Michael Roth wrote:
> > > diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> > > index b1beb2fe8766..d4325b26724c 100644
> > > --- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> > > +++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> > > @@ -485,6 +485,34 @@ Returns: 0 on success, -negative on error
> > >  
> > >  See the SEV-SNP specification for further detail on the launch input.
> > >  
> > > +20. KVM_SNP_LAUNCH_UPDATE
> > > +-------------------------
> > > +
> > > +The KVM_SNP_LAUNCH_UPDATE is used for encrypting a memory region. It also
> > > +calculates a measurement of the memory contents. The measurement is a signature
> > > +of the memory contents that can be sent to the guest owner as an attestation
> > > +that the memory was encrypted correctly by the firmware.
> > > +
> > > +Parameters (in): struct  kvm_snp_launch_update
> > > +
> > > +Returns: 0 on success, -negative on error
> > > +
> > > +::
> > > +
> > > +        struct kvm_sev_snp_launch_update {
> > > +                __u64 start_gfn;        /* Guest page number to start from. */
> > > +                __u64 uaddr;            /* userspace address need to be encrypted */
> > 
> > Huh?  Why is KVM taking a userspace address?  IIUC, the address unconditionally
> > gets translated into a gfn, so why not pass a gfn?
> > 
> > And speaking of gfns, AFAICT start_gfn is never used.
> 
> I think having both the uaddr and start_gfn parameters makes sense. I
> think it's only awkward because how I'm using the memslot to translate
> the uaddr to a GFN in the current implementation,

Yes.

> > Oof, reading more of the code, this *requires* an effective in-place copy-and-convert
> > of guest memory.
> 
> Yes, I'm having some trouble locating the various threads, but initially
> there were some discussions about having a userspace API that allows for
> UPM/gmem pages to be pre-populated before they are in-place encrypted, but
> we'd all eventually decided that having KVM handle this internally was
> the simplest approach.
> 
> So that's how it's done here, KVM_SNP_LAUNCH_UPDATE copies the pages into
> gmem, then passes those pages on to firmware for encryption. Then the
> VMM is expected to mark the GFN range as private via
> KVM_SET_MEMORY_ATTRIBUTES, since the VMM understands what constitutes
> initial private/encrypted payload. I should document that better in
> KVM_SNP_LAUNCH_UPDATE docs however.

That's fine.  As above, my complaints are that the unencrypted source *must* be
covered by a memslot.  That's beyond ugly.

> > What's "the IMI"?  Initial Measurement Image?  I assume this is essentially just
> > a flag that communicates whether or not the page should be measured?
> 
> This is actually for loading a measured migration agent into the target
> system so that it can handle receiving the encrypted guest data from the
> source. There's still a good deal of planning around how live migration
> will be handled however so if you think it's premature to expose this
> via KVM I can remove the related fields.

Yes, please.  Though FWIW, I honestly hope KVM_SEV_SNP_LAUNCH_UPDATE goes away
and we end up with a common uAPI for populating guest memory:

https://lore.kernel.org/all/Zbrj5WKVgMsUFDtb@google.com

> > > +                __u8 page_type;         /* page type */
> > > +                __u8 vmpl3_perms;       /* VMPL3 permission mask */
> > > +                __u8 vmpl2_perms;       /* VMPL2 permission mask */
> > > +                __u8 vmpl1_perms;       /* VMPL1 permission mask */
> > 
> > Why?  KVM doesn't support VMPLs.
> 
> It does actually get used by the SVSM.

> I can remove these but then we'd need some capability bit or something to
> know when they are available if they get re-introduced.

_If_.  We don't merge dead code, and we _definitely_ don't merge dead code that
creates ABI.

> > > +		kvaddr = pfn_to_kaddr(pfns[i]);
> > > +		if (!virt_addr_valid(kvaddr)) {
> > 
> > I really, really don't like that this assume guest_memfd is backed by struct page.
> 
> There are similar enforcements in the SEV/SEV-ES code. There was some
> initial discussion about relaxing this for SNP so we could support
> things like /dev/mem-mapped guest memory, but then guest_memfd came
> along and made that to be an unlikely use-case in the near-term given
> that it relies on alloc_pages() currently and explicitly guards against
> mmap()'ing pages in userspace.
> 
> I think it makes to keep the current tightened restrictions in-place
> until such a use-case comes along, since otherwise we are relaxing a
> bunch of currently-useful sanity checks that span all throughout the code
> to support some nebulous potential use-case that might never come along.
> I think it makes more sense to cross that bridge when we get there.

I disagree.  You say "sanity checks", while I see a bunch of arbitrary assumptions
that don't need to exist.  Yes, today guest_memfd always uses struct page memory,
but that should have _zero_ impact on SNP.  Arbitrary assumptions often cause a
lot of confusion for future readers, e.g. a few years from now, if the above code
still exists, someone might wonder what is so special about struct page memory,
and then waste a bunch of time trying to figure out that there's no technical
reason SNP "requires" struct page memory.

This is partly why I was pushing for guest_memfd to handle some of this; the
gmem code _knows_ what backing type it's using, it _knows_ if the direct map is
valid, etc.

> > > +			pr_err("Invalid HVA 0x%llx for GFN 0x%llx\n", (uint64_t)kvaddr, gfn);
> > > +			ret = -EINVAL;
> > > +			goto e_release;
> > > +		}
> > > +
> > > +		ret = kvm_read_guest_page(kvm, gfn, kvaddr, 0, PAGE_SIZE);
> > 
> > Good gravy.  If I'm reading this correctly, KVM:
> > 
> >   1. Translates an HVA into a GFN.
> >   2. Gets the PFN for that GFN from guest_memfd
> >   3. Verifies the PFN is not assigned to the guest
> >   4. Copies memory from the shared memslot page to the guest_memfd page
> >   5. Converts the page to private and asks the PSP to encrypt it
> > 
> > (a) As above, why is #1 a thing?
> 
> Yah, it's probably best to avoid this, as proposed above.
> 
> > (b) Why are KVM's memory attributes never consulted?
> 
> It doesn't really matter if the attributes are set before or after
> KVM_SNP_LAUNCH_UPDATE, only that by the time the guest actually launches
> they pages get set to private so they get faulted in from gmem. We could
> document our expectations and enforce them here if that's preferable
> however. Maybe requiring KVM_SET_MEMORY_ATTRIBUTES(private) in advance
> would make it easier to enforce that userspace does the right thing.
> I'll see how that looks if there are no objections.

Userspace owns whether a page is PRIVATE or SHARED, full stop.  If KVM can't
honor that, then we need to come up with better uAPI.

> > (c) What prevents TOCTOU issues with respect to the RMP?
> 
> Time-of-use will be when the guest faults the gmem page in with C-bit
> set. If it is not in the expected Guest-owned/pre-validated state that
> SEV_CMD_SNP_LAUNCH_UPDATE expected/set, then the guest will get an RMP
> fault or #VC exception for unvalidated page access. It will also fail
> attestation. Not sure if that covers the scenarios you had in mind.

I don't think this covers what I was asking, but I suspect my concern will go
away once the new APIs come along, so let's table this for now.

> 
> > (d) Why is *KVM* copying memory into guest_memfd?
> 
> As mentioned above, there were various discussions of ways of allowing
> userspace to pwrite() to the guest_memfd in advance of
> "sealing"/"binding" it and then encrypting it in place. I think this was
> one of the related threads:
> 
>   https://lore.kernel.org/linux-mm/YkyKywkQYbr9U0CA@google.com/
> 
> My read at the time was that the requirements between pKVM/TDX/SNP were all
> so unique that we'd spin forever trying to come up with a userspace ABI
> that worked for everyone. At the time you'd suggested for pKVM to handle
> their specific requirements internally in pKVM to avoid unecessary churn on
> TDX/SNP side, and I took the same approach with SNP in implementing it
> internally in SNP's KVM interfaces since it seemed unlikely there would
> be much common ground with how TDX handles it via KVM_TDX_INIT_MEM_REGION.

Yeah, the whole "intra-memslot copy" thing threw me.

> > (e) What guarantees the direct map is valid for guest_memfd?
> 
> Are you suggesting this may change in the near-term?

I asking because, when I asked, I was unaware that the plan was to shatter the
direct to address the 2MiB vs. 4KiB erratum (as opposed to unmapping guest_memfd
pfns).

> > > +		if (ret) {
> > > +			pr_err("SEV-SNP launch update failed, ret: 0x%x, fw_error: 0x%x\n",
> > > +			       ret, *error);
> > > +			snp_page_reclaim(pfns[i]);
> > > +
> > > +			/*
> > > +			 * When invalid CPUID function entries are detected, the firmware
> > > +			 * corrects these entries for debugging purpose and leaves the
> > > +			 * page unencrypted so it can be provided users for debugging
> > > +			 * and error-reporting.
> > > +			 *
> > > +			 * Copy the corrected CPUID page back to shared memory so
> > > +			 * userpsace can retrieve this information.
> > 
> > Why?  IIUC, this is basically backdooring reads/writes into guest_memfd to avoid
> > having to add proper mmap() support.
> 
> The CPUID page is private/encrypted, so it needs to be a gmem page.
> SNP firmware is doing the backdooring when it writes the unencrypted,
> expected contents into the page during failure. What's wrong with copying
> the contents back into the source page so userspace can be use of it?
> If we implement the above-mentioned changes then the source page is just
> a userspace buffer that isn't necessarily associated with a memslot, so
> it becomes even more straightforward.
> 
> Would that be acceptable?

Yes, I am specifically complaining about writing guest memory on failure, which is
all kinds of weird.

> > > +	kvfree(pfns);
> > 
> > Saving PFNs from guest_memfd, which is fully owned by KVM, is so unnecessarily
> > complex.  Add a guest_memfd API (or three) to do this safely, e.g. to lock the
> > pages, do (and track) the RMP conversion, etc...
> 
> Is adding 3 gmem APIs and tracking RMP states inside gmem really less
> complex than what's going on here?

I think we covered this in PUCK?  Holler if you still have questions here.

Paolo Bonzini Feb. 6, 2024, 11:43 p.m. UTC | #4

On Fri, Feb 2, 2024 at 11:55 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > +        struct kvm_sev_snp_launch_update {
> > > > +                __u64 start_gfn;        /* Guest page number to start from. */
> > > > +                __u64 uaddr;            /* userspace address need to be encrypted */
> > >
> > > Huh?  Why is KVM taking a userspace address?  IIUC, the address unconditionally
> > > gets translated into a gfn, so why not pass a gfn?
> > >
> > > And speaking of gfns, AFAICT start_gfn is never used.
> >
> > I think having both the uaddr and start_gfn parameters makes sense. I
> > think it's only awkward because how I'm using the memslot to translate
> > the uaddr to a GFN in the current implementation,
>
> Yes.
>
> > > Oof, reading more of the code, this *requires* an effective in-place copy-and-convert
> > > of guest memory.
> >
> > So that's how it's done here, KVM_SNP_LAUNCH_UPDATE copies the pages into
> > gmem, then passes those pages on to firmware for encryption. Then the
> > VMM is expected to mark the GFN range as private via
> > KVM_SET_MEMORY_ATTRIBUTES, since the VMM understands what constitutes
> > initial private/encrypted payload. I should document that better in
> > KVM_SNP_LAUNCH_UPDATE docs however.
>
> That's fine.  As above, my complaints are that the unencrypted source *must* be
> covered by a memslot.  That's beyond ugly.

Yes, if there's one field that has to go it's uaddr, and then you'll
have a non-in-place encrypt (any copy performed by KVM it is hidden).

> > > > +         kvaddr = pfn_to_kaddr(pfns[i]);
> > > > +         if (!virt_addr_valid(kvaddr)) {
> > >
> > > I really, really don't like that this assume guest_memfd is backed by struct page.
> >
> > There are similar enforcements in the SEV/SEV-ES code. There was some
> > initial discussion about relaxing this for SNP so we could support
> > things like /dev/mem-mapped guest memory, but then guest_memfd came
> > along and made that to be an unlikely use-case in the near-term given
> > that it relies on alloc_pages() currently and explicitly guards against
> > mmap()'ing pages in userspace.
> >
> > I think it makes to keep the current tightened restrictions in-place
> > until such a use-case comes along, since otherwise we are relaxing a
> > bunch of currently-useful sanity checks that span all throughout the code

What sanity is being checked for, in other words why are they useful?
If all you get for breaking the promise is a KVM_BUG_ON, for example,
that's par for the course. If instead you get an oops, then we have a
problem.

I may be a bit less draconian than Sean, but the assumptions need to
be documented and explained because they _are_ going to go away.

> > > (b) Why are KVM's memory attributes never consulted?
> >
> > It doesn't really matter if the attributes are set before or after
> > KVM_SNP_LAUNCH_UPDATE, only that by the time the guest actually launches
> > they pages get set to private so they get faulted in from gmem. We could
> > document our expectations and enforce them here if that's preferable
> > however. Maybe requiring KVM_SET_MEMORY_ATTRIBUTES(private) in advance
> > would make it easier to enforce that userspace does the right thing.
> > I'll see how that looks if there are no objections.
>
> Userspace owns whether a page is PRIVATE or SHARED, full stop.  If KVM can't
> honor that, then we need to come up with better uAPI.

Can you explain more verbosely what you mean?

> > > > +                  * When invalid CPUID function entries are detected, the firmware
> > > > +                  * corrects these entries for debugging purpose and leaves the
> > > > +                  * page unencrypted so it can be provided users for debugging
> > > > +                  * and error-reporting.
> > >
> > > Why?  IIUC, this is basically backdooring reads/writes into guest_memfd to avoid
> > > having to add proper mmap() support.
>
> Yes, I am specifically complaining about writing guest memory on failure, which is
> all kinds of weird.

It is weird but I am not sure if you are complaining about firmware
behavior or something else.

Paolo

Sean Christopherson Feb. 7, 2024, 2:43 a.m. UTC | #5

On Wed, Feb 07, 2024, Paolo Bonzini wrote:
> On Fri, Feb 2, 2024 at 11:55 PM Sean Christopherson <seanjc@google.com> wrote:
> > > It doesn't really matter if the attributes are set before or after
> > > KVM_SNP_LAUNCH_UPDATE, only that by the time the guest actually launches
> > > they pages get set to private so they get faulted in from gmem. We could
> > > document our expectations and enforce them here if that's preferable
> > > however. Maybe requiring KVM_SET_MEMORY_ATTRIBUTES(private) in advance
> > > would make it easier to enforce that userspace does the right thing.
> > > I'll see how that looks if there are no objections.
> >
> > Userspace owns whether a page is PRIVATE or SHARED, full stop.  If KVM can't
> > honor that, then we need to come up with better uAPI.
> 
> Can you explain more verbosely what you mean?

As proposed, snp_launch_update_gfn_handler() doesn't verify the state of the
gfns' attributes.  But that's a minor problem and probably not a sticking point.

My overarching complaint is that the code is to be wildly unsafe, or at the very
least brittle.  Without guest_memfd's knowledge, and without holding any locks
beyond kvm->lock, it 

 1) checks if a pfn is shared in the RMP
 2) copies data to the page
 3) converts the page to private in the RMP
 4) does PSP stuff
 5) on failure, converts the page back to shared in RMP
 6) conditionally on failure, writes to the page via a gfn

I'm not at all confident that 1-4 isn't riddled with TOCTOU bugs, and that's
before KVM gains support for intrahost migration, i.e. before KVM allows multiple
VM instances to bind to a single guest_memfd.

But I _think_ we mostly sorted this out at PUCK.  IIRC, the plan is to have guest_memfd
provide (kernel) APIs to allow arch/vendor code to initialize a guest_memfd range.
That will give guest_memfd complete control over the state of a given page, will
allow guest_memfd to take the appropriate locks, and if we're lucky, will be reusable
by other CoCo flavors beyond SNP.

> > > > > +                  * When invalid CPUID function entries are detected, the firmware
> > > > > +                  * corrects these entries for debugging purpose and leaves the
> > > > > +                  * page unencrypted so it can be provided users for debugging
> > > > > +                  * and error-reporting.
> > > >
> > > > Why?  IIUC, this is basically backdooring reads/writes into guest_memfd to avoid
> > > > having to add proper mmap() support.
> >
> > Yes, I am specifically complaining about writing guest memory on failure, which is
> > all kinds of weird.
> 
> It is weird but I am not sure if you are complaining about firmware
> behavior or something else.

This proposed KVM code:

+                               host_rmp_make_shared(pfns[i], PG_LEVEL_4K, true);
+
+                               ret = kvm_write_guest_page(kvm, gfn, kvaddr, 0, PAGE_SIZE);
+                               if (ret)
+                                       pr_err("Failed to write CPUID page back to userspace, ret: 0x%x\n",
+                                              ret);

I have no objection to propagating error/debug information back to userspace,
but it needs to be routed through the source page (or I suppose some dedicated
error page, but that seems like overkill).  Shoving the error information into
guest memory is gross.

But this should naturally go away when the requirement that the source be
covered by the same memslot also goes away.

Paolo Bonzini Feb. 7, 2024, 8:03 a.m. UTC | #6

On Wed, Feb 7, 2024 at 3:43 AM Sean Christopherson <seanjc@google.com> wrote:
> > > Userspace owns whether a page is PRIVATE or SHARED, full stop.  If KVM can't
> > > honor that, then we need to come up with better uAPI.
> >
> > Can you explain more verbosely what you mean?
>
> As proposed, snp_launch_update_gfn_handler() doesn't verify the state of the
> gfns' attributes.  But that's a minor problem and probably not a sticking point.
>
> My overarching complaint is that the code is to be wildly unsafe, or at the very
> least brittle.  Without guest_memfd's knowledge, and without holding any locks
> beyond kvm->lock, it
>
>  1) checks if a pfn is shared in the RMP
>  2) copies data to the page
>  3) converts the page to private in the RMP
>  4) does PSP stuff
>  5) on failure, converts the page back to shared in RMP
>  6) conditionally on failure, writes to the page via a gfn
>
> I'm not at all confident that 1-4 isn't riddled with TOCTOU bugs, and that's
> before KVM gains support for intrahost migration, i.e. before KVM allows multiple
> VM instances to bind to a single guest_memfd.

Absolutely.

> But I _think_ we mostly sorted this out at PUCK.  IIRC, the plan is to have guest_memfd
> provide (kernel) APIs to allow arch/vendor code to initialize a guest_memfd range.
> That will give guest_memfd complete control over the state of a given page, will
> allow guest_memfd to take the appropriate locks, and if we're lucky, will be reusable
> by other CoCo flavors beyond SNP.

Ok, either way it's clear that guest_memfd needs to be in charge.
Whether it's MEMORY_ENCRYPT_OP that calls into guest_memfd or vice
versa, that only matters so much.

> > > Yes, I am specifically complaining about writing guest memory on failure, which is
> > > all kinds of weird.
> >
> > It is weird but I am not sure if you are complaining about firmware
> > behavior or something else.
>
> This proposed KVM code:
>
> +                               host_rmp_make_shared(pfns[i], PG_LEVEL_4K, true);
> +
> +                               ret = kvm_write_guest_page(kvm, gfn, kvaddr, 0, PAGE_SIZE);
> +                               if (ret)
> +                                       pr_err("Failed to write CPUID page back to userspace, ret: 0x%x\n",
> +                                              ret);
>
>
> I have no objection to propagating error/debug information back to userspace,
> but it needs to be routed through the source page (or I suppose some dedicated
> error page, but that seems like overkill).  Shoving the error information into
> guest memory is gross.

Yes, but it should be just a consequence of not actually using
start_gfn. Having to copy back remains weird, but what can you do.

Paolo

Michael Roth Feb. 9, 2024, 1:52 a.m. UTC | #7

On Wed, Feb 07, 2024 at 12:43:02AM +0100, Paolo Bonzini wrote:
> On Fri, Feb 2, 2024 at 11:55 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > > +        struct kvm_sev_snp_launch_update {
> > > > > +                __u64 start_gfn;        /* Guest page number to start from. */
> > > > > +                __u64 uaddr;            /* userspace address need to be encrypted */
> > > >
> > > > Huh?  Why is KVM taking a userspace address?  IIUC, the address unconditionally
> > > > gets translated into a gfn, so why not pass a gfn?
> > > >
> > > > And speaking of gfns, AFAICT start_gfn is never used.
> > >
> > > I think having both the uaddr and start_gfn parameters makes sense. I
> > > think it's only awkward because how I'm using the memslot to translate
> > > the uaddr to a GFN in the current implementation,
> >
> > Yes.
> >
> > > > Oof, reading more of the code, this *requires* an effective in-place copy-and-convert
> > > > of guest memory.
> > >
> > > So that's how it's done here, KVM_SNP_LAUNCH_UPDATE copies the pages into
> > > gmem, then passes those pages on to firmware for encryption. Then the
> > > VMM is expected to mark the GFN range as private via
> > > KVM_SET_MEMORY_ATTRIBUTES, since the VMM understands what constitutes
> > > initial private/encrypted payload. I should document that better in
> > > KVM_SNP_LAUNCH_UPDATE docs however.
> >
> > That's fine.  As above, my complaints are that the unencrypted source *must* be
> > covered by a memslot.  That's beyond ugly.
> 
> Yes, if there's one field that has to go it's uaddr, and then you'll
> have a non-in-place encrypt (any copy performed by KVM it is hidden).
> 
> > > > > +         kvaddr = pfn_to_kaddr(pfns[i]);
> > > > > +         if (!virt_addr_valid(kvaddr)) {
> > > >
> > > > I really, really don't like that this assume guest_memfd is backed by struct page.
> > >
> > > There are similar enforcements in the SEV/SEV-ES code. There was some
> > > initial discussion about relaxing this for SNP so we could support
> > > things like /dev/mem-mapped guest memory, but then guest_memfd came
> > > along and made that to be an unlikely use-case in the near-term given
> > > that it relies on alloc_pages() currently and explicitly guards against
> > > mmap()'ing pages in userspace.
> > >
> > > I think it makes to keep the current tightened restrictions in-place
> > > until such a use-case comes along, since otherwise we are relaxing a
> > > bunch of currently-useful sanity checks that span all throughout the code
> 
> What sanity is being checked for, in other words why are they useful?
> If all you get for breaking the promise is a KVM_BUG_ON, for example,
> that's par for the course. If instead you get an oops, then we have a
> problem.
> 
> I may be a bit less draconian than Sean, but the assumptions need to
> be documented and explained because they _are_ going to go away.

Maybe in this case sanity-check isn't the right word, but for instance
the occurance Sean objected to:

  kvaddr = pfn_to_kaddr(pfns[i]);
  if (!virt_addr_valid(kvaddr)) {
    ...
    ret = -EINVAL;

where there are pfn_valid() checks underneath the covers that provide
some assurance this is normal struct-page-backed/kernel-tracked memory
that has a mapping in the directmap we can use here. Dropping that
assumption means we need to create temporary mappings to access the PFN,
which complicates the code for a potential use-case that doesn't yet
exist. But if the maintainers are telling me this will change then I
have no objection to making those changes :) That was just my thinking
at the time.

And yes, if we move more of this sort of functionality closer to gmem
then those assumptions become reality and we can keep the code more
closely in sync will how memory is actually allocated.

I'll rework this to something closer to what Sean mentioned during the
PUCK call: a gmem interface that can be called to handle populating
initial gmem pages, and drop remaining any assumptions about
struct-backed/directmapped in PFNs in the code that remains afterward.

I'm hoping if we do move to a unified KVM API that a similar approach
will work in that case too. It may be a bit tricky with how TDX does
a lot of this through KVM MMU / SecureEPT hooks, this may complicate
locking expectations and not necessarily fit nicely into the same flow
as SNP, but we'll see how it goes.

-Mike

Sean Christopherson Feb. 9, 2024, 2:34 p.m. UTC | #8

On Thu, Feb 08, 2024, Michael Roth wrote:
> On Wed, Feb 07, 2024 at 12:43:02AM +0100, Paolo Bonzini wrote:
> > On Fri, Feb 2, 2024 at 11:55 PM Sean Christopherson <seanjc@google.com> wrote:
> > What sanity is being checked for, in other words why are they useful?
> > If all you get for breaking the promise is a KVM_BUG_ON, for example,
> > that's par for the course. If instead you get an oops, then we have a
> > problem.
> > 
> > I may be a bit less draconian than Sean, but the assumptions need to
> > be documented and explained because they _are_ going to go away.
> 
> Maybe in this case sanity-check isn't the right word, but for instance
> the occurance Sean objected to:
> 
>   kvaddr = pfn_to_kaddr(pfns[i]);
>   if (!virt_addr_valid(kvaddr)) {
>     ...
>     ret = -EINVAL;
> 
> where there are pfn_valid() checks underneath the covers that provide
> some assurance this is normal struct-page-backed/kernel-tracked memory
> that has a mapping in the directmap we can use here. Dropping that
> assumption means we need to create temporary mappings to access the PFN,

No, you don't.  kvm_vcpu_map() does all of the lifting for you, with the small
caveat that it obviously needs a vCPU.  But that's trivial to solve with a minor
refactoring, *if* we need to solve that problem (it's not clear to me whether or
not the APIs for copying data into guest_memfd will be VM-scoped or vCPU-scoped).

Peter Gonda March 18, 2024, 9:02 p.m. UTC | #9

On Sat, Dec 30, 2023 at 10:27 AM Michael Roth <michael.roth@amd.com> wrote:
>
> From: Brijesh Singh <brijesh.singh@amd.com>
>
> The KVM_SEV_SNP_LAUNCH_UPDATE command can be used to insert data into
> the guest's memory. The data is encrypted with the cryptographic context
> created with the KVM_SEV_SNP_LAUNCH_START.
>
> In addition to the inserting data, it can insert a two special pages
> into the guests memory: the secrets page and the CPUID page.
>
> While terminating the guest, reclaim the guest pages added in the RMP
> table. If the reclaim fails, then the page is no longer safe to be
> released back to the system and leak them.
>
> For more information see the SEV-SNP specification.
>
> Co-developed-by: Michael Roth <michael.roth@amd.com>
> Signed-off-by: Michael Roth <michael.roth@amd.com>
> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
> ---
>  .../virt/kvm/x86/amd-memory-encryption.rst    |  28 +++
>  arch/x86/kvm/svm/sev.c                        | 181 ++++++++++++++++++
>  include/uapi/linux/kvm.h                      |  19 ++
>  3 files changed, 228 insertions(+)
>
> diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> index b1beb2fe8766..d4325b26724c 100644
> --- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> +++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> @@ -485,6 +485,34 @@ Returns: 0 on success, -negative on error
>
>  See the SEV-SNP specification for further detail on the launch input.
>
> +20. KVM_SNP_LAUNCH_UPDATE
> +-------------------------
> +
> +The KVM_SNP_LAUNCH_UPDATE is used for encrypting a memory region. It also
> +calculates a measurement of the memory contents. The measurement is a signature
> +of the memory contents that can be sent to the guest owner as an attestation
> +that the memory was encrypted correctly by the firmware.

Nit: The measurement is a rolling hash of all the launch updated pages
and their metadata. The attestation quote contains a signature of
information about the SNP VM including this measurement.

Also technically the attestation doesn't confirm to the guest owner
the memory was encrypted correctly, I don't think we can
cryptographically prove that. But the attestation does provide the
guest owner confirmation about the exact steps the ASP took in
creating the SNP VMs initial memory context. If the ASP firmware is
bug free and follows the spec, your 'memory was encrypted correctly by
the firmware' line is implied.

> +
> +Parameters (in): struct  kvm_snp_launch_update
> +
> +Returns: 0 on success, -negative on error
> +
> +::
> +
> +        struct kvm_sev_snp_launch_update {
> +                __u64 start_gfn;        /* Guest page number to start from. */
> +                __u64 uaddr;            /* userspace address need to be encrypted */
> +                __u32 len;              /* length of memory region */
> +                __u8 imi_page;          /* 1 if memory is part of the IMI */
> +                __u8 page_type;         /* page type */
> +                __u8 vmpl3_perms;       /* VMPL3 permission mask */
> +                __u8 vmpl2_perms;       /* VMPL2 permission mask */
> +                __u8 vmpl1_perms;       /* VMPL1 permission mask */
> +        };
> +
> +See the SEV-SNP spec for further details on how to build the VMPL permission
> +mask and page type.
> +
>  References
>  ==========
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index e2f4d4bc125c..d60209e6e68b 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -245,6 +245,36 @@ static void sev_decommission(unsigned int handle)
>         sev_guest_decommission(&decommission, NULL);
>  }
>
> +static int snp_page_reclaim(u64 pfn)
> +{
> +       struct sev_data_snp_page_reclaim data = {0};
> +       int err, rc;
> +
> +       data.paddr = __sme_set(pfn << PAGE_SHIFT);
> +       rc = sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, &data, &err);
> +       if (rc) {
> +               /*
> +                * If the reclaim failed, then page is no longer safe
> +                * to use.
> +                */
> +               snp_leak_pages(pfn, 1);
> +       }
> +
> +       return rc;
> +}
> +
> +static int host_rmp_make_shared(u64 pfn, enum pg_level level, bool leak)
> +{
> +       int rc;
> +
> +       rc = rmp_make_shared(pfn, level);
> +       if (rc && leak)
> +               snp_leak_pages(pfn,
> +                              page_level_size(level) >> PAGE_SHIFT);
> +
> +       return rc;
> +}
> +
>  static void sev_unbind_asid(struct kvm *kvm, unsigned int handle)
>  {
>         struct sev_data_deactivate deactivate;
> @@ -1990,6 +2020,154 @@ static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
>         return rc;
>  }
>
> +static int snp_launch_update_gfn_handler(struct kvm *kvm,
> +                                        struct kvm_gfn_range *range,
> +                                        void *opaque)
> +{
> +       struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> +       struct kvm_memory_slot *memslot = range->slot;
> +       struct sev_data_snp_launch_update data = {0};
> +       struct kvm_sev_snp_launch_update params;
> +       struct kvm_sev_cmd *argp = opaque;
> +       int *error = &argp->error;
> +       int i, n = 0, ret = 0;
> +       unsigned long npages;
> +       kvm_pfn_t *pfns;
> +       gfn_t gfn;
> +
> +       if (!kvm_slot_can_be_private(memslot)) {
> +               pr_err("SEV-SNP requires private memory support via guest_memfd.\n");
> +               return -EINVAL;
> +       }
> +
> +       if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params))) {
> +               pr_err("Failed to copy user parameters for SEV-SNP launch.\n");
> +               return -EFAULT;
> +       }
> +
> +       data.gctx_paddr = __psp_pa(sev->snp_context);
> +
> +       npages = range->end - range->start;
> +       pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL_ACCOUNT);
> +       if (!pfns)
> +               return -ENOMEM;
> +
> +       pr_debug("%s: GFN range 0x%llx-0x%llx, type %d\n", __func__,
> +                range->start, range->end, params.page_type);
> +
> +       for (gfn = range->start, i = 0; gfn < range->end; gfn++, i++) {
> +               int order, level;
> +               bool assigned;
> +               void *kvaddr;
> +
> +               ret = __kvm_gmem_get_pfn(kvm, memslot, gfn, &pfns[i], &order, false);
> +               if (ret)
> +                       goto e_release;
> +
> +               n++;
> +               ret = snp_lookup_rmpentry((u64)pfns[i], &assigned, &level);
> +               if (ret || assigned) {
> +                       pr_err("Failed to ensure GFN 0x%llx is in initial shared state, ret: %d, assigned: %d\n",
> +                              gfn, ret, assigned);
> +                       return -EFAULT;
> +               }
> +
> +               kvaddr = pfn_to_kaddr(pfns[i]);
> +               if (!virt_addr_valid(kvaddr)) {
> +                       pr_err("Invalid HVA 0x%llx for GFN 0x%llx\n", (uint64_t)kvaddr, gfn);
> +                       ret = -EINVAL;
> +                       goto e_release;
> +               }
> +
> +               ret = kvm_read_guest_page(kvm, gfn, kvaddr, 0, PAGE_SIZE);
> +               if (ret) {
> +                       pr_err("Guest read failed, ret: 0x%x\n", ret);

Should these be pr_debugs()?  This could get noisy.

> +                       goto e_release;
> +               }
> +
> +               ret = rmp_make_private(pfns[i], gfn << PAGE_SHIFT, PG_LEVEL_4K,
> +                                      sev_get_asid(kvm), true);
> +               if (ret) {
> +                       ret = -EFAULT;
> +                       goto e_release;
> +               }
> +
> +               data.address = __sme_set(pfns[i] << PAGE_SHIFT);
> +               data.page_size = PG_LEVEL_TO_RMP(PG_LEVEL_4K);
> +               data.page_type = params.page_type;
> +               data.vmpl3_perms = params.vmpl3_perms;
> +               data.vmpl2_perms = params.vmpl2_perms;
> +               data.vmpl1_perms = params.vmpl1_perms;
> +               ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE,
> +                                     &data, error);
> +               if (ret) {
> +                       pr_err("SEV-SNP launch update failed, ret: 0x%x, fw_error: 0x%x\n",
> +                              ret, *error);
> +                       snp_page_reclaim(pfns[i]);
> +
> +                       /*
> +                        * When invalid CPUID function entries are detected, the firmware
> +                        * corrects these entries for debugging purpose and leaves the
> +                        * page unencrypted so it can be provided users for debugging
> +                        * and error-reporting.
> +                        *
> +                        * Copy the corrected CPUID page back to shared memory so
> +                        * userpsace can retrieve this information.

Typo: userpsace

[v11,18/35] KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command

Commit Message

Comments

Patch