
[Part2,v5,39/45] KVM: SVM: Introduce ops for the post gfn map and unmap

Message ID 20210820155918.7518-40-brijesh.singh@amd.com (mailing list archive)
State Not Applicable
Delegated to: Herbert Xu
Series Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support

Commit Message

Brijesh Singh Aug. 20, 2021, 3:59 p.m. UTC
When SEV-SNP is enabled in the guest VM, guest memory pages can be
either private or shared. A write from the hypervisor goes through the
RMP checks. If the hardware sees that the hypervisor is attempting to
write to a guest private page, it triggers an RMP violation #PF.

To avoid the RMP violation, add post_{map,unmap}_gfn() ops that can be
used to verify that it is safe to map a given guest page. Use SRCU to
protect against page state changes for existing mapped pages.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |  2 +
 arch/x86/include/asm/kvm_host.h    |  4 ++
 arch/x86/kvm/svm/sev.c             | 69 +++++++++++++++++++++-----
 arch/x86/kvm/svm/svm.c             |  4 ++
 arch/x86/kvm/svm/svm.h             |  8 +++
 arch/x86/kvm/x86.c                 | 78 +++++++++++++++++++++++++++---
 6 files changed, 146 insertions(+), 19 deletions(-)

Comments

Sean Christopherson Oct. 13, 2021, 12:23 a.m. UTC | #1
On Fri, Aug 20, 2021, Brijesh Singh wrote:
> When SEV-SNP is enabled in the guest VM, the guest memory pages can
> either be a private or shared. A write from the hypervisor goes through
> the RMP checks. If hardware sees that hypervisor is attempting to write
> to a guest private page, then it triggers an RMP violation #PF.
> 
> To avoid the RMP violation, add post_{map,unmap}_gfn() ops that can be
> used to verify that its safe to map a given guest page. Use the SRCU to
> protect against the page state change for existing mapped pages.

SRCU isn't protecting anything.  The synchronize_srcu_expedited() in the PSC code
forces it to wait for existing maps to go away, but it doesn't prevent new maps
from being created while the actual RMP updates are in-flight.  Most telling is
that the RMP updates happen _after_ the synchronize_srcu_expedited() call.
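
For illustration, a minimal sketch of the ordering being described; the
function and the rmp_make_private_sketch() helper below are hypothetical,
not the actual patch:

static int snp_page_state_change_sketch(struct kvm_sev_info *sev,
					kvm_pfn_t pfn, int level)
{
	/* Waits only for readers that already hold psc_srcu... */
	synchronize_srcu_expedited(&sev->psc_srcu);

	/*
	 * ...but nothing stops a new sev_post_map_gfn() caller from taking
	 * srcu_read_lock() and mapping the gfn right here, before the RMP
	 * entry is actually changed below.
	 */
	return rmp_make_private_sketch(pfn, level);	/* hypothetical helper */
}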

This also doesn't handle kvm_{read,write}_guest_cached().

I can't help but note that the private memslots idea[*] would handle this gracefully,
e.g. the memslot lookup would fail, and any change in private memslots would
invalidate the cache due to a generation mismatch.

KSM is another mess that would Just Work.

I'm not saying that SNP should be blocked on support for unmapping guest private
memory, but I do think we should strongly consider focusing on that effort rather
than trying to fix things piecemeal throughout KVM.  I don't think it's too absurd
to say that it might actually be faster overall.  And I 100% think that having a
cohesive design and uABI for SNP and TDX would be hugely beneficial to KVM.

[*] https://lkml.kernel.org/r/20210824005248.200037-1-seanjc@google.com

> +int sev_post_map_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int *token)
> +{
> +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> +	int level;
> +
> +	if (!sev_snp_guest(kvm))
> +		return 0;
> +
> +	*token = srcu_read_lock(&sev->psc_srcu);
> +
> +	/* If pfn is not added as private then fail */

This comment and the pr_err() are backwards, and confused the heck out of me.
snp_lookup_rmpentry() returns '1' if the pfn is assigned, a.k.a. private.  That
means this code throws an error if the page is private, i.e. requires the page
to be shared.  Which makes sense given the use cases, it's just incredibly
confusing.

> +	if (snp_lookup_rmpentry(pfn, &level) == 1) {

Any reason not to provide e.g. rmp_is_shared() and rmp_is_private() so that
callers don't have to care as much about the return values?  The -errno/0/1
semantics are all but guarantee to bite us in the rear at some point.

Actually, peeking at other patches, I think it already has.  This usage in
__unregister_enc_region_locked() is wrong:

	/*
	 * The guest memory pages are assigned in the RMP table. Unassign it
	 * before releasing the memory.
	 */
	if (sev_snp_guest(kvm)) {
		for (i = 0; i < region->npages; i++) {
			pfn = page_to_pfn(region->pages[i]);

			if (!snp_lookup_rmpentry(pfn, &level))  <-- attempts make_shared() on non-existent entry
				continue;

			cond_resched();

			if (level > PG_LEVEL_4K)
				pfn &= ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);

			host_rmp_make_shared(pfn, level, true);
		}
	}
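
A corrected check would only attempt make_shared() on pfns that the RMP
actually reports as assigned, e.g. (sketch):

			/* Skip shared (0) and invalid (<0) entries. */
			if (snp_lookup_rmpentry(pfn, &level) != 1)
				continue;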


> +		srcu_read_unlock(&sev->psc_srcu, *token);
> +		pr_err_ratelimited("failed to map private gfn 0x%llx pfn 0x%llx\n", gfn, pfn);
> +		return -EBUSY;
> +	}
> +
> +	return 0;
> +}
>  static struct kvm_x86_init_ops svm_init_ops __initdata = {
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index d10f7166b39d..ff91184f9b4a 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -76,16 +76,22 @@ struct kvm_sev_info {
>  	bool active;		/* SEV enabled guest */
>  	bool es_active;		/* SEV-ES enabled guest */
>  	bool snp_active;	/* SEV-SNP enabled guest */
> +
>  	unsigned int asid;	/* ASID used for this guest */
>  	unsigned int handle;	/* SEV firmware handle */
>  	int fd;			/* SEV device fd */
> +
>  	unsigned long pages_locked; /* Number of pages locked */
>  	struct list_head regions_list;  /* List of registered regions */
> +
>  	u64 ap_jump_table;	/* SEV-ES AP Jump Table address */
> +
>  	struct kvm *enc_context_owner; /* Owner of copied encryption context */
>  	struct misc_cg *misc_cg; /* For misc cgroup accounting */
> +

Unrelated whitespace changes.

>  	u64 snp_init_flags;
>  	void *snp_context;      /* SNP guest context page */
> +	struct srcu_struct psc_srcu;
>  };
Brijesh Singh Oct. 13, 2021, 6:10 p.m. UTC | #2
On 10/12/21 5:23 PM, Sean Christopherson wrote:
> On Fri, Aug 20, 2021, Brijesh Singh wrote:
>> When SEV-SNP is enabled in the guest VM, the guest memory pages can
>> either be a private or shared. A write from the hypervisor goes through
>> the RMP checks. If hardware sees that hypervisor is attempting to write
>> to a guest private page, then it triggers an RMP violation #PF.
>>
>> To avoid the RMP violation, add post_{map,unmap}_gfn() ops that can be
>> used to verify that its safe to map a given guest page. Use the SRCU to
>> protect against the page state change for existing mapped pages.
> SRCU isn't protecting anything.  The synchronize_srcu_expedited() in the PSC code
> forces it to wait for existing maps to go away, but it doesn't prevent new maps
> from being created while the actual RMP updates are in-flight.  Most telling is
> that the RMP updates happen _after_ the synchronize_srcu_expedited() call.
>
> This also doesn't handle kvm_{read,write}_guest_cached().

Hmm, I thought kvm_{read,write}_guest_cached() uses
copy_{to,from}_user(). Writing to a private page will cause a #PF and
will fail the copy_to_user(). Am I missing something?
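
For reference, the cached write path reduces to roughly the following
(simplified from kvm_write_guest_offset_cached()), so an RMP-induced
fault on a private page surfaces as -EFAULT:

	r = __copy_to_user((void __user *)ghc->hva + offset, data, len);
	if (r)
		return -EFAULT;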


>
> I can't help but note that the private memslots idea[*] would handle this gracefully,
> e.g. the memslot lookup would fail, and any change in private memslots would
> invalidate the cache due to a generation mismatch.
>
> KSM is another mess that would Just Work.
>
> I'm not saying that SNP should be blocked on support for unmapping guest private
> memory, but I do think we should strongly consider focusing on that effort rather
> than trying to fix things piecemeal throughout KVM.  I don't think it's too absurd
> to say that it might actually be faster overall.  And I 100% think that having a
> cohesive design and uABI for SNP and TDX would be hugely beneficial to KVM.
>
> [*] https://lkml.kernel.org/r/20210824005248.200037-1-seanjc@google.com
>
>> +int sev_post_map_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int *token)
>> +{
>> +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>> +	int level;
>> +
>> +	if (!sev_snp_guest(kvm))
>> +		return 0;
>> +
>> +	*token = srcu_read_lock(&sev->psc_srcu);
>> +
>> +	/* If pfn is not added as private then fail */
> This comment and the pr_err() are backwards, and confused the heck out of me.
> snp_lookup_rmpentry() returns '1' if the pfn is assigned, a.k.a. private.  That
> means this code throws an error if the page is private, i.e. requires the page
> to be shared.  Which makes sense given the use cases, it's just incredibly
> confusing.
Actually I followed your recommendation from the previous feedback that
snp_lookup_rmpentry() should return 1 for an assigned page, 0 for a
shared page, and a negative value for invalid. I can clarify it here again.
>
>> +	if (snp_lookup_rmpentry(pfn, &level) == 1) {
> Any reason not to provide e.g. rmp_is_shared() and rmp_is_private() so that
> callers don't have to care as much about the return values?  The -errno/0/1
> semantics are all but guarantee to bite us in the rear at some point.

If we look at the previous series, I had a macro rmp_is_assigned() for
exactly this purpose, but the feedback was to drop those macros and
return the state indirectly through snp_lookup_rmpentry(). I can
certainly add a helper rmp_is_{shared,private}() if it makes the code
more readable.


> Actually, peeking at other patches, I think it already has.  This usage in
> __unregister_enc_region_locked() is wrong:
>
> 	/*
> 	 * The guest memory pages are assigned in the RMP table. Unassign it
> 	 * before releasing the memory.
> 	 */
> 	if (sev_snp_guest(kvm)) {
> 		for (i = 0; i < region->npages; i++) {
> 			pfn = page_to_pfn(region->pages[i]);
>
> 			if (!snp_lookup_rmpentry(pfn, &level))  <-- attempts make_shared() on non-existent entry
> 				continue;
>
> 			cond_resched();
>
> 			if (level > PG_LEVEL_4K)
> 				pfn &= ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
>
> 			host_rmp_make_shared(pfn, level, true);
> 		}
> 	}
>
>
>> +		srcu_read_unlock(&sev->psc_srcu, *token);
>> +		pr_err_ratelimited("failed to map private gfn 0x%llx pfn 0x%llx\n", gfn, pfn);
>> +		return -EBUSY;
>> +	}
>> +
>> +	return 0;
>> +}
>>  static struct kvm_x86_init_ops svm_init_ops __initdata = {
>> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
>> index d10f7166b39d..ff91184f9b4a 100644
>> --- a/arch/x86/kvm/svm/svm.h
>> +++ b/arch/x86/kvm/svm/svm.h
>> @@ -76,16 +76,22 @@ struct kvm_sev_info {
>>  	bool active;		/* SEV enabled guest */
>>  	bool es_active;		/* SEV-ES enabled guest */
>>  	bool snp_active;	/* SEV-SNP enabled guest */
>> +
>>  	unsigned int asid;	/* ASID used for this guest */
>>  	unsigned int handle;	/* SEV firmware handle */
>>  	int fd;			/* SEV device fd */
>> +
>>  	unsigned long pages_locked; /* Number of pages locked */
>>  	struct list_head regions_list;  /* List of registered regions */
>> +
>>  	u64 ap_jump_table;	/* SEV-ES AP Jump Table address */
>> +
>>  	struct kvm *enc_context_owner; /* Owner of copied encryption context */
>>  	struct misc_cg *misc_cg; /* For misc cgroup accounting */
>> +
> Unrelated whitespace changes.
>
>>  	u64 snp_init_flags;
>>  	void *snp_context;      /* SNP guest context page */
>> +	struct srcu_struct psc_srcu;
>>  };
Sean Christopherson Oct. 13, 2021, 8:10 p.m. UTC | #3
On Wed, Oct 13, 2021, Brijesh Singh wrote:
> 
> On 10/12/21 5:23 PM, Sean Christopherson wrote:
> > On Fri, Aug 20, 2021, Brijesh Singh wrote:
> >> When SEV-SNP is enabled in the guest VM, the guest memory pages can
> >> either be a private or shared. A write from the hypervisor goes through
> >> the RMP checks. If hardware sees that hypervisor is attempting to write
> >> to a guest private page, then it triggers an RMP violation #PF.
> >>
> >> To avoid the RMP violation, add post_{map,unmap}_gfn() ops that can be
> >> used to verify that its safe to map a given guest page. Use the SRCU to
> >> protect against the page state change for existing mapped pages.
> > SRCU isn't protecting anything.  The synchronize_srcu_expedited() in the PSC code
> > forces it to wait for existing maps to go away, but it doesn't prevent new maps
> > from being created while the actual RMP updates are in-flight.  Most telling is
> > that the RMP updates happen _after_ the synchronize_srcu_expedited() call.
> >
> > This also doesn't handle kvm_{read,write}_guest_cached().
> 
> Hmm, I thought the kvm_{read_write}_guest_cached() uses the
> copy_{to,from}_user(). Writing to the private will cause a #PF and will
> fail the copy_to_user(). Am I missing something?

Doh, right you are.  I was thinking they cached the kmap, but it's just the
gpa->hva that gets cached.

> > I can't help but note that the private memslots idea[*] would handle this gracefully,
> > e.g. the memslot lookup would fail, and any change in private memslots would
> > invalidate the cache due to a generation mismatch.
> >
> > KSM is another mess that would Just Work.
> >
> > I'm not saying that SNP should be blocked on support for unmapping guest private
> > memory, but I do think we should strongly consider focusing on that effort rather
> > than trying to fix things piecemeal throughout KVM.  I don't think it's too absurd
> > to say that it might actually be faster overall.  And I 100% think that having a
> > cohesive design and uABI for SNP and TDX would be hugely beneficial to KVM.
> >
> > [*] https://lkml.kernel.org/r/20210824005248.200037-1-seanjc@google.com
> >
> >> +int sev_post_map_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int *token)
> >> +{
> >> +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> >> +	int level;
> >> +
> >> +	if (!sev_snp_guest(kvm))
> >> +		return 0;
> >> +
> >> +	*token = srcu_read_lock(&sev->psc_srcu);
> >> +
> >> +	/* If pfn is not added as private then fail */
> > This comment and the pr_err() are backwards, and confused the heck out of me.
> > snp_lookup_rmpentry() returns '1' if the pfn is assigned, a.k.a. private.  That
> > means this code throws an error if the page is private, i.e. requires the page
> > to be shared.  Which makes sense given the use cases, it's just incredibly
> > confusing.
> Actually I followed your recommendation from the previous feedback that
> snp_lookup_rmpentry() should return 1 for the assigned page, 0 for the
> shared and -negative for invalid. I can clarify it here  again.
>
> >> +	if (snp_lookup_rmpentry(pfn, &level) == 1) {
> > Any reason not to provide e.g. rmp_is_shared() and rmp_is_private() so that
> > callers don't have to care as much about the return values?  The -errno/0/1
> > semantics are all but guarantee to bite us in the rear at some point.
> 
> If we look at the previous series, I had a macro rmp_is_assigned() for
> exactly the same purpose but the feedback was to drop those macros and
> return the state indirectly through the snp_lookup_rmpentry(). I can
> certainly add a helper rmp_is_{shared,private}() if it makes code more
> readable.

Ah rats.  Bad communication on my side.  I didn't intend to have non-RMP code
directly consume snp_lookup_rmpentry() for simple checks.  The goal behind the
helper was to bury "struct rmpentry" so that it wasn't visible to the kernel at
large.  I.e. my objection was that rmp_assigned() was looking directly at a
non-architectural struct.

My full thought for snp_lookup_rmpentry() was that it _could_ be consumed directly
without exposing "struct rmpentry", but that simple, common use cases would provide
wrappers, e.g.

static inline bool snp_is_rmp_private(...)
{
	return snp_lookup_rmpentry(...) == 1;
}

static inline bool snp_is_rmp_shared(...)
{
	return snp_lookup_rmpentry(...) < 1;
}
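
With those helpers, the map-side check quoted above could then read,
e.g. (sketch):

	/* Fail the map if the pfn is already private. */
	if (snp_is_rmp_private(pfn)) {
		srcu_read_unlock(&sev->psc_srcu, *token);
		return -EBUSY;
	}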


Side topic, what do you think about s/assigned/private for the "public" APIs, as
suggested/implied above?  I actually like the terminology when talking specifically
about the RMP, but it doesn't fit the abstractions that tend to be used when talking
about these things in other contexts, e.g. in KVM.
Sean Christopherson Oct. 13, 2021, 8:16 p.m. UTC | #4
On Wed, Oct 13, 2021, Sean Christopherson wrote:
> On Fri, Aug 20, 2021, Brijesh Singh wrote:
> > When SEV-SNP is enabled in the guest VM, the guest memory pages can
> > either be a private or shared. A write from the hypervisor goes through
> > the RMP checks. If hardware sees that hypervisor is attempting to write
> > to a guest private page, then it triggers an RMP violation #PF.
> > 
> > To avoid the RMP violation, add post_{map,unmap}_gfn() ops that can be
> > used to verify that its safe to map a given guest page. Use the SRCU to
> > protect against the page state change for existing mapped pages.
> 
> SRCU isn't protecting anything.  The synchronize_srcu_expedited() in the PSC code
> forces it to wait for existing maps to go away, but it doesn't prevent new maps
> from being created while the actual RMP updates are in-flight.  Most telling is
> that the RMP updates happen _after_ the synchronize_srcu_expedited() call.

Argh, another goof on my part.  Rereading prior feedback, I see that I loosely
suggested SRCU as a possible solution.  That was a bad, bad suggestion.  I think
(hope) I made it offhand without really thinking it through.  SRCU can't work in
this case, because the whole premise of Read-Copy-Update is that there can be
multiple copies of the data.  That simply can't be true for the RMP as hardware
operates on a single table.

In the future, please don't hesitate to push back on and/or question suggestions,
especially those that are made without concrete examples, i.e. are likely off the
cuff.  My goal isn't to set you up for failure :-/
Brijesh Singh Oct. 13, 2021, 9:49 p.m. UTC | #5
On 10/13/21 1:10 PM, Sean Christopherson wrote:
> On Wed, Oct 13, 2021, Brijesh Singh wrote:
>> On 10/12/21 5:23 PM, Sean Christopherson wrote:
>>> On Fri, Aug 20, 2021, Brijesh Singh wrote:
>>>> When SEV-SNP is enabled in the guest VM, the guest memory pages can
>>>> either be a private or shared. A write from the hypervisor goes through
>>>> the RMP checks. If hardware sees that hypervisor is attempting to write
>>>> to a guest private page, then it triggers an RMP violation #PF.
>>>>
>>>> To avoid the RMP violation, add post_{map,unmap}_gfn() ops that can be
>>>> used to verify that its safe to map a given guest page. Use the SRCU to
>>>> protect against the page state change for existing mapped pages.
>>> SRCU isn't protecting anything.  The synchronize_srcu_expedited() in the PSC code
>>> forces it to wait for existing maps to go away, but it doesn't prevent new maps
>>> from being created while the actual RMP updates are in-flight.  Most telling is
>>> that the RMP updates happen _after_ the synchronize_srcu_expedited() call.
>>>
>>> This also doesn't handle kvm_{read,write}_guest_cached().
>> Hmm, I thought the kvm_{read_write}_guest_cached() uses the
>> copy_{to,from}_user(). Writing to the private will cause a #PF and will
>> fail the copy_to_user(). Am I missing something?
> Doh, right you are.  I was thinking they cached the kmap, but it's just the
> gpa->hva that gets cached.
>
>>> I can't help but note that the private memslots idea[*] would handle this gracefully,
>>> e.g. the memslot lookup would fail, and any change in private memslots would
>>> invalidate the cache due to a generation mismatch.
>>>
>>> KSM is another mess that would Just Work.
>>>
>>> I'm not saying that SNP should be blocked on support for unmapping guest private
>>> memory, but I do think we should strongly consider focusing on that effort rather
>>> than trying to fix things piecemeal throughout KVM.  I don't think it's too absurd
>>> to say that it might actually be faster overall.  And I 100% think that having a
>>> cohesive design and uABI for SNP and TDX would be hugely beneficial to KVM.
>>>
>>> [*] https://lkml.kernel.org/r/20210824005248.200037-1-seanjc@google.com
>>>
>>>> +int sev_post_map_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int *token)
>>>> +{
>>>> +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>>>> +	int level;
>>>> +
>>>> +	if (!sev_snp_guest(kvm))
>>>> +		return 0;
>>>> +
>>>> +	*token = srcu_read_lock(&sev->psc_srcu);
>>>> +
>>>> +	/* If pfn is not added as private then fail */
>>> This comment and the pr_err() are backwards, and confused the heck out of me.
>>> snp_lookup_rmpentry() returns '1' if the pfn is assigned, a.k.a. private.  That
>>> means this code throws an error if the page is private, i.e. requires the page
>>> to be shared.  Which makes sense given the use cases, it's just incredibly
>>> confusing.
>> Actually I followed your recommendation from the previous feedback that
>> snp_lookup_rmpentry() should return 1 for the assigned page, 0 for the
>> shared and -negative for invalid. I can clarify it here  again.
>>
>>>> +	if (snp_lookup_rmpentry(pfn, &level) == 1) {
>>> Any reason not to provide e.g. rmp_is_shared() and rmp_is_private() so that
>>> callers don't have to care as much about the return values?  The -errno/0/1
>>> semantics are all but guarantee to bite us in the rear at some point.
>> If we look at the previous series, I had a macro rmp_is_assigned() for
>> exactly the same purpose but the feedback was to drop those macros and
>> return the state indirectly through the snp_lookup_rmpentry(). I can
>> certainly add a helper rmp_is_{shared,private}() if it makes code more
>> readable.
> Ah rats.  Bad communication on my side.  I didn't intended to have non-RMP code
> directly consume snp_lookup_rmpentry() for simple checks.  The goal behind the
> helper was to bury "struct rmpentry" so that it wasn't visible to the kernel at
> large.  I.e. my objection was that rmp_assigned() was looking directly at a
> non-architectural struct.
>
> My full thought for snp_lookup_rmpentry() was that it _could_ be consumed directly
> without exposing "struct rmpentry", but that simple, common use cases would provide
> wrappers, e.g.
>
> static inline snp_is_rmp_private(...)
> {
> 	return snp_lookup_rmpentry(...) == 1;
> }
>
> static inline snp_is_rmp_shared(...)
> {
> 	return snp_lookup_rmpentry(...) < 1;
> }

Yep, that's what I was going to do for the helper.


>
> Side topic, what do you think about s/assigned/private for the "public" APIs, as
> suggested/implied above?  I actually like the terminology when talking specifically
> about the RMP, but it doesn't fit the abstractions that tend to be used when talking
> about these things in other contexts, e.g. in KVM.

I can float the idea to see if the docs folks are okay with the changes,
but generally speaking we have all been referring to assigned == private
in the Linux SNP support patches.

thanks
Sean Christopherson Oct. 13, 2021, 10:10 p.m. UTC | #6
On Wed, Oct 13, 2021, Brijesh Singh wrote:
> 
> On 10/13/21 1:10 PM, Sean Christopherson wrote:
> > Side topic, what do you think about s/assigned/private for the "public" APIs, as
> > suggested/implied above?  I actually like the terminology when talking specifically
> > about the RMP, but it doesn't fit the abstractions that tend to be used when talking
> > about these things in other contexts, e.g. in KVM.
> 
> I can float the idea to see if docs folks is okay with the changes but
> generally speaking we all have been referring the assigned == private in
> the Linux SNP support patch.

Oh, I wasn't suggesting anything remotely close to an "official" change, it was
purely a suggestion for the host-side patches.  It might even be unique to this
one helper.
Brijesh Singh Oct. 13, 2021, 10:31 p.m. UTC | #7
On 10/13/21 3:10 PM, Sean Christopherson wrote:
> On Wed, Oct 13, 2021, Brijesh Singh wrote:
>> On 10/13/21 1:10 PM, Sean Christopherson wrote:
>>> Side topic, what do you think about s/assigned/private for the "public" APIs, as
>>> suggested/implied above?  I actually like the terminology when talking specifically
>>> about the RMP, but it doesn't fit the abstractions that tend to be used when talking
>>> about these things in other contexts, e.g. in KVM.
>> I can float the idea to see if docs folks is okay with the changes but
>> generally speaking we all have been referring the assigned == private in
>> the Linux SNP support patch.
> Oh, I wasn't suggesting anything remotely close to an "official" change, it was
> purely a suggestion for the host-side patches.  It might even be unique to this
> one helper.

Ah, sure, I am good with calling it private in the helper.

-Brijesh
Brijesh Singh Oct. 15, 2021, 4:31 p.m. UTC | #8
On 10/13/21 1:16 PM, Sean Christopherson wrote:
> On Wed, Oct 13, 2021, Sean Christopherson wrote:
>> On Fri, Aug 20, 2021, Brijesh Singh wrote:
>>> When SEV-SNP is enabled in the guest VM, the guest memory pages can
>>> either be a private or shared. A write from the hypervisor goes through
>>> the RMP checks. If hardware sees that hypervisor is attempting to write
>>> to a guest private page, then it triggers an RMP violation #PF.
>>>
>>> To avoid the RMP violation, add post_{map,unmap}_gfn() ops that can be
>>> used to verify that its safe to map a given guest page. Use the SRCU to
>>> protect against the page state change for existing mapped pages.
>> SRCU isn't protecting anything.  The synchronize_srcu_expedited() in the PSC code
>> forces it to wait for existing maps to go away, but it doesn't prevent new maps
>> from being created while the actual RMP updates are in-flight.  Most telling is
>> that the RMP updates happen _after_ the synchronize_srcu_expedited() call.
> Argh, another goof on my part.  Rereading prior feedback, I see that I loosely
> suggested SRCU as a possible solution.  That was a bad, bad suggestion.  I think
> (hope) I made it offhand without really thinking it through.  SRCU can't work in
> this case, because the whole premise of Read-Copy-Update is that there can be
> multiple copies of the data.  That simply can't be true for the RMP as hardware
> operates on a single table.
>
> In the future, please don't hesitate to push back on and/or question suggestions,
> especially those that are made without concrete examples, i.e. are likely off the
> cuff.  My goal isn't to set you up for failure :-/

What do you think about going back to my initial proposal of per-gfn
tracking [1]? We can limit the changes to just kvm_vcpu_map() and let
the copy_to_user() take a fault and return an error (if it attempts to
write to a guest private page). If a PSC happens while the lock is held,
then simply return and let the guest retry the PSC.

[1]
https://lore.kernel.org/lkml/20210707183616.5620-36-brijesh.singh@amd.com/
Sean Christopherson Oct. 15, 2021, 5:16 p.m. UTC | #9
On Fri, Oct 15, 2021, Brijesh Singh wrote:
> 
> On 10/13/21 1:16 PM, Sean Christopherson wrote:
> > On Wed, Oct 13, 2021, Sean Christopherson wrote:
> >> On Fri, Aug 20, 2021, Brijesh Singh wrote:
> >>> When SEV-SNP is enabled in the guest VM, the guest memory pages can
> >>> either be a private or shared. A write from the hypervisor goes through
> >>> the RMP checks. If hardware sees that hypervisor is attempting to write
> >>> to a guest private page, then it triggers an RMP violation #PF.
> >>>
> >>> To avoid the RMP violation, add post_{map,unmap}_gfn() ops that can be
> >>> used to verify that its safe to map a given guest page. Use the SRCU to
> >>> protect against the page state change for existing mapped pages.
> >> SRCU isn't protecting anything.  The synchronize_srcu_expedited() in the PSC code
> >> forces it to wait for existing maps to go away, but it doesn't prevent new maps
> >> from being created while the actual RMP updates are in-flight.  Most telling is
> >> that the RMP updates happen _after_ the synchronize_srcu_expedited() call.
> > Argh, another goof on my part.  Rereading prior feedback, I see that I loosely
> > suggested SRCU as a possible solution.  That was a bad, bad suggestion.  I think
> > (hope) I made it offhand without really thinking it through.  SRCU can't work in
> > this case, because the whole premise of Read-Copy-Update is that there can be
> > multiple copies of the data.  That simply can't be true for the RMP as hardware
> > operates on a single table.
> >
> > In the future, please don't hesitate to push back on and/or question suggestions,
> > especially those that are made without concrete examples, i.e. are likely off the
> > cuff.  My goal isn't to set you up for failure :-/
> 
> What do you think about going back to my initial proposal of per-gfn
> tracking [1] ? We can limit the changes to just for the kvm_vcpu_map()
> and let the copy_to_user() take a fault and return an error (if it
> attempt to write to guest private). If PSC happen while lock is held
> then simplify return and let the guest retry PSC.

That approach is also broken as it doesn't hold a lock when updating host_write_track,
e.g. the count can be corrupted if writers collide, and nothing blocks writers on
in-progress readers.

I'm not opposed to a scheme that blocks PSC while KVM is reading, but I don't want
to spend time iterating on the KVM case until consensus has been reached on how
exactly RMP updates will be handled, and in general how the kernel will manage
guest private memory.
Michael Roth Sept. 8, 2022, 9:21 p.m. UTC | #10
On Fri, Oct 15, 2021 at 05:16:28PM +0000, Sean Christopherson wrote:
> On Fri, Oct 15, 2021, Brijesh Singh wrote:
> > 
> > On 10/13/21 1:16 PM, Sean Christopherson wrote:
> > > On Wed, Oct 13, 2021, Sean Christopherson wrote:
> > >> On Fri, Aug 20, 2021, Brijesh Singh wrote:
> > >>> When SEV-SNP is enabled in the guest VM, the guest memory pages can
> > >>> either be a private or shared. A write from the hypervisor goes through
> > >>> the RMP checks. If hardware sees that hypervisor is attempting to write
> > >>> to a guest private page, then it triggers an RMP violation #PF.
> > >>>
> > >>> To avoid the RMP violation, add post_{map,unmap}_gfn() ops that can be
> > >>> used to verify that its safe to map a given guest page. Use the SRCU to
> > >>> protect against the page state change for existing mapped pages.
> > >> SRCU isn't protecting anything.  The synchronize_srcu_expedited() in the PSC code
> > >> forces it to wait for existing maps to go away, but it doesn't prevent new maps
> > >> from being created while the actual RMP updates are in-flight.  Most telling is
> > >> that the RMP updates happen _after_ the synchronize_srcu_expedited() call.
> > > Argh, another goof on my part.  Rereading prior feedback, I see that I loosely
> > > suggested SRCU as a possible solution.  That was a bad, bad suggestion.  I think
> > > (hope) I made it offhand without really thinking it through.  SRCU can't work in
> > > this case, because the whole premise of Read-Copy-Update is that there can be
> > > multiple copies of the data.  That simply can't be true for the RMP as hardware
> > > operates on a single table.
> > >
> > > In the future, please don't hesitate to push back on and/or question suggestions,
> > > especially those that are made without concrete examples, i.e. are likely off the
> > > cuff.  My goal isn't to set you up for failure :-/
> > 
> > What do you think about going back to my initial proposal of per-gfn
> > tracking [1] ? We can limit the changes to just for the kvm_vcpu_map()
> > and let the copy_to_user() take a fault and return an error (if it
> > attempt to write to guest private). If PSC happen while lock is held
> > then simplify return and let the guest retry PSC.
> 
> That approach is also broken as it doesn't hold a lock when updating host_write_track,
> e.g. the count can be corrupted if writers collide, and nothing blocks writers on
> in-progress readers.
> 
> I'm not opposed to a scheme that blocks PSC while KVM is reading, but I don't want
> to spend time iterating on the KVM case until consensus has been reached on how
> exactly RMP updates will be handled, and in general how the kernel will manage
> guest private memory.

Hi Sean,

(Sorry in advance for the long read, but it touches on a couple of
intertwined topics that I think are important for this series.)

While we remain committed to working with the community toward
implementing a UPM solution, we would still like to have a 'complete'
solution in the meantime, if only to use as a basis for building out
other parts of the stack, or for enabling early adopters downstream
to do the same.

Toward that end, there are a couple of areas that need to be addressed
(and remain unaddressed with v6, since we were planning to leverage UPM
to handle them and so left them as open TODOs at the time):

 1) this issue of guarding against shared->private conversions for pages
    that are in-use by kernel
 2) how to deal with unmapping/splitting the direct map

(These 2 things end up being fairly tightly coupled, which is why I'm
trying to cover them both in this email. Will explain more in a bit)

You mentioned elsewhere how 1) would be nicely addressed with UPM, since
the conversion would only point the guest to an entirely new page, while
leaving the shared page intact (at least until the normal notifiers do
their thing in response to any subsequent discard operations from
userspace). If the guest breaks because it doesn't see the write, that's
okay, because it's not supposed to be switching in-use shared pages to
private while they are being used. (though the pKVM folks did note in v7
of private memslot patches that they actually use the same physical page
for both shared/private, so I'm not sure if that approach will still
stack up in that context, if it ends up being needed there)

So in the context of this interim solution, we're trying to look for a
solution that's simple enough that it can be used reliably, without
introducing too much additional complexity into KVM. There is one
approach that seems to fit that bill, that Brijesh attempted in an
earlier version of this series (I'm not sure what exactly was the
catalyst to changing the approach, as I wasn't really in the loop at
the time, but AIUI there weren't any showstoppers there, but please
correct me if I'm missing anything):

 - if the host is writing to a page that it thinks is supposed to be
   shared, and the guest switches it to private, we get an RMP fault
   (actually, we will get a !PRESENT fault, since as of v5 we now
   remove the mapping from the directmap as part of conversion)
 - in the host #PF handler, if we see that the page is marked private
   in the RMP table, simply switch it back to shared
 - if this was a bug on the part of the host, then the guest will see
   the validated bit was cleared, and get a #VC where it can
   terminate accordingly
 - if this was a bug (or maliciousness) on the part of the guest,
   then the behavior is unexpected anyway, and it will be made
   aware of this by the same mechanism (see the sketch below)
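
A rough sketch of that fix-up; the placement and function naming are
assumptions, not the actual patch:

static bool snp_try_fixup_guest_flip(unsigned long pfn)
{
	int level;

	/* Only fix up pages the RMP reports as assigned/private. */
	if (snp_lookup_rmpentry(pfn, &level) != 1)
		return false;

	/*
	 * Flip the page back to shared; the guest will see the validated
	 * bit cleared and take a #VC, whether this was a host bug or
	 * guest misbehavior.
	 */
	host_rmp_make_shared(pfn, level, true);

	return true;	/* retry the faulting access */
}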

We've done some testing with this approach and it seems to hold up,
but since we also need the handler to deal with the !PRESENT case
(because of the current approach of nuking directmap entry for
private pages), there's a situation that can't be easily resolved
with this approach:

  # CASE1: recoverable
  
    THREAD 1              THREAD 2
    - #PF(!PRESENT)       - #PF(!PRESENT)
    - restore directmap
    - RMPUPDATE(shared)
                          - not private? okay to retry since maybe it was
                            fixed up already
  
  # CASE2: unrecoverable
  
    THREAD 1
    - broken kernel breaks directmap/init-mm, not related to SEV
    - #PF(!PRESENT)
    - not private? we should oops, since retrying might cause a loop
  
The reason we can't distinguish between recoverable/unrecoverable is
that the kernel #PF handling here needs to happen both for !PRESENT
faults and for RMP violations. This is due to how we unmap from the
directmap as part of shared->private conversion.

So we have to take the pessimistic approach due to CASE2, which
means that if CASE1 pops up, we'll still have a case where the guest
can cause a crash/loop.

That's where 2) comes into play. If we go back to the original approach
(as of v4) of simply splitting the directmap when we do shared->private
conversions, then we can limit the #PF handling to only the cases where
there's an RMP violation and X86_PF_RMP is set, which makes it safe to
retry and handle spurious cases like case #1.
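
A minimal sketch of that v4-style split, assuming set_memory_4k() as the
mechanism and a hypothetical helper in the conversion path:

static int snp_split_directmap_sketch(unsigned long paddr)
{
	unsigned long vaddr = (unsigned long)__va(paddr);

	/*
	 * Split any 2M/1G directmap entry down to 4K but keep it mapped,
	 * so a bad host access faults with X86_PF_RMP, not !PRESENT.
	 */
	return set_memory_4k(vaddr, 1);
}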

AIUI, this is still sort of an open question, but you noted how nuking
the directmap without any formalized interface for letting the kernel
know about it could be problematic down the road, which also sounds
like the sort of thing more suited for having UPM address at a more
general level, since there are similar requirements for TDX as well.

AIUI there are 2 main arguments against splitting the directmap:
 a) we can't easily rebuild it atm
 b) things like KSM might still try to access private pages

But unmapping also suffers from a), since we still end up splitting the
directmap unless we remove pages in blocks of 2M. But nothing prevents
a guest from switching a single 4K page to private, in which case we
are forced to split. That would be normal behavior on the part of the
guest for setting up GHCB pages/etc, so we still end up splitting the
directmap over time.

And for b), as you noted, this is already something that SEV/SEV-ES
need to deal with, and at least in the case of SNP guest things are
a bit better since:

  -  if some errant kernel thread tried to write to guest private
     memory, it would be noticed, since the #PF handler will flip the
     page and unset the validated bit on that memory.

  - for reads, they will get ciphertext, which isn't any worse than
    reading unencrypted guest memory for anything outside of KVM that
    specifically knows what it is expecting to read there (KSM being
    one exception, since it'll waste cycles cmp'ing ciphertext, but
    there's no way to make it work anyway, and we would be fine with
    requiring it to be disabled if that is a concern)

So that's sort of the context for why we're considering these 2 changes.
Any input on these is welcome, but at the very least we wanted to
provide our rationale for past reviewers.

Thanks!

-Mike
Michael Roth Sept. 8, 2022, 10:28 p.m. UTC | #11
On Thu, Sep 08, 2022 at 04:21:14PM -0500, Michael Roth wrote:
> On Fri, Oct 15, 2021 at 05:16:28PM +0000, Sean Christopherson wrote:
> > On Fri, Oct 15, 2021, Brijesh Singh wrote:
> > > 
> > > On 10/13/21 1:16 PM, Sean Christopherson wrote:
> > > > On Wed, Oct 13, 2021, Sean Christopherson wrote:
> > > >> On Fri, Aug 20, 2021, Brijesh Singh wrote:
> > > >>> When SEV-SNP is enabled in the guest VM, the guest memory pages can
> > > >>> either be a private or shared. A write from the hypervisor goes through
> > > >>> the RMP checks. If hardware sees that hypervisor is attempting to write
> > > >>> to a guest private page, then it triggers an RMP violation #PF.
> > > >>>
> > > >>> To avoid the RMP violation, add post_{map,unmap}_gfn() ops that can be
> > > >>> used to verify that its safe to map a given guest page. Use the SRCU to
> > > >>> protect against the page state change for existing mapped pages.
> > > >> SRCU isn't protecting anything.  The synchronize_srcu_expedited() in the PSC code
> > > >> forces it to wait for existing maps to go away, but it doesn't prevent new maps
> > > >> from being created while the actual RMP updates are in-flight.  Most telling is
> > > >> that the RMP updates happen _after_ the synchronize_srcu_expedited() call.
> > > > Argh, another goof on my part.  Rereading prior feedback, I see that I loosely
> > > > suggested SRCU as a possible solution.  That was a bad, bad suggestion.  I think
> > > > (hope) I made it offhand without really thinking it through.  SRCU can't work in
> > > > this case, because the whole premise of Read-Copy-Update is that there can be
> > > > multiple copies of the data.  That simply can't be true for the RMP as hardware
> > > > operates on a single table.
> > > >
> > > > In the future, please don't hesitate to push back on and/or question suggestions,
> > > > especially those that are made without concrete examples, i.e. are likely off the
> > > > cuff.  My goal isn't to set you up for failure :-/
> > > 
> > > What do you think about going back to my initial proposal of per-gfn
> > > tracking [1] ? We can limit the changes to just for the kvm_vcpu_map()
> > > and let the copy_to_user() take a fault and return an error (if it
> > > attempt to write to guest private). If PSC happen while lock is held
> > > then simplify return and let the guest retry PSC.
> > 
> > That approach is also broken as it doesn't hold a lock when updating host_write_track,
> > e.g. the count can be corrupted if writers collide, and nothing blocks writers on
> > in-progress readers.
> > 
> > I'm not opposed to a scheme that blocks PSC while KVM is reading, but I don't want
> > to spend time iterating on the KVM case until consensus has been reached on how
> > exactly RMP updates will be handled, and in general how the kernel will manage
> > guest private memory.
> 
> Hi Sean,
> 
> (Sorry in advance for the long read, but it touches on a couple
> inter-tangled topics that I think are important for this series.)
> 
> While we do still remain committed to working with community toward
> implementing a UPM solution, we would still like to have a 'complete'
> solution in the meantime, if at the very least to use as a basis for
> building out other parts of the stack, or for enabling early adopters
> downstream doing the same.
> 
> Toward that end, there are a couple areas that need to be addressed,
> (and remain unaddressed with v6, since we were planning to leverage UPM
> to handle them and so left them as open TODOs at the time):
> 
>  1) this issue of guarding against shared->private conversions for pages
>     that are in-use by kernel
>  2) how to deal with unmapping/splitting the direct map
> 
> (These 2 things end up being fairly tightly coupled, which is why I'm
> trying to cover them both in this email. Will explain more in a bit)
> 
> You mentioned elsewhere how 1) would be nicely addressed with UPM, since
> the conversion would only point the guest to an entirely new page, while
> leaving the shared page intact (at least until the normal notifiers do
> their thing in response to any subsequent discard operations from
> userspace). If the guest breaks because it doesn't see the write, that's
> okay, because it's not supposed to be switching in-use shared pages to
> private while they are being used. (though the pKVM folks did note in v7
> of private memslot patches that they actually use the same physical page
> for both shared/private, so I'm not sure if that approach will still
> stack up in that context, if it ends up being needed there)
> 
> So in the context of this interim solution, we're trying to look for a
> solution that's simple enough that it can be used reliably, without
> introducing too much additional complexity into KVM. There is one
> approach that seems to fit that bill, that Brijesh attempted in an
> earlier version of this series (I'm not sure what exactly was the
> catalyst to changing the approach, as I wasn't really in the loop at
> the time, but AIUI there weren't any showstoppers there, but please
> correct me if I'm missing anything):
> 
>  - if the host is writing to a page that it thinks is supposed to be
>    shared, and the guest switches it to private, we get an RMP fault
>    (actually, we will get a !PRESENT fault, since as of v5 we now
>    remove the mapping from the directmap as part of conversion)
>  - in the host #PF handler, if we see that the page is marked private
>    in the RMP table, simply switch it back to shared
>  - if this was a bug on the part of the host, then the guest will see
>    the validated bit was cleared, and get a #VC where it can
>    terminate accordingly
>  - if this was a bug (or maliciousness) on the part of the guest,
>    then the behavior is unexpected anyway, and it will be made
>    aware this by the same mechanism
> 
> We've done some testing with this approach and it seems to hold up,
> but since we also need the handler to deal with the !PRESENT case
> (because of the current approach of nuking directmap entry for
> private pages), there's a situation that can't be easily resolved
> with this approach:
> 
>   # CASE1: recoverable
>   
>     THREAD 1              THREAD 2
>     - #PF(!PRESENT)       - #PF(!PRESENT)
>     - restore directmap
>     - RMPUPDATE(shared)
>                           - not private? okay to retry since maybe it was
>                             fixed up already
>   
>   # CASE2: unrecoverable
>   
>     THREAD 1
>     - broken kernel breaks directmap/init-mm, not related to SEV
>     - #PF(!PRESENT)
>     - not private? we should oops, since retrying might cause a loop
>   
> The reason we can't distinguish between recoverable/unrecoverable is
> because the kernel #PF handling here needs to happen for both !PRESENT
> and for RMP violations. This is due to how we unmap from directmap as
> part of shared->private conversion.
> 
> So we have to take the pessimistic approach due to CASE2, which
> means that if CASE1 pops up, we'll still have a case where guest can
> cause a crash/loop.
> 
> That where 2) comes into play. If we go back to the original approach
> (as of v4) of simply splitting the directmap when we do shared->private
> conversions, then we can limit the #PF handling to only have to deal
> with cases where there's an RMP violation and X86_PF_RMP is set, which
> makes it safe to retry to handle spurious cases like case #1.
> 
> AIUI, this is still sort of an open question, but you noted how nuking
> the directmap without any formalized interface for letting the kernel
> know about it could be problematic down the road, which also sounds
> like the sort of thing more suited for having UPM address at a more
> general level, since there are similar requirements for TDX as well.
> 
> AIUI there are 2 main arguments against splitting the directmap:
>  a) we can't easily rebuild it atm
>  b) things like KSM might still tries to access private pages
> 
> But unmapping also suffers from a), since we still end up splitting the
> directmap unless we remove pages in blocks of 2M. But nothing prevents
> a guest from switching a single 4K page to private, in which case we
> are forced to split. That would be normal behavior on the part of the
> guest for setting up GHCB pages/etc, so we still end up splitting the
> directmap over time.

One more thing to note here: with the splitting approach, we also
have the option of doing all the splitting when the region is registered
via KVM_MEMORY_ENCRYPT_REG_REGION. If unsplitting becomes possible in
the future, we could then handle that all at once in
KVM_MEMORY_ENCRYPT_UNREG_REGION, where we'd have a bit more flexibility
in going about it.

-Mike

> 
> And for b), as you noted, this is already something that SEV/SEV-ES
> need to deal with, and at least in the case of SNP guest things are
> a bit better since:
> 
>   -  if some kernel thread tried to write to private memory, they
>      would notice if some errant kernel thread tried to write to guest
>      memory, since #PF handler with flip the page and unset the
>      validated bit on that memory.
> 
>   - for reads, they will get ciphertext, which isn't any worse than
>     reading unencrypted guest memory for anything outside of KVM that
>     specifically knows what it is expecting to read there (KSM being
>     one exception, since it'll waste cycles cmp'ing ciphertext, but
>     there's no way to make it work anyway, and we would be fine with
>     requiring it to be disabled if that is a concern)
> 
> So that's sort of the context for why we're considering these 2 changes.
> Any input on these is welcome, but at the very least wanted to provide
> our rationale for past reviewers.
> 
> Thanks!
> 
> -Mike
Sean Christopherson Sept. 14, 2022, 8:05 a.m. UTC | #12
On Thu, Sep 08, 2022, Michael Roth wrote:
> On Fri, Oct 15, 2021 at 05:16:28PM +0000, Sean Christopherson wrote:
> So in the context of this interim solution, we're trying to look for a
> solution that's simple enough that it can be used reliably, without
> introducing too much additional complexity into KVM. There is one
> approach that seems to fit that bill, that Brijesh attempted in an
> earlier version of this series (I'm not sure what exactly was the
> catalyst to changing the approach, as I wasn't really in the loop at
> the time, but AIUI there weren't any showstoppers there, but please
> correct me if I'm missing anything):
> 
>  - if the host is writing to a page that it thinks is supposed to be
>    shared, and the guest switches it to private, we get an RMP fault
>    (actually, we will get a !PRESENT fault, since as of v5 we now
>    remove the mapping from the directmap as part of conversion)
>  - in the host #PF handler, if we see that the page is marked private
>    in the RMP table, simply switch it back to shared
>  - if this was a bug on the part of the host, then the guest will see

As discussed off-list, attempting to fix up RMP violations in the host #PF handler
is not a viable approach.  There was also extensive discussion on-list a while back:

https://lore.kernel.org/all/8a244d34-2b10-4cf8-894a-1bf12b59cf92@www.fastmail.com

> AIUI, this is still sort of an open question, but you noted how nuking
> the directmap without any formalized interface for letting the kernel
> know about it could be problematic down the road, which also sounds
> like the sort of thing more suited for having UPM address at a more
> general level, since there are similar requirements for TDX as well.
> 
> AIUI there are 2 main arguments against splitting the directmap:
>  a) we can't easily rebuild it atm
>  b) things like KSM might still tries to access private pages
> 
> But unmapping also suffers from a), since we still end up splitting the
> directmap unless we remove pages in blocks of 2M.

But for UPM, it's easy to rebuild the direct map since there will be an explicit,
kernel-controlled point where the "inaccessible" memfd releases the private page.

> But nothing prevents a guest from switching a single 4K page to private, in
> which case we are forced to split. That would be normal behavior on the part
> of the guest for setting up GHCB pages/etc, so we still end up splitting the
> directmap over time.

The host actually isn't _forced_ to split with UPM.  One option would be to refuse
to split the direct map and instead force userspace to eat the 2mb allocation even
though it only wants to map a single 4kb chunk into the guest.  I don't know that
that's a _good_ option, but it is an option.
Marc Orr Sept. 14, 2022, 11:02 a.m. UTC | #13
On Wed, Sep 14, 2022 at 9:05 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Sep 08, 2022, Michael Roth wrote:
> > On Fri, Oct 15, 2021 at 05:16:28PM +0000, Sean Christopherson wrote:
> > So in the context of this interim solution, we're trying to look for a
> > solution that's simple enough that it can be used reliably, without
> > introducing too much additional complexity into KVM. There is one
> > approach that seems to fit that bill, that Brijesh attempted in an
> > earlier version of this series (I'm not sure what exactly was the
> > catalyst to changing the approach, as I wasn't really in the loop at
> > the time, but AIUI there weren't any showstoppers there, but please
> > correct me if I'm missing anything):
> >
> >  - if the host is writing to a page that it thinks is supposed to be
> >    shared, and the guest switches it to private, we get an RMP fault
> >    (actually, we will get a !PRESENT fault, since as of v5 we now
> >    remove the mapping from the directmap as part of conversion)
> >  - in the host #PF handler, if we see that the page is marked private
> >    in the RMP table, simply switch it back to shared
> >  - if this was a bug on the part of the host, then the guest will see
>
> As discussed off-list, attempting to fix up RMP violations in the host #PF handler
> is not a viable approach.  There was also extensive discussion on-list a while back:
>
> https://lore.kernel.org/all/8a244d34-2b10-4cf8-894a-1bf12b59cf92@www.fastmail.com

I mentioned this during Mike's talk at the micro-conference: for pages
mapped in by the kernel, can we disallow converting them to private?
Note, userspace accesses are already handled by UPM.

In pseudo-code, I'm thinking something like this:

kmap_helper() {
  // And all other interfaces where the kernel can map a GPA
  // into the kernel page tables
  mapped_into_kernel_mem_set[hpa] = true;
}

kunmap_helper() {
  // And all other interfaces where the kernel can unmap a GPA
  // into the kernel page tables
  mapped_into_kernel_mem_set[hpa] = false;

  // Except it's not this simple because we probably need ref counting
  // for multiple mappings. Sigh. But you get the idea.
}

rmpupdate_helper() {
  if (conversion == SHARED_TO_PRIVATE && mapped_into_kernel_mem_set[hpa])
    return -EINVAL;  // Or whatever the appropriate error code here is.
}
Sean Christopherson Sept. 14, 2022, 4:15 p.m. UTC | #14
On Wed, Sep 14, 2022, Marc Orr wrote:
> On Wed, Sep 14, 2022 at 9:05 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Sep 08, 2022, Michael Roth wrote:
> > > On Fri, Oct 15, 2021 at 05:16:28PM +0000, Sean Christopherson wrote:
> > > So in the context of this interim solution, we're trying to look for a
> > > solution that's simple enough that it can be used reliably, without
> > > introducing too much additional complexity into KVM. There is one
> > > approach that seems to fit that bill, that Brijesh attempted in an
> > > earlier version of this series (I'm not sure what exactly was the
> > > catalyst to changing the approach, as I wasn't really in the loop at
> > > the time, but AIUI there weren't any showstoppers there, but please
> > > correct me if I'm missing anything):
> > >
> > >  - if the host is writing to a page that it thinks is supposed to be
> > >    shared, and the guest switches it to private, we get an RMP fault
> > >    (actually, we will get a !PRESENT fault, since as of v5 we now
> > >    remove the mapping from the directmap as part of conversion)
> > >  - in the host #PF handler, if we see that the page is marked private
> > >    in the RMP table, simply switch it back to shared
> > >  - if this was a bug on the part of the host, then the guest will see
> >
> > As discussed off-list, attempting to fix up RMP violations in the host #PF handler
> > is not a viable approach.  There was also extensive discussion on-list a while back:
> >
> > https://lore.kernel.org/all/8a244d34-2b10-4cf8-894a-1bf12b59cf92@www.fastmail.com
> 
> I mentioned this during Mike's talk at the micro-conference: For pages
> mapped in by the kernel can we disallow them to be converted to
> private?

In theory, yes.  Do we want to do something like this?  No.  kmap() does something
vaguely similar for 32-bit PAE/PSE kernels, but that's a lot of complexity and
overhead to take on.  And this issue goes far beyond a kmap(); when the kernel gup()s
a page, the kernel expects the pfn to be available, no exceptions (pun intended).

> Note, userspace accesses are already handled by UPM.

I'm confused by the UPM comment.  Isn't the gist of this thread about the ability
to merge SNP _without_ UPM?  Or am I out in left field?

> In pseudo-code, I'm thinking something like this:
> 
> kmap_helper() {
>   // And all other interfaces where the kernel can map a GPA
>   // into the kernel page tables
>   mapped_into_kernel_mem_set[hpa] = true;
> }
> 
> kunmap_helper() {
>   // And all other interfaces where the kernel can unmap a GPA
>   // into the kernel page tables
>   mapped_into_kernel_mem_set[hpa] = false;
> 
>   // Except it's not this simple because we probably need ref counting
>   // for multiple mappings. Sigh. But you get the idea.

A few issues off the top of my head:

  - It's not just refcounting, there would also likely need to be locking to
    guarantee sane behavior.
  - kmap() isn't allowed to fail and RMPUPDATE isn't strictly guaranteed to succeed,
    which is problematic if the kernel attempts to kmap() a page that's already
    private, especially for kmap_atomic(), which isn't allowed to sleep.
  - Not all kernel code is well behaved and bounces through kmap(); undoubtedly
    some of the 1200+ users of page_address() will be problematic.
    
    $ git grep page_address | wc -l
    1267
  - It's not sufficient for TDX.  Merging something this complicated when we know
    we still need UPM would be irresponsible from a maintenance perspective.
  - KVM would need to support two separate APIs for SNP, which I very much don't
    want to do.
Marc Orr Sept. 14, 2022, 4:32 p.m. UTC | #15
On Wed, Sep 14, 2022 at 5:15 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Sep 14, 2022, Marc Orr wrote:
> > On Wed, Sep 14, 2022 at 9:05 AM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Thu, Sep 08, 2022, Michael Roth wrote:
> > > > On Fri, Oct 15, 2021 at 05:16:28PM +0000, Sean Christopherson wrote:
> > > > So in the context of this interim solution, we're trying to look for a
> > > > solution that's simple enough that it can be used reliably, without
> > > > introducing too much additional complexity into KVM. There is one
> > > > approach that seems to fit that bill, that Brijesh attempted in an
> > > > earlier version of this series (I'm not sure what exactly was the
> > > > catalyst to changing the approach, as I wasn't really in the loop at
> > > > the time, but AIUI there weren't any showstoppers there, but please
> > > > correct me if I'm missing anything):
> > > >
> > > >  - if the host is writing to a page that it thinks is supposed to be
> > > >    shared, and the guest switches it to private, we get an RMP fault
> > > >    (actually, we will get a !PRESENT fault, since as of v5 we now
> > > >    remove the mapping from the directmap as part of conversion)
> > > >  - in the host #PF handler, if we see that the page is marked private
> > > >    in the RMP table, simply switch it back to shared
> > > >  - if this was a bug on the part of the host, then the guest will see
> > >
> > > As discussed off-list, attempting to fix up RMP violations in the host #PF handler
> > > is not a viable approach.  There was also extensive discussion on-list a while back:
> > >
> > > https://lore.kernel.org/all/8a244d34-2b10-4cf8-894a-1bf12b59cf92@www.fastmail.com
> >
> > I mentioned this during Mike's talk at the micro-conference: For pages
> > mapped in by the kernel can we disallow them to be converted to
> > private?
>
> In theory, yes.  Do we want to do something like this?  No.  kmap() does something
> vaguely similar for 32-bit PAE/PSE kernels, but that's a lot of complexity and
> overhead to take on.  And this issue goes far beyond a kmap(); when the kernel gup()s
> a page, the kernel expects the pfn to be available, no exceptions (pun intended).
>
> > Note, userspace accesses are already handled by UPM.
>
> I'm confused by the UPM comment.  Isn't the gist of this thread about the ability
> to merge SNP _without_ UPM?  Or am I out in left field?

I think that was the overall gist: yes. But it's not what I was trying
to comment on :-).

HOWEVER, thinking about this more: I was confused when I wrote out my
last reply. I had thought that the issue that Michael brought up
applied even with UPM. That is, I was thinking it was still possible
for a guest to maliciously convert to private a page that is mapped in
by the kernel and assumed to be shared.

But I now realize that is not what will actually happen. To be
concrete, let's assume the GHCB page. What will happen is:
- KVM has GHCB page mapped in. GHCB is always assumed to be shared. So
far so good.
- Malicious guest converts GHCB page to private (e.g., via Page State
Change request)
- Guest exits to KVM
- KVM exits to userspace VMM
- Userspace VMM allocates page in private FD.

Now, what happens here depends on how UPM works. If we allow double
allocation then our host kernel is safe. However, now we have the
"double allocation problem".

If on the other hand, we deallocate the page in the shared FD, the
host kernel can segfault. And now we actually do have essentially the
same problem Michael was describing that we have without UPM. Because
we'll end up in fault.c in the kernel context and likely panic the
host.

I hope I got this right this time. Sorry for the confusion on my last reply.

> > In pseudo-code, I'm thinking something like this:
> >
> > kmap_helper() {
> >   // And all other interfaces where the kernel can map a GPA
> >   // into the kernel page tables
> >   mapped_into_kernel_mem_set[hpa] = true;
> > }
> >
> > kunmap_helper() {
> >   // And all other interfaces where the kernel can unmap a GPA
> >   // into the kernel page tables
> >   mapped_into_kernel_mem_set[hpa] = false;
> >
> >   // Except it's not this simple because we probably need ref counting
> >   // for multiple mappings. Sigh. But you get the idea.
>
> A few issues off the top of my head:
>
>   - It's not just refcounting, there would also likely need to be locking to
>     guarantee sane behavior.
>   - kmap() isn't allowed to fail and RMPUPDATE isn't strictly guaranteed to succeed,
>     which is problematic if the kernel attempts to kmap() a page that's already
>     private, especially for kmap_atomic(), which isn't allowed to sleep.
>   - Not all kernel code is well behaved and bounces through kmap(); undoubtedly
>     some of the 1200+ users of page_address() will be problematic.
>
>     $ git grep page_address | wc -l
>     1267
>   - It's not sufficient for TDX.  Merging something this complicated when we know
>     we still need UPM would be irresponsible from a maintenance perspective.
>   - KVM would need to support two separate APIs for SNP, which I very much don't
>     want to do.

Ack on merging without UPM. I wasn't trying to chime in on merging
before/after UPM. See my other comment above. Sorry for the confusion.
Ack on other concerns about "enlightening kmap" as well. I agree.
Marc Orr Sept. 14, 2022, 4:39 p.m. UTC | #16
On Wed, Sep 14, 2022 at 5:32 PM Marc Orr <marcorr@google.com> wrote:
>
> On Wed, Sep 14, 2022 at 5:15 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, Sep 14, 2022, Marc Orr wrote:
> > > On Wed, Sep 14, 2022 at 9:05 AM Sean Christopherson <seanjc@google.com> wrote:
> > > >
> > > > On Thu, Sep 08, 2022, Michael Roth wrote:
> > > > > On Fri, Oct 15, 2021 at 05:16:28PM +0000, Sean Christopherson wrote:
> > > > > So in the context of this interim solution, we're trying to look for a
> > > > > solution that's simple enough that it can be used reliably, without
> > > > > introducing too much additional complexity into KVM. There is one
> > > > > approach that seems to fit that bill, that Brijesh attempted in an
> > > > > earlier version of this series (I'm not sure what exactly was the
> > > > > catalyst to changing the approach, as I wasn't really in the loop at
> > > > > the time, but AIUI there weren't any showstoppers there, but please
> > > > > correct me if I'm missing anything):
> > > > >
> > > > >  - if the host is writing to a page that it thinks is supposed to be
> > > > >    shared, and the guest switches it to private, we get an RMP fault
> > > > >    (actually, we will get a !PRESENT fault, since as of v5 we now
> > > > >    remove the mapping from the directmap as part of conversion)
> > > > >  - in the host #PF handler, if we see that the page is marked private
> > > > >    in the RMP table, simply switch it back to shared
> > > > >  - if this was a bug on the part of the host, then the guest will see
> > > >
> > > > As discussed off-list, attempting to fix up RMP violations in the host #PF handler
> > > > is not a viable approach.  There was also extensive discussion on-list a while back:
> > > >
> > > > https://lore.kernel.org/all/8a244d34-2b10-4cf8-894a-1bf12b59cf92@www.fastmail.com
> > >
> > > I mentioned this during Mike's talk at the micro-conference: For pages
> > > mapped in by the kernel can we disallow them to be converted to
> > > private?
> >
> > In theory, yes.  Do we want to do something like this?  No.  kmap() does something
> > vaguely similar for 32-bit PAE/PSE kernels, but that's a lot of complexity and
> > overhead to take on.  And this issue goes far beyond a kmap(); when the kernel gup()s
> > a page, the kernel expects the pfn to be available, no exceptions (pun intended).
> >
> > > Note, userspace accesses are already handled by UPM.
> >
> > I'm confused by the UPM comment.  Isn't the gist of this thread about the ability
> > to merge SNP _without_ UPM?  Or am I out in left field?
>
> I think that was the overall gist: yes. But it's not what I was trying
> to comment on :-).
>
> HOWEVER, thinking about this more: I was confused when I wrote out my
> last reply. I had thought that the issue that Michael brought up
> applied even with UPM. That is, I was thinking it was still possible
> for a guest to maliciously convert to private a page that is mapped in
> by the kernel and assumed to be shared.
>
> But I now realize that is not what will actually happen. To be
> concrete, let's assume the GHCB page. What will happen is:
> - KVM has GHCB page mapped in. GHCB is always assumed to be shared. So
> far so good.
> - Malicious guest converts GHCB page to private (e.g., via Page State
> Change request)
> - Guest exits to KVM
> - KVM exits to userspace VMM
> - Userspace VMM allocates page in private FD.
>
> Now, what happens here depends on how UPM works. If we allow double
> allocation then our host kernel is safe. However, now we have the
> "double allocation problem".
>
> If on the other hand, we deallocate the page in the shared FD, the
> host kernel can segfault. And now we actually do have essentially the
> same problem Michael was describing that we have without UPM. Because
> we'll end up in fault.c in the kernel context and likely panic the
> host.

Thinking about this even more... Even if we deallocate in the
userspace VMM's shared FD, the kernel has its own page tables --
right? So maybe we are actually 100% OK under UPM then regardless of
the userspace VMM's policy around managing the private and shared FDs.
Michael Roth Sept. 19, 2022, 5:56 p.m. UTC | #17
On Wed, Sep 14, 2022 at 08:05:49AM +0000, Sean Christopherson wrote:
> On Thu, Sep 08, 2022, Michael Roth wrote:
> > On Fri, Oct 15, 2021 at 05:16:28PM +0000, Sean Christopherson wrote:
> > So in the context of this interim solution, we're trying to look for a
> > solution that's simple enough that it can be used reliably, without
> > introducing too much additional complexity into KVM. There is one
> > approach that seems to fit that bill, that Brijesh attempted in an
> > earlier version of this series (I'm not sure what exactly was the
> > catalyst to changing the approach, as I wasn't really in the loop at
> > the time, but AIUI there weren't any showstoppers there, but please
> > correct me if I'm missing anything):
> > 
> >  - if the host is writing to a page that it thinks is supposed to be
> >    shared, and the guest switches it to private, we get an RMP fault
> >    (actually, we will get a !PRESENT fault, since as of v5 we now
> >    remove the mapping from the directmap as part of conversion)
> >  - in the host #PF handler, if we see that the page is marked private
> >    in the RMP table, simply switch it back to shared
> >  - if this was a bug on the part of the host, then the guest will see

Hi Sean,

Thanks for the input here and at KVM Forum.

> 
> As discussed off-list, attempting to fix up RMP violations in the host #PF handler
> is not a viable approach.  There was also extensive discussion on-list a while back:
> 
> https://lore.kernel.org/all/8a244d34-2b10-4cf8-894a-1bf12b59cf92@www.fastmail.com

I think that was likely the only hope for a non-UPM approach, as
anything else would require a good bit of infrastructure in KVM and
elsewhere to avoid that situation occurring to begin with, and it
probably would not be worth the effort outside the context of a
general/platform-independent solution like UPM. I was hoping it would be
possible to work through Andy's concerns, but the concerns you and Paolo
raised of potential surprise #PFs in other parts of the kernel are
something I'm less optimistic about, so I agree UPM is probably the right
place to focus efforts.

> 
> > AIUI, this is still sort of an open question, but you noted how nuking
> > the directmap without any formalized interface for letting the kernel
> > know about it could be problematic down the road, which also sounds
> > like the sort of thing more suited for having UPM address at a more
> > general level, since there are similar requirements for TDX as well.
> > 
> > AIUI there are 2 main arguments against splitting the directmap:
> >  a) we can't easily rebuild it atm
> >  b) things like KSM might still try to access private pages
> > 
> > But unmapping also suffers from a), since we still end up splitting the
> > directmap unless we remove pages in blocks of 2M.
> 
> But for UPM, it's easy to rebuild the direct map since there will be an explicit,
> kernel controlled point where the "inaccesible" memfd releases the private page.

I was thinking it would be possible to do something similar by doing page
splitting/restore in bulk as part of MEM_ENCRYPT_{REG,UNREG}_REGION, but
yes UPM also allows for a convenient point in time to split/unsplit.
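
Roughly what I had in mind, as a sketch only -- assuming the region
registered via KVM_MEMORY_ENCRYPT_REG_REGION is already pinned into a
struct page array like the existing SEV enc_region handling does:

#include <linux/mm.h>
#include <asm/set_memory.h>

/* Illustrative: split the direct map down to 4K for every page up front. */
static int snp_split_directmap_for_region(struct page **pages,
                                          unsigned long npages)
{
        unsigned long i;
        int ret;

        for (i = 0; i < npages; i++) {
                unsigned long vaddr = (unsigned long)page_address(pages[i]);

                ret = set_memory_4k(vaddr, 1);
                if (ret)
                        return ret;
        }

        return 0;
}

The unsplit side at UNREG_REGION time is the harder part of course,
since there's currently no good way to merge split direct map entries
back to 2M -- same as the "can't easily rebuild it" issue above.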

> 
> > But nothing prevents a guest from switching a single 4K page to private, in
> > which case we are forced to split. That would be normal behavior on the part
> > of the guest for setting up GHCB pages/etc, so we still end up splitting the
> > directmap over time.
> 
> The host actually isn't _forced_ to split with UPM.  One option would be to refuse
> to split the direct map and instead force userspace to eat the 2mb allocation even
> though it only wants to map a single 4kb chunk into the guest.  I don't know that
> that's a _good_ option, but it is an option.

That does seem like a reasonable option. Maybe it also opens up a path
for hugetlbfs support of sorts. In practice I wouldn't expect too many
of those pages to be wasted; worst case would be 2MB per shared page in
the guest... I suppose that could add up for GHCB pages and whatnot if
there are lots of vCPUs, but at that point you're likely dealing with
large guests with enough memory to spare. Could be another pain point
regarding calculating appropriate memory limits for userspace though.
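
Back-of-the-envelope, assuming one GHCB page per vCPU: a 256-vCPU
guest would eat at most 256 * 2MB = 512MB under that policy, plus
whatever other 4K shared pages the guest converts.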

Thanks!

-Mike
diff mbox series

Patch

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 371756c7f8f4..c09bd40e0160 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -124,6 +124,8 @@  KVM_X86_OP(msr_filter_changed)
 KVM_X86_OP_NULL(complete_emulated_msr)
 KVM_X86_OP(alloc_apic_backing_page)
 KVM_X86_OP_NULL(rmp_page_level_adjust)
+KVM_X86_OP(post_map_gfn)
+KVM_X86_OP(post_unmap_gfn)
 
 #undef KVM_X86_OP
 #undef KVM_X86_OP_NULL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a6e764458f3e..5ac1ff097e8c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1463,7 +1463,11 @@  struct kvm_x86_ops {
 	void (*vcpu_deliver_sipi_vector)(struct kvm_vcpu *vcpu, u8 vector);
 
 	void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
+
 	void (*rmp_page_level_adjust)(struct kvm *kvm, kvm_pfn_t pfn, int *level);
+
+	int (*post_map_gfn)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int *token);
+	void (*post_unmap_gfn)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int token);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 0de85ed63e9b..65b578463271 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -336,6 +336,7 @@  static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
 		if (ret)
 			goto e_free;
 
+		init_srcu_struct(&sev->psc_srcu);
 		ret = sev_snp_init(&argp->error);
 	} else {
 		ret = sev_platform_init(&argp->error);
@@ -2293,6 +2294,7 @@  void sev_vm_destroy(struct kvm *kvm)
 			WARN_ONCE(1, "Failed to free SNP guest context, leaking asid!\n");
 			return;
 		}
+		cleanup_srcu_struct(&sev->psc_srcu);
 	} else {
 		sev_unbind_asid(kvm, sev->handle);
 	}
@@ -2494,23 +2496,32 @@  void sev_free_vcpu(struct kvm_vcpu *vcpu)
 	kfree(svm->ghcb_sa);
 }
 
-static inline int svm_map_ghcb(struct vcpu_svm *svm, struct kvm_host_map *map)
+static inline int svm_map_ghcb(struct vcpu_svm *svm, struct kvm_host_map *map, int *token)
 {
 	struct vmcb_control_area *control = &svm->vmcb->control;
 	u64 gfn = gpa_to_gfn(control->ghcb_gpa);
+	struct kvm_vcpu *vcpu = &svm->vcpu;
 
-	if (kvm_vcpu_map(&svm->vcpu, gfn, map)) {
+	if (kvm_vcpu_map(vcpu, gfn, map)) {
 		/* Unable to map GHCB from guest */
 		pr_err("error mapping GHCB GFN [%#llx] from guest\n", gfn);
 		return -EFAULT;
 	}
 
+	if (sev_post_map_gfn(vcpu->kvm, map->gfn, map->pfn, token)) {
+		kvm_vcpu_unmap(vcpu, map, false);
+		return -EBUSY;
+	}
+
 	return 0;
 }
 
-static inline void svm_unmap_ghcb(struct vcpu_svm *svm, struct kvm_host_map *map)
+static inline void svm_unmap_ghcb(struct vcpu_svm *svm, struct kvm_host_map *map, int token)
 {
-	kvm_vcpu_unmap(&svm->vcpu, map, true);
+	struct kvm_vcpu *vcpu = &svm->vcpu;
+
+	kvm_vcpu_unmap(vcpu, map, true);
+	sev_post_unmap_gfn(vcpu->kvm, map->gfn, map->pfn, token);
 }
 
 static void dump_ghcb(struct vcpu_svm *svm)
@@ -2518,8 +2529,9 @@  static void dump_ghcb(struct vcpu_svm *svm)
 	struct kvm_host_map map;
 	unsigned int nbits;
 	struct ghcb *ghcb;
+	int token;
 
-	if (svm_map_ghcb(svm, &map))
+	if (svm_map_ghcb(svm, &map, &token))
 		return;
 
 	ghcb = map.hva;
@@ -2544,7 +2556,7 @@  static void dump_ghcb(struct vcpu_svm *svm)
 	pr_err("%-20s%*pb\n", "valid_bitmap", nbits, ghcb->save.valid_bitmap);
 
 e_unmap:
-	svm_unmap_ghcb(svm, &map);
+	svm_unmap_ghcb(svm, &map, token);
 }
 
 static bool sev_es_sync_to_ghcb(struct vcpu_svm *svm)
@@ -2552,8 +2564,9 @@  static bool sev_es_sync_to_ghcb(struct vcpu_svm *svm)
 	struct kvm_vcpu *vcpu = &svm->vcpu;
 	struct kvm_host_map map;
 	struct ghcb *ghcb;
+	int token;
 
-	if (svm_map_ghcb(svm, &map))
+	if (svm_map_ghcb(svm, &map, &token))
 		return false;
 
 	ghcb = map.hva;
@@ -2579,7 +2592,7 @@  static bool sev_es_sync_to_ghcb(struct vcpu_svm *svm)
 
 	trace_kvm_vmgexit_exit(svm->vcpu.vcpu_id, ghcb);
 
-	svm_unmap_ghcb(svm, &map);
+	svm_unmap_ghcb(svm, &map, token);
 
 	return true;
 }
@@ -2636,8 +2649,9 @@  static int sev_es_validate_vmgexit(struct vcpu_svm *svm, u64 *exit_code)
 	struct kvm_vcpu *vcpu = &svm->vcpu;
 	struct kvm_host_map map;
 	struct ghcb *ghcb;
+	int token;
 
-	if (svm_map_ghcb(svm, &map))
+	if (svm_map_ghcb(svm, &map, &token))
 		return -EFAULT;
 
 	ghcb = map.hva;
@@ -2739,7 +2753,7 @@  static int sev_es_validate_vmgexit(struct vcpu_svm *svm, u64 *exit_code)
 
 	sev_es_sync_from_ghcb(svm, ghcb);
 
-	svm_unmap_ghcb(svm, &map);
+	svm_unmap_ghcb(svm, &map, token);
 	return 0;
 
 vmgexit_err:
@@ -2760,7 +2774,7 @@  static int sev_es_validate_vmgexit(struct vcpu_svm *svm, u64 *exit_code)
 	vcpu->run->internal.data[0] = *exit_code;
 	vcpu->run->internal.data[1] = vcpu->arch.last_vmentry_cpu;
 
-	svm_unmap_ghcb(svm, &map);
+	svm_unmap_ghcb(svm, &map, token);
 	return -EINVAL;
 }
 
@@ -3036,6 +3050,9 @@  static int __snp_handle_page_state_change(struct kvm_vcpu *vcpu, enum psc_op op,
 				return PSC_UNDEF_ERR;
 		}
 
+		/* Wait for all the existing mapped gfn to unmap */
+		synchronize_srcu_expedited(&sev->psc_srcu);
+
 		write_lock(&kvm->mmu_lock);
 
 		rc = kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &npt_level);
@@ -3604,3 +3621,33 @@  void sev_rmp_page_level_adjust(struct kvm *kvm, kvm_pfn_t pfn, int *level)
 	/* Adjust the level to keep the NPT and RMP in sync */
 	*level = min_t(size_t, *level, rmp_level);
 }
+
+int sev_post_map_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int *token)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	int level;
+
+	if (!sev_snp_guest(kvm))
+		return 0;
+
+	*token = srcu_read_lock(&sev->psc_srcu);
+
+	/* If pfn is not added as private then fail */
+	if (snp_lookup_rmpentry(pfn, &level) == 1) {
+		srcu_read_unlock(&sev->psc_srcu, *token);
+		pr_err_ratelimited("failed to map private gfn 0x%llx pfn 0x%llx\n", gfn, pfn);
+		return -EBUSY;
+	}
+
+	return 0;
+}
+
+void sev_post_unmap_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int token)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+
+	if (!sev_snp_guest(kvm))
+		return;
+
+	srcu_read_unlock(&sev->psc_srcu, token);
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 5f73f21a37a1..3784d389247b 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4679,7 +4679,11 @@  static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector,
 
 	.alloc_apic_backing_page = svm_alloc_apic_backing_page,
+
 	.rmp_page_level_adjust = sev_rmp_page_level_adjust,
+
+	.post_map_gfn = sev_post_map_gfn,
+	.post_unmap_gfn = sev_post_unmap_gfn,
 };
 
 static struct kvm_x86_init_ops svm_init_ops __initdata = {
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index d10f7166b39d..ff91184f9b4a 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -76,16 +76,22 @@  struct kvm_sev_info {
 	bool active;		/* SEV enabled guest */
 	bool es_active;		/* SEV-ES enabled guest */
 	bool snp_active;	/* SEV-SNP enabled guest */
+
 	unsigned int asid;	/* ASID used for this guest */
 	unsigned int handle;	/* SEV firmware handle */
 	int fd;			/* SEV device fd */
+
 	unsigned long pages_locked; /* Number of pages locked */
 	struct list_head regions_list;  /* List of registered regions */
+
 	u64 ap_jump_table;	/* SEV-ES AP Jump Table address */
+
 	struct kvm *enc_context_owner; /* Owner of copied encryption context */
 	struct misc_cg *misc_cg; /* For misc cgroup accounting */
+
 	u64 snp_init_flags;
 	void *snp_context;      /* SNP guest context page */
+	struct srcu_struct psc_srcu;
 };
 
 struct kvm_svm {
@@ -618,6 +624,8 @@  void sev_es_prepare_guest_switch(struct vcpu_svm *svm, unsigned int cpu);
 void sev_es_unmap_ghcb(struct vcpu_svm *svm);
 struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu);
 void sev_rmp_page_level_adjust(struct kvm *kvm, kvm_pfn_t pfn, int *level);
+int sev_post_map_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int *token);
+void sev_post_unmap_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int token);
 
 /* vmenter.S */
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index afcdc75a99f2..bf4389ffc88f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3095,6 +3095,65 @@  static inline bool kvm_pv_async_pf_enabled(struct kvm_vcpu *vcpu)
 	return (vcpu->arch.apf.msr_en_val & mask) == mask;
 }
 
+static int kvm_map_gfn_protected(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map,
+				 struct gfn_to_pfn_cache *cache, bool atomic, int *token)
+{
+	int ret;
+
+	ret = kvm_map_gfn(vcpu, gfn, map, cache, atomic);
+	if (ret)
+		return ret;
+
+	if (kvm_x86_ops.post_map_gfn) {
+		ret = static_call(kvm_x86_post_map_gfn)(vcpu->kvm, map->gfn, map->pfn, token);
+		if (ret)
+			kvm_unmap_gfn(vcpu, map, cache, false, atomic);
+	}
+
+	return ret;
+}
+
+static int kvm_unmap_gfn_protected(struct kvm_vcpu *vcpu, struct kvm_host_map *map,
+				   struct gfn_to_pfn_cache *cache, bool dirty,
+				   bool atomic, int token)
+{
+	int ret;
+
+	ret = kvm_unmap_gfn(vcpu, map, cache, dirty, atomic);
+
+	if (kvm_x86_ops.post_unmap_gfn)
+		static_call(kvm_x86_post_unmap_gfn)(vcpu->kvm, map->gfn, map->pfn, token);
+
+	return ret;
+}
+
+static int kvm_vcpu_map_protected(struct kvm_vcpu *vcpu, gpa_t gpa, struct kvm_host_map *map,
+				  int *token)
+{
+	int ret;
+
+	ret = kvm_vcpu_map(vcpu, gpa, map);
+	if (ret)
+		return ret;
+
+	if (kvm_x86_ops.post_map_gfn) {
+		ret = static_call(kvm_x86_post_map_gfn)(vcpu->kvm, map->gfn, map->pfn, token);
+		if (ret)
+			kvm_vcpu_unmap(vcpu, map, false);
+	}
+
+	return ret;
+}
+
+static void kvm_vcpu_unmap_protected(struct kvm_vcpu *vcpu, struct kvm_host_map *map,
+				     bool dirty, int token)
+{
+	kvm_vcpu_unmap(vcpu, map, dirty);
+
+	if (kvm_x86_ops.post_unmap_gfn)
+		static_call(kvm_x86_post_unmap_gfn)(vcpu->kvm, map->gfn, map->pfn, token);
+}
+
 static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
 {
 	gpa_t gpa = data & ~0x3f;
@@ -3185,6 +3244,7 @@  static void record_steal_time(struct kvm_vcpu *vcpu)
 {
 	struct kvm_host_map map;
 	struct kvm_steal_time *st;
+	int token;
 
 	if (kvm_xen_msr_enabled(vcpu->kvm)) {
 		kvm_xen_runstate_set_running(vcpu);
@@ -3195,8 +3255,8 @@  static void record_steal_time(struct kvm_vcpu *vcpu)
 		return;
 
 	/* -EAGAIN is returned in atomic context so we can just return. */
-	if (kvm_map_gfn(vcpu, vcpu->arch.st.msr_val >> PAGE_SHIFT,
-			&map, &vcpu->arch.st.cache, false))
+	if (kvm_map_gfn_protected(vcpu, vcpu->arch.st.msr_val >> PAGE_SHIFT,
+				  &map, &vcpu->arch.st.cache, false, &token))
 		return;
 
 	st = map.hva +
@@ -3234,7 +3294,7 @@  static void record_steal_time(struct kvm_vcpu *vcpu)
 
 	st->version += 1;
 
-	kvm_unmap_gfn(vcpu, &map, &vcpu->arch.st.cache, true, false);
+	kvm_unmap_gfn_protected(vcpu, &map, &vcpu->arch.st.cache, true, false, token);
 }
 
 int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
@@ -4271,6 +4331,7 @@  static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
 {
 	struct kvm_host_map map;
 	struct kvm_steal_time *st;
+	int token;
 
 	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
 		return;
@@ -4278,8 +4339,8 @@  static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
 	if (vcpu->arch.st.preempted)
 		return;
 
-	if (kvm_map_gfn(vcpu, vcpu->arch.st.msr_val >> PAGE_SHIFT, &map,
-			&vcpu->arch.st.cache, true))
+	if (kvm_map_gfn_protected(vcpu, vcpu->arch.st.msr_val >> PAGE_SHIFT,
+				  &map, &vcpu->arch.st.cache, true, &token))
 		return;
 
 	st = map.hva +
@@ -4287,7 +4348,7 @@  static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
 
 	st->preempted = vcpu->arch.st.preempted = KVM_VCPU_PREEMPTED;
 
-	kvm_unmap_gfn(vcpu, &map, &vcpu->arch.st.cache, true, true);
+	kvm_unmap_gfn_protected(vcpu, &map, &vcpu->arch.st.cache, true, true, token);
 }
 
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
@@ -6816,6 +6877,7 @@  static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
 	gpa_t gpa;
 	char *kaddr;
 	bool exchanged;
+	int token;
 
 	/* guests cmpxchg8b have to be emulated atomically */
 	if (bytes > 8 || (bytes & (bytes - 1)))
@@ -6839,7 +6901,7 @@  static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
 	if (((gpa + bytes - 1) & page_line_mask) != (gpa & page_line_mask))
 		goto emul_write;
 
-	if (kvm_vcpu_map(vcpu, gpa_to_gfn(gpa), &map))
+	if (kvm_vcpu_map_protected(vcpu, gpa_to_gfn(gpa), &map, &token))
 		goto emul_write;
 
 	kaddr = map.hva + offset_in_page(gpa);
@@ -6861,7 +6923,7 @@  static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
 		BUG();
 	}
 
-	kvm_vcpu_unmap(vcpu, &map, true);
+	kvm_vcpu_unmap_protected(vcpu, &map, true, token);
 
 	if (!exchanged)
 		return X86EMUL_CMPXCHG_FAILED;