
[09/12] KVM: guest_memfd: move check for already-populated page to common code

Message ID 20240711222755.57476-10-pbonzini@redhat.com (mailing list archive)
State New, archived
Series KVM: guest_memfd: lazy preparation of pages + prefault support for SEV-SNP

Commit Message

Paolo Bonzini July 11, 2024, 10:27 p.m. UTC
Do not allow populating the same page twice with startup data.  In the
case of SEV-SNP, for example, the firmware does not allow it anyway,
since the launch-update operation is only possible on pages that are
still shared in the RMP.

Even if it worked, kvm_gmem_populate()'s callback is meant to have side
effects such as updating launch measurements, and updating the same
page twice is unlikely to have the desired results.

Races between calls to the ioctl are not possible because kvm_gmem_populate()
holds slots_lock and the VM should not be running.  But again, even if
this worked on other confidential computing technology, it doesn't matter
to guest_memfd.c whether this is an intentional attempt to do something
fishy, or missing synchronization in userspace, or even something
unintentional.  One of the racers wins, and the page is initialized by
either kvm_gmem_prepare_folio() or kvm_gmem_populate().

Out of paranoia, also adjust sev_gmem_post_populate() to use the same
errno that kvm_gmem_populate() is using.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/svm/sev.c | 2 +-
 virt/kvm/guest_memfd.c | 7 +++++++
 2 files changed, 8 insertions(+), 1 deletion(-)

Comments

Edgecombe, Rick P July 13, 2024, 1:28 a.m. UTC | #1
On Thu, 2024-07-11 at 18:27 -0400, Paolo Bonzini wrote:
> Do not allow populating the same page twice with startup data.  In the
> case of SEV-SNP, for example, the firmware does not allow it anyway,
> since the launch-update operation is only possible on pages that are
> still shared in the RMP.
> 
> Even if it worked, kvm_gmem_populate()'s callback is meant to have side
> effects such as updating launch measurements, and updating the same
> page twice is unlikely to have the desired results.
> 
> Races between calls to the ioctl are not possible because kvm_gmem_populate()
> holds slots_lock and the VM should not be running.  But again, even if
> this worked on other confidential computing technology, it doesn't matter
> to guest_memfd.c whether this is an intentional attempt to do something
> fishy, or missing synchronization in userspace, or even something
> intentional.  One of the racers wins, and the page is initialized by
> either kvm_gmem_prepare_folio() or kvm_gmem_populate().
> 
> Anyway, out of paranoia, adjust sev_gmem_post_populate() anyway to use
> the same errno that kvm_gmem_populate() is using.

This patch breaks our rebased TDX development tree. First
kvm_gmem_prepare_folio() is called during the KVM_PRE_FAULT_MEMORY operation,
and then kvm_gmem_populate() is called during the KVM_TDX_INIT_MEM_REGION ioctl
to actually populate the memory, which hits the new -EEXIST error path.

Given we are not actually populating during KVM_PRE_FAULT_MEMORY, and we try to
avoid booting a TD until we've done so, maybe setting folio_mark_uptodate() in
kvm_gmem_prepare_folio() is not appropriate in that case? But it may not be easy
to separate.
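
For reference, the failing sequence looks roughly like this from userspace
(an illustrative sketch, not our actual code; fd setup and the TDX command
plumbing are elided, and region_gpa/region_size are placeholders):

/* Step 1: prefault the private range.  kvm_gmem_prepare_folio() runs
 * and marks the backing folios uptodate. */
struct kvm_pre_fault_memory pf = {
        .gpa  = region_gpa,
        .size = region_size,
};
ioctl(vcpu_fd, KVM_PRE_FAULT_MEMORY, &pf);

/* Step 2: KVM_TDX_INIT_MEM_REGION (issued through the TDX command
 * ioctl) reaches kvm_gmem_populate(), which now sees
 * folio_test_uptodate() and bails out with the new -EEXIST. */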
Paolo Bonzini July 13, 2024, 10:10 a.m. UTC | #2
On Sat, Jul 13, 2024 at 3:28 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
> On Thu, 2024-07-11 at 18:27 -0400, Paolo Bonzini wrote:
> > Do not allow populating the same page twice with startup data.  In the
> > case of SEV-SNP, for example, the firmware does not allow it anyway,
> > since the launch-update operation is only possible on pages that are
> > still shared in the RMP.
> >
> > Even if it worked, kvm_gmem_populate()'s callback is meant to have side
> > effects such as updating launch measurements, and updating the same
> > page twice is unlikely to have the desired results.
> >
> > Races between calls to the ioctl are not possible because kvm_gmem_populate()
> > holds slots_lock and the VM should not be running.  But again, even if
> > this worked on other confidential computing technology, it doesn't matter
> > to guest_memfd.c whether this is an intentional attempt to do something
> > fishy, or missing synchronization in userspace, or even something
> > intentional.  One of the racers wins, and the page is initialized by
> > either kvm_gmem_prepare_folio() or kvm_gmem_populate().
> >
> > Anyway, out of paranoia, adjust sev_gmem_post_populate() anyway to use
> > the same errno that kvm_gmem_populate() is using.
>
> This patch breaks our rebased TDX development tree. First
> kvm_gmem_prepare_folio() is called during the KVM_PRE_FAULT_MEMORY operation,
> then next kvm_gmem_populate() is called during the KVM_TDX_INIT_MEM_REGION ioctl
> to actually populate the memory, which hits the new -EEXIST error path.

It's not a problem to only keep patches 1-8 for 6.11, and move the
rest to 6.12 (except for the bit that returns -EEXIST in sev.c).

Could you push a branch for me to take a look? I've never liked that
you have to do the explicit prefault before the VM setup is finished;
it's a TDX-specific detail that is leaking into the API.

> Given we are not actually populating during KVM_PRE_FAULT_MEMORY and try to
> avoid booting a TD until we've done so, maybe setting folio_mark_uptodate() in
> kvm_gmem_prepare_folio() is not appropriate in that case? But it may not be easy
> to separate.

It would be easy (just return a boolean value from
kvm_arch_gmem_prepare() to skip folio_mark_uptodate() before the VM is
ready, and implement it for TDX) but it's ugly. You're also clearing
the memory unnecessarily before overwriting it.
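
Roughly like this, as a sketch only (the convention that a positive
return value means "prepared, but do not mark uptodate yet" is invented
here for illustration):

/* In kvm_gmem_prepare_folio(), sketch: */
r = kvm_arch_gmem_prepare(kvm, gfn, pfn, max_order);
if (r < 0)
        return r;
if (r == 0)
        folio_mark_uptodate(folio);
/* r > 0 (e.g. TDX before the TD is finalized): leave the folio
 * !uptodate so kvm_gmem_populate() can still initialize it later. */
return 0;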

Paolo
Edgecombe, Rick P July 13, 2024, 8:25 p.m. UTC | #3
On Sat, 2024-07-13 at 12:10 +0200, Paolo Bonzini wrote:
> > 
> > This patch breaks our rebased TDX development tree. First
> > kvm_gmem_prepare_folio() is called during the KVM_PRE_FAULT_MEMORY
> > operation,
> > then next kvm_gmem_populate() is called during the KVM_TDX_INIT_MEM_REGION
> > ioctl
> > to actually populate the memory, which hits the new -EEXIST error path.
> 
> It's not a problem to only keep patches 1-8 for 6.11, and move the
> rest to 6.12 (except for the bit that returns -EEXIST in sev.c).
> 
> Could you push a branch for me to take a look?

Sure, here it is.

KVM:
https://github.com/rpedgeco/linux/tree/tdx_kvm_dev-2024-07-12-mark_uptodate_issue
Matching QEMU:
https://github.com/intel-staging/qemu-tdx/releases/tag/tdx-qemu-wip-2024.06.19-v9.0.0

It is not fully based on kvm-coco-queue because it has the latest v2 of the
zapping quirk swapped in.

>  I've never liked that
> you have to do the explicit prefault before the VM setup is finished;
> it's a TDX-specific detail that is transpiring into the API.

Well, it's not too late to change direction again. I remember you and Sean were
not fully of one mind on the tradeoffs.

I guess this series is trying to help userspace not mess up the order of things
for SEV, whereas TDX's design was to let userspace hold the pieces from the
beginning. As in, needing to match up the KVM_PRE_FAULT_MEMORY and
KVM_TDX_INIT_MEM_REGION calls, mysteriously returning errors in later ioctls if
something was missed, etc.

Still, I might lean towards staying the course just because we have gone down
this path for a while and we don't currently have any fundamental issues.
Probably we *really* need to get the next TDX MMU stuff posted so we can start
to add a bit more certainty to that statement.

> 
> > Given we are not actually populating during KVM_PRE_FAULT_MEMORY and try to
> > avoid booting a TD until we've done so, maybe setting folio_mark_uptodate()
> > in
> > kvm_gmem_prepare_folio() is not appropriate in that case? But it may not be
> > easy
> > to separate.
> 
> It would be easy (just return a boolean value from
> kvm_arch_gmem_prepare() to skip folio_mark_uptodate() before the VM is
> ready, and implement it for TDX) but it's ugly. You're also clearing
> the memory unnecessarily before overwriting it.

Hmm, right. Since kvm_gmem_populate() does folio_mark_uptodate() again despite
testing for it earlier, we can skip folio_mark_uptodate() in
kvm_gmem_prepare_folio() for TDX during the pre-finalization stage and it will
get marked there.

I put a little POC of this suggestion at the end of the branch. Just revert it
to reproduce the issue.

I think in the context of the work to launch a TD, extra clearing of pages is
not too bad. I'm more bothered by how it highlights the general pitfalls of
TDX's special clever behavior for KVM_PRE_FAULT_MEMORY before TD initialization.

If/when we want to skip it, I wonder if we could move the clearing into the
gmem_prepare callbacks.
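
Something along these lines, maybe (purely illustrative; the hook
signature is the one from this series, but the zeroing policy is made
up):

/* Common code would stop calling clear_highpage() itself and leave
 * zeroing to the arch hook, which could skip it for pages that are
 * about to be overwritten with launch data anyway. */
int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order)
{
        int i;

        for (i = 0; i < (1 << max_order); i++)
                clear_highpage(pfn_to_page(pfn + i));

        return 0;
}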
Michael Roth July 14, 2024, 5:32 a.m. UTC | #4
On Sat, Jul 13, 2024 at 08:25:42PM +0000, Edgecombe, Rick P wrote:
> On Sat, 2024-07-13 at 12:10 +0200, Paolo Bonzini wrote:
> > > 
> > > This patch breaks our rebased TDX development tree. First
> > > kvm_gmem_prepare_folio() is called during the KVM_PRE_FAULT_MEMORY
> > > operation,
> > > then next kvm_gmem_populate() is called during the KVM_TDX_INIT_MEM_REGION
> > > ioctl
> > > to actually populate the memory, which hits the new -EEXIST error path.
> > 
> > It's not a problem to only keep patches 1-8 for 6.11, and move the
> > rest to 6.12 (except for the bit that returns -EEXIST in sev.c).
> > 
> > Could you push a branch for me to take a look?
> 
> Sure, here it is.
> 
> KVM:
> https://github.com/rpedgeco/linux/tree/tdx_kvm_dev-2024-07-12-mark_uptodate_issue
> Matching QEMU:
> https://github.com/intel-staging/qemu-tdx/releases/tag/tdx-qemu-wip-2024.06.19-v9.0.0
> 
> It is not fully based on kvm-coco-queue because it has the latest v2 of the
> zapping quirk swapped in.
> 
> >  I've never liked that
> > you have to do the explicit prefault before the VM setup is finished;
> > it's a TDX-specific detail that is transpiring into the API.
> 
> Well, it's not too late to change direction again. I remember you and Sean were
> not fully of one mind on the tradeoffs.
> 
> I guess this series is trying to help userspace not mess up the order of things
> for SEV, where as TDX's design was to let userspace hold the pieces from the
> beginning. As in, needing to match up the KVM_PRE_FAULT_MEMORY and
> KVM_TDX_INIT_MEM_REGION calls, mysteriously return errors in later IOCTLs if
> something was missed, etc.

If SNP were to try to call KVM_PRE_FAULT_MEMORY before SNP_LAUNCH_UPDATE
(rough equivalent to KVM_TDX_INIT_MEM_REGION), I think the same issue
would arise, and in that case the uptodate flag you prototyped wouldn't be
enough to address it: SNP_LAUNCH_UPDATE would end up failing because the
gmem_prepare hook previously triggered by
KVM_PRE_FAULT_MEMORY would have put the corresponding RMP entries into
an unexpected state (guest-owned/private).

So for SNP, KVM_PRE_FAULT_MEMORY/SNP_LAUNCH_UPDATE are mutually
exclusive on what GPA ranges they can prep before finalizing launch state.

*After* finalizing launch state however, KVM_PRE_FAULT_MEMORY can be
called for whatever range it likes. If gmem_prepare/gmem_populate was
already called for a GPA, the uptodate flag will be set and KVM only
needs to deal with the mapping.

So I wonder if it would be possible to enforce that KVM_PRE_FAULT_MEMORY
only be used after finalizing the VM in the CoCo case?

I realize that is awkward for TDX, where the KVM_PRE_FAULT_MEMORY is
required to create the sEPT mapping before encrypting, but maybe it
would be possible for TDX to just do that implicitly within
KVM_TDX_INIT_MEM_REGION?

That would free up KVM_PRE_FAULT_MEMORY to be called on any range
post-finalization, and all the edge cases prior to finalization could be
avoided if we have some way to enforce that finalization has already
been done.
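
If we went that way, the gate itself could be tiny, e.g. (a sketch;
'vm_finalized' is a hypothetical common flag that SNP_LAUNCH_FINISH and
the TDX finalize path would set):

long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
                                    struct kvm_pre_fault_memory *range)
{
        /* Hypothetical flag, not in current kvm/queue. */
        if (kvm_arch_has_private_mem(vcpu->kvm) &&
            !vcpu->kvm->arch.vm_finalized)
                return -EOPNOTSUPP;

        /* ... existing TDP prefault path ... */
}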

One thing I'm not sure of is if KVM_TDX_INIT_MEM_REGION for a 4K page could
maybe lead to a 2M sEPT mapping that overlaps with a GPA range passed to
KVM_PRE_FAULT_MEMORY, which I think could lead to unexpected 'left' return
values unless we can make sure to only map exactly the GPA ranges populated
by KVM_TDX_INIT_MEM_REGION and nothing more.

-Mike

> 
> Still, I might lean towards staying the course just because we have gone down
> this path for a while and we don't currently have any fundamental issues.
> Probably we *really* need to get the next TDX MMU stuff posted so we can start
> to add a bit more certainty to that statement.
> 
> > 
> > > Given we are not actually populating during KVM_PRE_FAULT_MEMORY and try to
> > > avoid booting a TD until we've done so, maybe setting folio_mark_uptodate()
> > > in
> > > kvm_gmem_prepare_folio() is not appropriate in that case? But it may not be
> > > easy
> > > to separate.
> > 
> > It would be easy (just return a boolean value from
> > kvm_arch_gmem_prepare() to skip folio_mark_uptodate() before the VM is
> > ready, and implement it for TDX) but it's ugly. You're also clearing
> > the memory unnecessarily before overwriting it.
> 
> Hmm, right. Since kvm_gmem_populate() does folio_mark_uptodate() again despite
> testing for it earlier, we can skip folio_mark_uptodate() in
> kvm_gmem_prepare_folio() for TDX during the pre-finalization stage and it will
> get marked there.
> 
> I put a little POC of this suggestion at the end of the branch. Just revert it
> to reproduce the issue.
> 
> I think in the context of the work to launch a TD, extra clearing of pages is
> not too bad. I'm more bothered by how it highlights the general pitfalls of
> TDX's special clever behavior for KVM_PRE_FAULT_MEMORY before TD initialization.
> 
> If/when we want to skip it, I wonder if we could move the clearing into the
> gmem_prepare callbacks.
Paolo Bonzini July 15, 2024, 4:08 p.m. UTC | #5
On Sun, Jul 14, 2024 at 7:33 AM Michael Roth <michael.roth@amd.com> wrote:
> > I guess this series is trying to help userspace not mess up the order of things
> > for SEV, where as TDX's design was to let userspace hold the pieces from the
> > beginning. As in, needing to match up the KVM_PRE_FAULT_MEMORY and
> > KVM_TDX_INIT_MEM_REGION calls, mysteriously return errors in later IOCTLs if
> > something was missed, etc.
>
> If SNP were to try to call KVM_PRE_FAULT_MEMORY before SNP_LAUNCH_UPDATE
> (rough equivalent to KVM_TDX_INIT_MEM_REGION), I think the same issue
> would arise, and in that case the uptodate flag you prototyped would
> wouldn't be enough to address it because SNP_LAUNCH_UPDATE would end up
> failing because the gmem_prepare hook previously triggered by
> KVM_PRE_FAULT_MEMORY would have put the corresponding RMP entries into
> an unexpected state (guest-owned/private).

Indeed, and I'd love for that to be the case for both TDX and SNP.

> So for SNP, KVM_PRE_FAULT_MEMORY/SNP_LAUNCH_UPDATE are mutually
> exclusive on what GPA ranges they can prep before finalizing launch state.

Not a problem; is KVM_PRE_FAULT_MEMORY before finalization the same as
zeroing memory?

> I realize that is awkward for TDX, where the KVM_PRE_FAULT_MEMORY is
> required to create the sEPT mapping before encrypting, but maybe it
> would be possible for TDX to just do that implicitly within
> KVM_TDX_INIT_MEM_REGION?

Yes, and it's what the TDX API used to be like a while ago.
Locking-wise, Rick confirmed offlist that there's no problem in
calling kvm_arch_vcpu_pre_fault_memory() from tdx_gmem_post_populate()
(my fault that it went offlist - email from the phone is hard...).

To be clear, I have no problem at all reusing the prefaulting code,
that's better than TDX having to do its own thing.  But forcing
userspace to do two passes is not great (it's already not great that
TDX_INIT_MEM_REGION has to be a vCPU operation, but that's unfortunately
unavoidable).

Paolo
Michael Roth July 15, 2024, 9:47 p.m. UTC | #6
On Mon, Jul 15, 2024 at 06:08:51PM +0200, Paolo Bonzini wrote:
> On Sun, Jul 14, 2024 at 7:33 AM Michael Roth <michael.roth@amd.com> wrote:
> > > I guess this series is trying to help userspace not mess up the order of things
> > > for SEV, where as TDX's design was to let userspace hold the pieces from the
> > > beginning. As in, needing to match up the KVM_PRE_FAULT_MEMORY and
> > > KVM_TDX_INIT_MEM_REGION calls, mysteriously return errors in later IOCTLs if
> > > something was missed, etc.
> >
> > If SNP were to try to call KVM_PRE_FAULT_MEMORY before SNP_LAUNCH_UPDATE
> > (rough equivalent to KVM_TDX_INIT_MEM_REGION), I think the same issue
> > would arise, and in that case the uptodate flag you prototyped would
> > wouldn't be enough to address it because SNP_LAUNCH_UPDATE would end up
> > failing because the gmem_prepare hook previously triggered by
> > KVM_PRE_FAULT_MEMORY would have put the corresponding RMP entries into
> > an unexpected state (guest-owned/private).
> 
> Indeed, and I'd love for that to be the case for both TDX and SNP.
> 
> > So for SNP, KVM_PRE_FAULT_MEMORY/SNP_LAUNCH_UPDATE are mutually
> > exclusive on what GPA ranges they can prep before finalizing launch state.
> 
> Not a problem; is KVM_PRE_FAULT_MEMORY before finalization the same as
> zeroing memory?

Sort of: you get a page that's zero'd by
gmem->kvm_gmem_prepare_folio()->clear_highpage(), and then that page is
mapped into the guest as a private page. But since no encrypted data was
written, the guest will effectively see ciphertext until it actually
initializes the page after accepting it.

But you can sort of treat that whole sequence as similar to calling
sev_gmem_post_populate() with page type KVM_SEV_SNP_PAGE_TYPE_ZERO (or
rather, a theoretical KVM_SEV_SNP_PAGE_TYPE_SCRAMBLE in this case), except
that the page won't be pre-validated and won't be measured into the launch
digest. Still, looked at that way, it is a bit clearer why pre-fault
shouldn't be performed on any ranges that will later be populated again
(unless the gmem_populate callback itself handles it and is aware of the
restrictions).

> 
> > I realize that is awkward for TDX, where the KVM_PRE_FAULT_MEMORY is
> > required to create the sEPT mapping before encrypting, but maybe it
> > would be possible for TDX to just do that implicitly within
> > KVM_TDX_INIT_MEM_REGION?
> 
> Yes, and it's what the TDX API used to be like a while ago.
> Locking-wise, Rick confirmed offlist that there's no problem in
> calling kvm_arch_vcpu_pre_fault_memory() from tdx_gmem_post_populate()
> (my fault that it went offlist - email from the phone is hard...).

That's promising; if we go that route it probably makes sense for SNP to
do the same, even though it's only an optimization in that case.

> 
> To be clear, I have no problem at all reusing the prefaulting code,
> that's better than TDX having to do its own thing.  But forcing
> userspace to do two passes is not great (it's already not great that
> it has to be TDX_INIT_MEM_REGION has to be a VCPU operation, but
> that's unfortunately unavoidable ).

Makes sense.

If we document mutual exclusion between ranges touched by
gmem_populate() vs ranges touched by actual userspace issuance of
KVM_PRE_FAULT_MEMORY (and make sure nothing too crazy happens if users
don't abide by the documentation), then I think most problems go away...

But there is still at least one awkward constraint for SNP:
KVM_PRE_FAULT_MEMORY cannot be called for private GPA ranges until after
SNP_LAUNCH_START is called. This is true even if the GPA range is not
one of the ranges that will get passed to
gmem_populate()/SNP_LAUNCH_UPDATE. The reason for this is that when
binding the ASID to the SNP context as part of SNP_LAUNCH_START, firmware
will perform checks to make sure that ASID is not already being used in
the RMP table, and that check will fail if KVM_PRE_FAULT_MEMORY triggered
for a private page before calling SNP_LAUNCH_START.

So we already have the constraint that KVM_PRE_FAULT_MEMORY can't be issued
before SNP_LAUNCH_START, which makes me wonder if we should just broaden
that constraint and for now disallow KVM_PRE_FAULT_MEMORY prior to
finalizing a guest, since it'll be easier to lift that restriction later
than to discover some other sort of edge case and have to retroactively
place restrictions.

I've taken Isaku's original pre_fault_memory_test and added a new
x86-specific coco_pre_fault_memory_test to try to better document and
exercise these corner cases for SEV and SNP, but I'm hoping it could
also be useful for TDX (hence the generic name). These are based on
Pratik's initial SNP selftests (which are in turn based on kvm/queue +
these patches):

  https://github.com/mdroth/linux/commits/snp-uptodate0-kst/
  https://github.com/mdroth/linux/commit/dd7d4b1983ceeb653132cfd54ad63f597db85f74

It could be interesting to also add some coverage interleaving of
fallocate()/FALLOC_FL_PUNCH_HOLE as well. I'll look into that more,
and more negative testing depending on the final semantics we end up
with.

Thanks,

Mike

> 
> Paolo
>
Edgecombe, Rick P July 15, 2024, 10:57 p.m. UTC | #7
On Mon, 2024-07-15 at 16:47 -0500, Michael Roth wrote:
> Makes sense.
> 
> If we document mutual exclusion between ranges touched by
> gmem_populate() vs ranges touched by actual userspace issuance of
> KVM_PRE_FAULT_MEMORY (and make sure nothing too crazy happens if users
> don't abide by the documentation), then I think most problems go away...
> 
> But there is still at least one awkward constraint for SNP:
> KVM_PRE_FAULT_MEMORY cannot be called for private GPA ranges until after
> SNP_LAUNCH_START is called. This is true even if the GPA range is not
> one of the ranges that will get passed to
> gmem_populate()/SNP_LAUNCH_UPDATE. The reason for this is that when
> binding the ASID to the SNP context as part of SNP_LAUNCH_START, firmware
> will perform checks to make sure that ASID is not already being used in
> the RMP table, and that check will fail if KVM_PRE_FAULT_MEMORY triggered
> for a private page before calling SNP_LAUNCH_START.
> 
> So we have this constraint that KVM_PRE_FAULT_MEMORY can't be issued
> before SNP_LAUNCH_START. So it makes me wonder if we should just broaden
> that constraint and for now just disallow KVM_PRE_FAULT_MEMORY prior to
> finalizing a guest, since it'll be easier to lift that restriction later
> versus discovering some other sort of edge case and need to
> retroactively place restrictions.
> 
> I've taken Isaku's original pre_fault_memory_test and added a new
> x86-specific coco_pre_fault_memory_test to try to better document and
> exercise these corner cases for SEV and SNP, but I'm hoping it could
> also be useful for TDX (hence the generic name). These are based on
> Pratik's initial SNP selftests (which are in turn based on kvm/queue +
> these patches):
> 
>   https://github.com/mdroth/linux/commits/snp-uptodate0-kst/
>  
> https://github.com/mdroth/linux/commit/dd7d4b1983ceeb653132cfd54ad63f597db85f74
> 
> 

From the TDX side it wouldn't be horrible to not have to worry about userspace
mucking around with the mirrored page tables in unexpected ways during the
special period. TDX already has its own "finalized" state in kvm_tdx; is there
something similar on the SEV side we could unify with?

I looked at moving from kvm_arch_vcpu_pre_fault_memory() to directly calling
kvm_tdp_map_page(), so we could potentially put whatever check in
kvm_arch_vcpu_pre_fault_memory(). It required a couple exports:

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 03737f3aaeeb..9004ac597a85 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -277,6 +277,7 @@ extern bool tdp_mmu_enabled;
 #endif
 
 int kvm_tdp_mmu_get_walk_mirror_pfn(struct kvm_vcpu *vcpu, u64 gpa, kvm_pfn_t *pfn);
+int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
 
 static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
 {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7bb6b17b455f..4a3e471ec9fe 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4721,8 +4721,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
        return direct_page_fault(vcpu, fault);
 }
 
-static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
-                           u8 *level)
+int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
 {
        int r;
 
@@ -4759,6 +4758,7 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
                return -EIO;
        }
 }
+EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
 
 long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
                                    struct kvm_pre_fault_memory *range)
@@ -5770,6 +5770,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
 out:
        return r;
 }
+EXPORT_SYMBOL_GPL(kvm_mmu_load);
 
 void kvm_mmu_unload(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 9ac0821eb44b..7161ef68f3da 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2809,11 +2809,13 @@ struct tdx_gmem_post_populate_arg {
 static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
                                  void __user *src, int order, void *_arg)
 {
+       u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS;
        struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
        struct tdx_gmem_post_populate_arg *arg = _arg;
        struct kvm_vcpu *vcpu = arg->vcpu;
        struct kvm_memory_slot *slot;
        gpa_t gpa = gfn_to_gpa(gfn);
+       u8 level = PG_LEVEL_4K;
        struct page *page;
        kvm_pfn_t mmu_pfn;
        int ret, i;
@@ -2832,6 +2834,10 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
                goto out_put_page;
        }
 
+       ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
+       if (ret < 0)
+               goto out_put_page;
+
        read_lock(&kvm->mmu_lock);
 
        ret = kvm_tdp_mmu_get_walk_mirror_pfn(vcpu, gpa, &mmu_pfn);
@@ -2910,6 +2916,7 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
        mutex_lock(&kvm->slots_lock);
        idx = srcu_read_lock(&kvm->srcu);
 
+       kvm_mmu_reload(vcpu);
        ret = 0;
        while (region.nr_pages) {
                if (signal_pending(current)) {
Michael Roth July 16, 2024, 12:13 a.m. UTC | #8
On Mon, Jul 15, 2024 at 10:57:35PM +0000, Edgecombe, Rick P wrote:
> On Mon, 2024-07-15 at 16:47 -0500, Michael Roth wrote:
> > Makes sense.
> > 
> > If we document mutual exclusion between ranges touched by
> > gmem_populate() vs ranges touched by actual userspace issuance of
> > KVM_PRE_FAULT_MEMORY (and make sure nothing too crazy happens if users
> > don't abide by the documentation), then I think most problems go away...
> > 
> > But there is still at least one awkward constraint for SNP:
> > KVM_PRE_FAULT_MEMORY cannot be called for private GPA ranges until after
> > SNP_LAUNCH_START is called. This is true even if the GPA range is not
> > one of the ranges that will get passed to
> > gmem_populate()/SNP_LAUNCH_UPDATE. The reason for this is that when
> > binding the ASID to the SNP context as part of SNP_LAUNCH_START, firmware
> > will perform checks to make sure that ASID is not already being used in
> > the RMP table, and that check will fail if KVM_PRE_FAULT_MEMORY triggered
> > for a private page before calling SNP_LAUNCH_START.
> > 
> > So we have this constraint that KVM_PRE_FAULT_MEMORY can't be issued
> > before SNP_LAUNCH_START. So it makes me wonder if we should just broaden
> > that constraint and for now just disallow KVM_PRE_FAULT_MEMORY prior to
> > finalizing a guest, since it'll be easier to lift that restriction later
> > versus discovering some other sort of edge case and need to
> > retroactively place restrictions.
> > 
> > I've taken Isaku's original pre_fault_memory_test and added a new
> > x86-specific coco_pre_fault_memory_test to try to better document and
> > exercise these corner cases for SEV and SNP, but I'm hoping it could
> > also be useful for TDX (hence the generic name). These are based on
> > Pratik's initial SNP selftests (which are in turn based on kvm/queue +
> > these patches):
> > 
> >   https://github.com/mdroth/linux/commits/snp-uptodate0-kst/
> >  
> > https://github.com/mdroth/linux/commit/dd7d4b1983ceeb653132cfd54ad63f597db85f74
> > 
> > 
> 
> From the TDX side it wouldn't be horrible to not have to worry about userspace
> mucking around with the mirrored page tables in unexpected ways during the
> special period. TDX already has its own "finalized" state in kvm_tdx, is there
> something similar on the SEV side we could unify with?

Unfortunately there isn't currently anything in place like that for SNP,
but if we had a common 'finalized' field somewhere it could easily be set in
snp_launch_finish() as well.

> 
> I looked at moving from kvm_arch_vcpu_pre_fault_memory() to directly calling
> kvm_tdp_map_page(), so we could potentially put whatever check in
> kvm_arch_vcpu_pre_fault_memory(). It required a couple exports:

Thanks. I'll give this a spin for SNP as well.

-Mike

> 
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 03737f3aaeeb..9004ac597a85 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -277,6 +277,7 @@ extern bool tdp_mmu_enabled;
>  #endif
>  
>  int kvm_tdp_mmu_get_walk_mirror_pfn(struct kvm_vcpu *vcpu, u64 gpa, kvm_pfn_t *pfn);
> +int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
>  
>  static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
>  {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 7bb6b17b455f..4a3e471ec9fe 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4721,8 +4721,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>         return direct_page_fault(vcpu, fault);
>  }
>  
> -static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> -                           u8 *level)
> +int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
>  {
>         int r;
>  
> @@ -4759,6 +4758,7 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
>                 return -EIO;
>         }
>  }
> +EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
>  
>  long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
>                                     struct kvm_pre_fault_memory *range)
> @@ -5770,6 +5770,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
>  out:
>         return r;
>  }
> +EXPORT_SYMBOL_GPL(kvm_mmu_load);
>  
>  void kvm_mmu_unload(struct kvm_vcpu *vcpu)
>  {
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 9ac0821eb44b..7161ef68f3da 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -2809,11 +2809,13 @@ struct tdx_gmem_post_populate_arg {
>  static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>                                   void __user *src, int order, void *_arg)
>  {
> +       u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS;
>         struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>         struct tdx_gmem_post_populate_arg *arg = _arg;
>         struct kvm_vcpu *vcpu = arg->vcpu;
>         struct kvm_memory_slot *slot;
>         gpa_t gpa = gfn_to_gpa(gfn);
> +       u8 level = PG_LEVEL_4K;
>         struct page *page;
>         kvm_pfn_t mmu_pfn;
>         int ret, i;
> @@ -2832,6 +2834,10 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>                 goto out_put_page;
>         }
>  
> +       ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
> +       if (ret < 0)
> +               goto out_put_page;
> +
>         read_lock(&kvm->mmu_lock);
>  
>         ret = kvm_tdp_mmu_get_walk_mirror_pfn(vcpu, gpa, &mmu_pfn);
> @@ -2910,6 +2916,7 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
>         mutex_lock(&kvm->slots_lock);
>         idx = srcu_read_lock(&kvm->srcu);
>  
> +       kvm_mmu_reload(vcpu);
>         ret = 0;
>         while (region.nr_pages) {
>                 if (signal_pending(current)) {
> 
Michael Roth July 17, 2024, 6:42 a.m. UTC | #9
On Mon, Jul 15, 2024 at 10:57:35PM +0000, Edgecombe, Rick P wrote:
> On Mon, 2024-07-15 at 16:47 -0500, Michael Roth wrote:
> > Makes sense.
> > 
> > If we document mutual exclusion between ranges touched by
> > gmem_populate() vs ranges touched by actual userspace issuance of
> > KVM_PRE_FAULT_MEMORY (and make sure nothing too crazy happens if users
> > don't abide by the documentation), then I think most problems go away...
> > 
> > But there is still at least one awkward constraint for SNP:
> > KVM_PRE_FAULT_MEMORY cannot be called for private GPA ranges until after
> > SNP_LAUNCH_START is called. This is true even if the GPA range is not
> > one of the ranges that will get passed to
> > gmem_populate()/SNP_LAUNCH_UPDATE. The reason for this is that when
> > binding the ASID to the SNP context as part of SNP_LAUNCH_START, firmware
> > will perform checks to make sure that ASID is not already being used in
> > the RMP table, and that check will fail if KVM_PRE_FAULT_MEMORY triggered
> > for a private page before calling SNP_LAUNCH_START.
> > 
> > So we have this constraint that KVM_PRE_FAULT_MEMORY can't be issued
> > before SNP_LAUNCH_START. So it makes me wonder if we should just broaden
> > that constraint and for now just disallow KVM_PRE_FAULT_MEMORY prior to
> > finalizing a guest, since it'll be easier to lift that restriction later
> > versus discovering some other sort of edge case and need to
> > retroactively place restrictions.
> > 
> > I've taken Isaku's original pre_fault_memory_test and added a new
> > x86-specific coco_pre_fault_memory_test to try to better document and
> > exercise these corner cases for SEV and SNP, but I'm hoping it could
> > also be useful for TDX (hence the generic name). These are based on
> > Pratik's initial SNP selftests (which are in turn based on kvm/queue +
> > these patches):
> > 
> >   https://github.com/mdroth/linux/commits/snp-uptodate0-kst/
> >  
> > https://github.com/mdroth/linux/commit/dd7d4b1983ceeb653132cfd54ad63f597db85f74
> > 
> > 
> 
> From the TDX side it wouldn't be horrible to not have to worry about userspace
> mucking around with the mirrored page tables in unexpected ways during the
> special period. TDX already has its own "finalized" state in kvm_tdx, is there
> something similar on the SEV side we could unify with?

While trying to do pre-mapping in sev_gmem_post_populate() like you've
done below for TDX, I hit another issue that I think would be avoided by
enforcing finalization (or otherwise needs some alternative fix within
this series).

I'd already mentioned the issue with gmem_prepare() putting pages in an
unexpected RMP state if called prior to initializing the same GPA range
in gmem_populate(). This gets handled gracefully, however, when the
firmware call is issued to encrypt/measure the pages and it re-checks
the page's RMP state.

However, if another thread is issuing KVM_PRE_FAULT_MEMORY and triggers
a gmem_prepare() just after gmem_populate() places the pages in a
private RMP state, then gmem_prepare()->kvm_gmem_get_pfn() will trigger
the clear_highpage() call in kvm_gmem_prepare_folio() because the
uptodate flag isn't set until after sev_gmem_post_populate() returns.
So I ended up just calling kvm_arch_vcpu_pre_fault_memory() on the full
GPA range after the post-populate callback returns:

  https://github.com/mdroth/linux/commit/da4e7465ced1a708ff1c5e9ab27c570de7e8974e

...

But a much larger concern is that even without this series, on just
plain kvm/queue, KVM_PRE_FAULT_MEMORY can still race with
sev_gmem_post_populate() in bad ways, so I think for 6.11 we should
consider locking this finalization requirement in and applying
something like the below fix:

  https://github.com/mdroth/linux/commit/7095ba6ee8f050af11620daaf6c2219fd0bbb1c3

Or if that's potentially too big a change to decide right now, we could
just make KVM_PRE_FAULT_MEMORY return EOPNOTSUPP for
kvm_arch_has_private_memory() until we've had more time to work out the
missing bits for CoCo there.
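
That stop-gap would be a one-liner at the top of
kvm_arch_vcpu_pre_fault_memory(), something like (sketch only):

        if (kvm_arch_has_private_mem(vcpu->kvm))
                return -EOPNOTSUPP;     /* revisit once CoCo semantics settle */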

-Mike

> 
> I looked at moving from kvm_arch_vcpu_pre_fault_memory() to directly calling
> kvm_tdp_map_page(), so we could potentially put whatever check in
> kvm_arch_vcpu_pre_fault_memory(). It required a couple exports:
> 
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 03737f3aaeeb..9004ac597a85 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -277,6 +277,7 @@ extern bool tdp_mmu_enabled;
>  #endif
>  
>  int kvm_tdp_mmu_get_walk_mirror_pfn(struct kvm_vcpu *vcpu, u64 gpa, kvm_pfn_t *pfn);
> +int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
>  
>  static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
>  {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 7bb6b17b455f..4a3e471ec9fe 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4721,8 +4721,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>         return direct_page_fault(vcpu, fault);
>  }
>  
> -static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> -                           u8 *level)
> +int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
>  {
>         int r;
>  
> @@ -4759,6 +4758,7 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
>                 return -EIO;
>         }
>  }
> +EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
>  
>  long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
>                                     struct kvm_pre_fault_memory *range)
> @@ -5770,6 +5770,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
>  out:
>         return r;
>  }
> +EXPORT_SYMBOL_GPL(kvm_mmu_load);
>  
>  void kvm_mmu_unload(struct kvm_vcpu *vcpu)
>  {
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 9ac0821eb44b..7161ef68f3da 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -2809,11 +2809,13 @@ struct tdx_gmem_post_populate_arg {
>  static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>                                   void __user *src, int order, void *_arg)
>  {
> +       u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS;
>         struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>         struct tdx_gmem_post_populate_arg *arg = _arg;
>         struct kvm_vcpu *vcpu = arg->vcpu;
>         struct kvm_memory_slot *slot;
>         gpa_t gpa = gfn_to_gpa(gfn);
> +       u8 level = PG_LEVEL_4K;
>         struct page *page;
>         kvm_pfn_t mmu_pfn;
>         int ret, i;
> @@ -2832,6 +2834,10 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>                 goto out_put_page;
>         }
>  
> +       ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
> +       if (ret < 0)
> +               goto out_put_page;
> +
>         read_lock(&kvm->mmu_lock);
>  
>         ret = kvm_tdp_mmu_get_walk_mirror_pfn(vcpu, gpa, &mmu_pfn);
> @@ -2910,6 +2916,7 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
>         mutex_lock(&kvm->slots_lock);
>         idx = srcu_read_lock(&kvm->srcu);
>  
> +       kvm_mmu_reload(vcpu);
>         ret = 0;
>         while (region.nr_pages) {
>                 if (signal_pending(current)) {
>
Michael Roth July 17, 2024, 9:53 p.m. UTC | #10
On Thu, Jul 11, 2024 at 06:27:52PM -0400, Paolo Bonzini wrote:
> Do not allow populating the same page twice with startup data.  In the
> case of SEV-SNP, for example, the firmware does not allow it anyway,
> since the launch-update operation is only possible on pages that are
> still shared in the RMP.
> 
> Even if it worked, kvm_gmem_populate()'s callback is meant to have side
> effects such as updating launch measurements, and updating the same
> page twice is unlikely to have the desired results.
> 
> Races between calls to the ioctl are not possible because kvm_gmem_populate()
> holds slots_lock and the VM should not be running.  But again, even if
> this worked on other confidential computing technology, it doesn't matter
> to guest_memfd.c whether this is an intentional attempt to do something
> fishy, or missing synchronization in userspace, or even something
> intentional.  One of the racers wins, and the page is initialized by

I think one of those "intentional"s was meant to be an "unintentional".

> either kvm_gmem_prepare_folio() or kvm_gmem_populate().
> 
> Anyway, out of paranoia, adjust sev_gmem_post_populate() anyway to use
> the same errno that kvm_gmem_populate() is using.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Assuming based on the discussion here that there will be some logic added
to enforce that KVM_PRE_FAULT_MEMORY can only happen after finalization, I
think the new checks make sense.

Reviewed-by: Michael Roth <michael.roth@amd.com>

> ---
>  arch/x86/kvm/svm/sev.c | 2 +-
>  virt/kvm/guest_memfd.c | 7 +++++++
>  2 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index df8818759698..397ef9e70182 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -2213,7 +2213,7 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf
>  		if (ret || assigned) {
>  			pr_debug("%s: Failed to ensure GFN 0x%llx RMP entry is initial shared state, ret: %d assigned: %d\n",
>  				 __func__, gfn, ret, assigned);
> -			ret = -EINVAL;
> +			ret = ret ? -EINVAL : -EEXIST;
>  			goto err;
>  		}
>  
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 509360eefea5..266810bb91c9 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -650,6 +650,13 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
>  			break;
>  		}
>  
> +		if (folio_test_uptodate(folio)) {
> +			folio_unlock(folio);
> +			folio_put(folio);
> +			ret = -EEXIST;
> +			break;
> +		}
> +
>  		folio_unlock(folio);
>  		if (!IS_ALIGNED(gfn, (1 << max_order)) ||
>  		    (npages - i) < (1 << max_order))
> -- 
> 2.43.0
> 
> 
>
Patch

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index df8818759698..397ef9e70182 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2213,7 +2213,7 @@  static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf
 		if (ret || assigned) {
 			pr_debug("%s: Failed to ensure GFN 0x%llx RMP entry is initial shared state, ret: %d assigned: %d\n",
 				 __func__, gfn, ret, assigned);
-			ret = -EINVAL;
+			ret = ret ? -EINVAL : -EEXIST;
 			goto err;
 		}
 
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 509360eefea5..266810bb91c9 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -650,6 +650,13 @@  long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
 			break;
 		}
 
+		if (folio_test_uptodate(folio)) {
+			folio_unlock(folio);
+			folio_put(folio);
+			ret = -EEXIST;
+			break;
+		}
+
 		folio_unlock(folio);
 		if (!IS_ALIGNED(gfn, (1 << max_order)) ||
 		    (npages - i) < (1 << max_order))