Message ID | 20240711222755.57476-10-pbonzini@redhat.com (mailing list archive)
---|---
State | New, archived
Series | KVM: guest_memfd: lazy preparation of pages + prefault support for SEV-SNP
On Thu, 2024-07-11 at 18:27 -0400, Paolo Bonzini wrote:
> Do not allow populating the same page twice with startup data. In the
> case of SEV-SNP, for example, the firmware does not allow it anyway,
> since the launch-update operation is only possible on pages that are
> still shared in the RMP.
>
> Even if it worked, kvm_gmem_populate()'s callback is meant to have side
> effects such as updating launch measurements, and updating the same
> page twice is unlikely to have the desired results.
>
> Races between calls to the ioctl are not possible because kvm_gmem_populate()
> holds slots_lock and the VM should not be running. But again, even if
> this worked on other confidential computing technology, it doesn't matter
> to guest_memfd.c whether this is an intentional attempt to do something
> fishy, or missing synchronization in userspace, or even something
> intentional. One of the racers wins, and the page is initialized by
> either kvm_gmem_prepare_folio() or kvm_gmem_populate().
>
> Anyway, out of paranoia, adjust sev_gmem_post_populate() anyway to use
> the same errno that kvm_gmem_populate() is using.

This patch breaks our rebased TDX development tree. First
kvm_gmem_prepare_folio() is called during the KVM_PRE_FAULT_MEMORY
operation, then kvm_gmem_populate() is called during the
KVM_TDX_INIT_MEM_REGION ioctl to actually populate the memory, which hits
the new -EEXIST error path.

Given we are not actually populating during KVM_PRE_FAULT_MEMORY and try
to avoid booting a TD until we've done so, maybe setting
folio_mark_uptodate() in kvm_gmem_prepare_folio() is not appropriate in
that case? But it may not be easy to separate.
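For readers following along, the failing ordering Rick describes boils down to the sketch below. struct kvm_pre_fault_memory and KVM_PRE_FAULT_MEMORY are the real uAPI; td_init_mem_region() stands in for the out-of-tree KVM_TDX_INIT_MEM_REGION plumbing and is hypothetical:

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Hypothetical helper wrapping KVM_TDX_INIT_MEM_REGION (out-of-tree ABI). */
extern int td_init_mem_region(int vcpu_fd, __u64 gpa, __u64 size);

static int load_td_image(int vcpu_fd, __u64 gpa, __u64 size)
{
	struct kvm_pre_fault_memory range = { .gpa = gpa, .size = size };

	/* Step 1: KVM_PRE_FAULT_MEMORY -> kvm_gmem_prepare_folio() runs
	 * and marks the guest_memfd folios up-to-date. */
	if (ioctl(vcpu_fd, KVM_PRE_FAULT_MEMORY, &range) < 0)
		return -1;

	/* Step 2: KVM_TDX_INIT_MEM_REGION -> kvm_gmem_populate() now sees
	 * folio_test_uptodate() already set and fails with EEXIST. */
	return td_init_mem_region(vcpu_fd, gpa, size);
}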
On Sat, Jul 13, 2024 at 3:28 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
> On Thu, 2024-07-11 at 18:27 -0400, Paolo Bonzini wrote:
> > [...]
>
> This patch breaks our rebased TDX development tree. First
> kvm_gmem_prepare_folio() is called during the KVM_PRE_FAULT_MEMORY
> operation, then kvm_gmem_populate() is called during the
> KVM_TDX_INIT_MEM_REGION ioctl to actually populate the memory, which hits
> the new -EEXIST error path.

It's not a problem to only keep patches 1-8 for 6.11, and move the
rest to 6.12 (except for the bit that returns -EEXIST in sev.c).

Could you push a branch for me to take a look? I've never liked that
you have to do the explicit prefault before the VM setup is finished;
it's a TDX-specific detail that is leaking into the API.

> Given we are not actually populating during KVM_PRE_FAULT_MEMORY and try
> to avoid booting a TD until we've done so, maybe setting
> folio_mark_uptodate() in kvm_gmem_prepare_folio() is not appropriate in
> that case? But it may not be easy to separate.

It would be easy (just return a boolean value from
kvm_arch_gmem_prepare() to skip folio_mark_uptodate() before the VM is
ready, and implement it for TDX) but it's ugly. You're also clearing
the memory unnecessarily before overwriting it.

Paolo
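Paolo's "return a boolean" idea, sketched for illustration only — the r == 1 "prepared, but defer folio_mark_uptodate()" convention below is an assumption, not code from the series:

/*
 * Sketch: let kvm_arch_gmem_prepare() return 1 to mean "prepared, but do
 * not mark the folio up-to-date yet" (e.g. TDX before the TD is
 * finalized); kvm_gmem_populate() would then set the flag itself once
 * the real contents are written.
 */
static int kvm_gmem_prepare_folio(struct kvm *kvm, gfn_t gfn,
				  struct folio *folio)
{
	int r;

	/* The clear Paolo calls unnecessary when the page is about to be
	 * overwritten by the populate path anyway. */
	folio_zero_range(folio, 0, folio_size(folio));

	r = kvm_arch_gmem_prepare(kvm, gfn, folio_pfn(folio),
				  folio_order(folio));
	if (r < 0)
		return r;

	/* r == 1: defer; the populate path marks the folio up-to-date. */
	if (r == 0)
		folio_mark_uptodate(folio);

	return 0;
}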
On Sat, 2024-07-13 at 12:10 +0200, Paolo Bonzini wrote:
> >
> > This patch breaks our rebased TDX development tree. First
> > kvm_gmem_prepare_folio() is called during the KVM_PRE_FAULT_MEMORY
> > operation, then kvm_gmem_populate() is called during the
> > KVM_TDX_INIT_MEM_REGION ioctl to actually populate the memory, which
> > hits the new -EEXIST error path.
>
> It's not a problem to only keep patches 1-8 for 6.11, and move the
> rest to 6.12 (except for the bit that returns -EEXIST in sev.c).
>
> Could you push a branch for me to take a look?

Sure, here it is.

KVM:
https://github.com/rpedgeco/linux/tree/tdx_kvm_dev-2024-07-12-mark_uptodate_issue
Matching QEMU:
https://github.com/intel-staging/qemu-tdx/releases/tag/tdx-qemu-wip-2024.06.19-v9.0.0

It is not fully based on kvm-coco-queue because it has the latest v2 of
the zapping quirk swapped in.

> I've never liked that
> you have to do the explicit prefault before the VM setup is finished;
> it's a TDX-specific detail that is leaking into the API.

Well, it's not too late to change direction again. I remember you and
Sean were not fully of one mind on the tradeoffs.

I guess this series is trying to help userspace not mess up the order of
things for SEV, whereas TDX's design was to let userspace hold the pieces
from the beginning. As in, needing to match up the KVM_PRE_FAULT_MEMORY
and KVM_TDX_INIT_MEM_REGION calls, mysteriously returning errors in later
IOCTLs if something was missed, etc.

Still, I might lean towards staying the course just because we have gone
down this path for a while and we don't currently have any fundamental
issues. Probably we *really* need to get the next TDX MMU stuff posted so
we can start to add a bit more certainty to that statement.

> > Given we are not actually populating during KVM_PRE_FAULT_MEMORY and
> > try to avoid booting a TD until we've done so, maybe setting
> > folio_mark_uptodate() in kvm_gmem_prepare_folio() is not appropriate in
> > that case? But it may not be easy to separate.
>
> It would be easy (just return a boolean value from
> kvm_arch_gmem_prepare() to skip folio_mark_uptodate() before the VM is
> ready, and implement it for TDX) but it's ugly. You're also clearing
> the memory unnecessarily before overwriting it.

Hmm, right. Since kvm_gmem_populate() does folio_mark_uptodate() again
despite testing for it earlier, we can skip folio_mark_uptodate() in
kvm_gmem_prepare_folio() for TDX during the pre-finalization stage and it
will get marked there.

I put a little POC of this suggestion at the end of the branch. Just
revert it to reproduce the issue.

I think in the context of the work to launch a TD, extra clearing of
pages is not too bad. I'm more bothered by how it highlights the general
pitfalls of TDX's special clever behavior for KVM_PRE_FAULT_MEMORY before
TD initialization.

If/when we want to skip it, I wonder if we could move the clearing into
the gmem_prepare callbacks.
On Sat, Jul 13, 2024 at 08:25:42PM +0000, Edgecombe, Rick P wrote:
> On Sat, 2024-07-13 at 12:10 +0200, Paolo Bonzini wrote:
> [...]
> > I've never liked that
> > you have to do the explicit prefault before the VM setup is finished;
> > it's a TDX-specific detail that is leaking into the API.
>
> Well, it's not too late to change direction again. I remember you and
> Sean were not fully of one mind on the tradeoffs.
>
> I guess this series is trying to help userspace not mess up the order of
> things for SEV, whereas TDX's design was to let userspace hold the pieces
> from the beginning. As in, needing to match up the KVM_PRE_FAULT_MEMORY
> and KVM_TDX_INIT_MEM_REGION calls, mysteriously returning errors in later
> IOCTLs if something was missed, etc.

If SNP were to try to call KVM_PRE_FAULT_MEMORY before SNP_LAUNCH_UPDATE
(rough equivalent to KVM_TDX_INIT_MEM_REGION), I think the same issue
would arise, and in that case the uptodate flag you prototyped wouldn't
be enough to address it because SNP_LAUNCH_UPDATE would end up failing
because the gmem_prepare hook previously triggered by
KVM_PRE_FAULT_MEMORY would have put the corresponding RMP entries into an
unexpected state (guest-owned/private).

So for SNP, KVM_PRE_FAULT_MEMORY/SNP_LAUNCH_UPDATE are mutually exclusive
on what GPA ranges they can prep before finalizing launch state. *After*
finalizing launch state however, KVM_PRE_FAULT_MEMORY can be called for
whatever range it likes. If gmem_prepare/gmem_populate was already called
for a GPA, the uptodate flag will be set and KVM only needs to deal with
the mapping.

So I wonder if it would be possible to enforce that KVM_PRE_FAULT_MEMORY
only be used after finalizing the VM in the CoCo case? I realize that is
awkward for TDX, where KVM_PRE_FAULT_MEMORY is required to create the
sEPT mapping before encrypting, but maybe it would be possible for TDX to
just do that implicitly within KVM_TDX_INIT_MEM_REGION? That would free
up KVM_PRE_FAULT_MEMORY to be called on any range post-finalization, and
all the edge cases prior to finalization could be avoided if we have some
way to enforce that finalization has already been done.

One thing I'm not sure of is if KVM_TDX_INIT_MEM_REGION for a 4K page
could maybe lead to a 2M sEPT mapping that overlaps with a GPA range
passed to KVM_PRE_FAULT_MEMORY, which I think could lead to unexpected
'left' return values unless we can make sure to only map exactly the GPA
ranges populated by KVM_TDX_INIT_MEM_REGION and nothing more.

-Mike

> [...]
On Sun, Jul 14, 2024 at 7:33 AM Michael Roth <michael.roth@amd.com> wrote:
> > I guess this series is trying to help userspace not mess up the order of
> > things for SEV, whereas TDX's design was to let userspace hold the pieces
> > from the beginning. As in, needing to match up the KVM_PRE_FAULT_MEMORY
> > and KVM_TDX_INIT_MEM_REGION calls, mysteriously returning errors in later
> > IOCTLs if something was missed, etc.
>
> If SNP were to try to call KVM_PRE_FAULT_MEMORY before SNP_LAUNCH_UPDATE
> (rough equivalent to KVM_TDX_INIT_MEM_REGION), I think the same issue
> would arise, and in that case the uptodate flag you prototyped wouldn't
> be enough to address it because SNP_LAUNCH_UPDATE would end up failing
> because the gmem_prepare hook previously triggered by
> KVM_PRE_FAULT_MEMORY would have put the corresponding RMP entries into an
> unexpected state (guest-owned/private).

Indeed, and I'd love for that to be the case for both TDX and SNP.

> So for SNP, KVM_PRE_FAULT_MEMORY/SNP_LAUNCH_UPDATE are mutually exclusive
> on what GPA ranges they can prep before finalizing launch state.

Not a problem; is KVM_PRE_FAULT_MEMORY before finalization the same as
zeroing memory?

> I realize that is
> awkward for TDX, where KVM_PRE_FAULT_MEMORY is required to create the
> sEPT mapping before encrypting, but maybe it would be possible for TDX to
> just do that implicitly within KVM_TDX_INIT_MEM_REGION?

Yes, and it's what the TDX API used to be like a while ago.
Locking-wise, Rick confirmed offlist that there's no problem in
calling kvm_arch_vcpu_pre_fault_memory() from tdx_gmem_post_populate()
(my fault that it went offlist - email from the phone is hard...).

To be clear, I have no problem at all reusing the prefaulting code;
that's better than TDX having to do its own thing. But forcing
userspace to do two passes is not great (it's already not great that
TDX_INIT_MEM_REGION has to be a VCPU operation, but that's
unfortunately unavoidable).

Paolo
On Mon, Jul 15, 2024 at 06:08:51PM +0200, Paolo Bonzini wrote:
> On Sun, Jul 14, 2024 at 7:33 AM Michael Roth <michael.roth@amd.com> wrote:
> > > I guess this series is trying to help userspace not mess up the order of
> > > things for SEV, whereas TDX's design was to let userspace hold the pieces
> > > from the beginning. As in, needing to match up the KVM_PRE_FAULT_MEMORY
> > > and KVM_TDX_INIT_MEM_REGION calls, mysteriously returning errors in later
> > > IOCTLs if something was missed, etc.
> >
> > If SNP were to try to call KVM_PRE_FAULT_MEMORY before SNP_LAUNCH_UPDATE
> > (rough equivalent to KVM_TDX_INIT_MEM_REGION), I think the same issue
> > would arise, and in that case the uptodate flag you prototyped wouldn't
> > be enough to address it because SNP_LAUNCH_UPDATE would end up failing
> > because the gmem_prepare hook previously triggered by
> > KVM_PRE_FAULT_MEMORY would have put the corresponding RMP entries into an
> > unexpected state (guest-owned/private).
>
> Indeed, and I'd love for that to be the case for both TDX and SNP.
>
> > So for SNP, KVM_PRE_FAULT_MEMORY/SNP_LAUNCH_UPDATE are mutually exclusive
> > on what GPA ranges they can prep before finalizing launch state.
>
> Not a problem; is KVM_PRE_FAULT_MEMORY before finalization the same as
> zeroing memory?

Sort of: you get a page that's zero'd by
gmem->kvm_gmem_prepare_folio()->clear_highpage(), and then that page is
mapped in the guest as a private page. But since no encrypted data was
written, the guest will effectively see ciphertext until it actually
initializes the page after accepting it.

But you can sort of treat that whole sequence as being similar to calling
sev_gmem_post_populate() with page type KVM_SEV_SNP_PAGE_TYPE_ZERO (or
rather, a theoretical KVM_SEV_SNP_PAGE_TYPE_SCRAMBLE in this case), just
that in that case the page won't be pre-validated and it won't be
measured into the launch digest. But still, if you look at it that way,
it's a bit clearer why pre-fault shouldn't be performed on any ranges
that will later be populated again (unless the gmem_populate callback
itself handles it and so is aware of the restrictions).

> > I realize that is
> > awkward for TDX, where KVM_PRE_FAULT_MEMORY is required to create the
> > sEPT mapping before encrypting, but maybe it would be possible for TDX to
> > just do that implicitly within KVM_TDX_INIT_MEM_REGION?
>
> Yes, and it's what the TDX API used to be like a while ago.
> Locking-wise, Rick confirmed offlist that there's no problem in
> calling kvm_arch_vcpu_pre_fault_memory() from tdx_gmem_post_populate()
> (my fault that it went offlist - email from the phone is hard...).

That's promising. If we go that route it probably makes sense for SNP to
do that as well, even though it's only an optimization in that case.

> To be clear, I have no problem at all reusing the prefaulting code;
> that's better than TDX having to do its own thing. But forcing
> userspace to do two passes is not great (it's already not great that
> TDX_INIT_MEM_REGION has to be a VCPU operation, but that's
> unfortunately unavoidable).

Makes sense.

If we document mutual exclusion between ranges touched by
gmem_populate() vs ranges touched by actual userspace issuance of
KVM_PRE_FAULT_MEMORY (and make sure nothing too crazy happens if users
don't abide by the documentation), then I think most problems go away...

But there is still at least one awkward constraint for SNP:
KVM_PRE_FAULT_MEMORY cannot be called for private GPA ranges until after
SNP_LAUNCH_START is called. This is true even if the GPA range is not one
of the ranges that will get passed to gmem_populate()/SNP_LAUNCH_UPDATE.
The reason for this is that when binding the ASID to the SNP context as
part of SNP_LAUNCH_START, firmware will perform checks to make sure that
ASID is not already being used in the RMP table, and that check will fail
if KVM_PRE_FAULT_MEMORY triggered for a private page before calling
SNP_LAUNCH_START.

So we have this constraint that KVM_PRE_FAULT_MEMORY can't be issued
before SNP_LAUNCH_START. So it makes me wonder if we should just broaden
that constraint and for now just disallow KVM_PRE_FAULT_MEMORY prior to
finalizing a guest, since it'll be easier to lift that restriction later
versus discovering some other sort of edge case and needing to
retroactively place restrictions.

I've taken Isaku's original pre_fault_memory_test and added a new
x86-specific coco_pre_fault_memory_test to try to better document and
exercise these corner cases for SEV and SNP, but I'm hoping it could also
be useful for TDX (hence the generic name). These are based on Pratik's
initial SNP selftests (which are in turn based on kvm/queue + these
patches):

https://github.com/mdroth/linux/commits/snp-uptodate0-kst/

https://github.com/mdroth/linux/commit/dd7d4b1983ceeb653132cfd54ad63f597db85f74

It could be interesting to also add some coverage interleaving of
fallocate()/FALLOC_FL_PUNCH_HOLE as well. I'll look into that more, and
more negative testing depending on the final semantics we end up with.

Thanks,

Mike

> Paolo
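A sketch of the "disallow KVM_PRE_FAULT_MEMORY before finalization" idea being discussed here — the cc_finalized flag is hypothetical (neither its name nor placement comes from this thread), and the loop is simplified; the real ioctl reports partial progress back through the range struct:

long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
				    struct kvm_pre_fault_memory *range)
{
	u64 error_code = PFERR_GUEST_FINAL_MASK;
	gpa_t gpa = range->gpa;
	u64 size = range->size;
	u8 level = PG_LEVEL_4K;
	int r;

	/* Vendor code (e.g. snp_launch_finish(), TDX finalize) would set
	 * this hypothetical flag once launch state is measured/finalized. */
	if (kvm_arch_has_private_mem(vcpu->kvm) &&
	    !READ_ONCE(vcpu->kvm->arch.cc_finalized))
		return -EOPNOTSUPP;

	r = kvm_mmu_reload(vcpu);
	if (r)
		return r;

	while (size) {
		r = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
		if (r < 0)
			return r;

		/* Advance by whatever level the fault handler mapped. */
		gpa += KVM_HPAGE_SIZE(level);
		size = size > KVM_HPAGE_SIZE(level) ?
		       size - KVM_HPAGE_SIZE(level) : 0;
	}
	return 0;
}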
On Mon, 2024-07-15 at 16:47 -0500, Michael Roth wrote:
> Makes sense.
>
> If we document mutual exclusion between ranges touched by
> gmem_populate() vs ranges touched by actual userspace issuance of
> KVM_PRE_FAULT_MEMORY (and make sure nothing too crazy happens if users
> don't abide by the documentation), then I think most problems go away...
>
> But there is still at least one awkward constraint for SNP:
> KVM_PRE_FAULT_MEMORY cannot be called for private GPA ranges until after
> SNP_LAUNCH_START is called. This is true even if the GPA range is not one
> of the ranges that will get passed to gmem_populate()/SNP_LAUNCH_UPDATE.
> The reason for this is that when binding the ASID to the SNP context as
> part of SNP_LAUNCH_START, firmware will perform checks to make sure that
> ASID is not already being used in the RMP table, and that check will fail
> if KVM_PRE_FAULT_MEMORY triggered for a private page before calling
> SNP_LAUNCH_START.
>
> So we have this constraint that KVM_PRE_FAULT_MEMORY can't be issued
> before SNP_LAUNCH_START. So it makes me wonder if we should just broaden
> that constraint and for now just disallow KVM_PRE_FAULT_MEMORY prior to
> finalizing a guest, since it'll be easier to lift that restriction later
> versus discovering some other sort of edge case and needing to
> retroactively place restrictions.
>
> I've taken Isaku's original pre_fault_memory_test and added a new
> x86-specific coco_pre_fault_memory_test to try to better document and
> exercise these corner cases for SEV and SNP, but I'm hoping it could also
> be useful for TDX (hence the generic name). These are based on Pratik's
> initial SNP selftests (which are in turn based on kvm/queue + these
> patches):
>
> https://github.com/mdroth/linux/commits/snp-uptodate0-kst/
>
> https://github.com/mdroth/linux/commit/dd7d4b1983ceeb653132cfd54ad63f597db85f74

From the TDX side it wouldn't be horrible to not have to worry about
userspace mucking around with the mirrored page tables in unexpected ways
during the special period. TDX already has its own "finalized" state in
kvm_tdx, is there something similar on the SEV side we could unify with?

I looked at moving from kvm_arch_vcpu_pre_fault_memory() to directly
calling kvm_tdp_map_page(), so we could potentially put whatever check in
kvm_arch_vcpu_pre_fault_memory(). It required a couple exports:

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 03737f3aaeeb..9004ac597a85 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -277,6 +277,7 @@ extern bool tdp_mmu_enabled;
 #endif
 
 int kvm_tdp_mmu_get_walk_mirror_pfn(struct kvm_vcpu *vcpu, u64 gpa, kvm_pfn_t *pfn);
+int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
 
 static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
 {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7bb6b17b455f..4a3e471ec9fe 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4721,8 +4721,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	return direct_page_fault(vcpu, fault);
 }
 
-static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
-			    u8 *level)
+int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
 {
 	int r;
 
@@ -4759,6 +4758,7 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
 		return -EIO;
 	}
 }
+EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
 
 long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 				    struct kvm_pre_fault_memory *range)
@@ -5770,6 +5770,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
 out:
 	return r;
 }
+EXPORT_SYMBOL_GPL(kvm_mmu_load);
 
 void kvm_mmu_unload(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 9ac0821eb44b..7161ef68f3da 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2809,11 +2809,13 @@ struct tdx_gmem_post_populate_arg {
 static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 				  void __user *src, int order, void *_arg)
 {
+	u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS;
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	struct tdx_gmem_post_populate_arg *arg = _arg;
 	struct kvm_vcpu *vcpu = arg->vcpu;
 	struct kvm_memory_slot *slot;
 	gpa_t gpa = gfn_to_gpa(gfn);
+	u8 level = PG_LEVEL_4K;
 	struct page *page;
 	kvm_pfn_t mmu_pfn;
 	int ret, i;
@@ -2832,6 +2834,10 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 		goto out_put_page;
 	}
 
+	ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
+	if (ret < 0)
+		goto out_put_page;
+
 	read_lock(&kvm->mmu_lock);
 
 	ret = kvm_tdp_mmu_get_walk_mirror_pfn(vcpu, gpa, &mmu_pfn);
@@ -2910,6 +2916,7 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
 	mutex_lock(&kvm->slots_lock);
 	idx = srcu_read_lock(&kvm->srcu);
 
+	kvm_mmu_reload(vcpu);
 	ret = 0;
 	while (region.nr_pages) {
 		if (signal_pending(current)) {
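One detail worth calling out in the diff above: kvm_tdp_map_page() expects a loaded MMU root, which normally only happens on KVM_RUN — hence the kvm_mmu_reload() added to tdx_vcpu_init_mem_region() (and the kvm_mmu_load export it depends on). Stand-alone, with a made-up helper name, the dependency looks roughly like:

/* Hypothetical helper: map one private GPA before the vCPU has ever run.
 * kvm_mmu_reload() is a no-op when a valid root is already loaded, so it
 * is cheap to call defensively. */
static int tdx_map_gpa_before_first_run(struct kvm_vcpu *vcpu, gpa_t gpa)
{
	u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS;
	u8 level = PG_LEVEL_4K;
	int r;

	r = kvm_mmu_reload(vcpu);	/* no KVM_RUN yet: load a root first */
	if (r)
		return r;

	return kvm_tdp_map_page(vcpu, gpa, error_code, &level);
}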
On Mon, Jul 15, 2024 at 10:57:35PM +0000, Edgecombe, Rick P wrote:
> On Mon, 2024-07-15 at 16:47 -0500, Michael Roth wrote:
> > [...]
> >
> > So we have this constraint that KVM_PRE_FAULT_MEMORY can't be issued
> > before SNP_LAUNCH_START. So it makes me wonder if we should just broaden
> > that constraint and for now just disallow KVM_PRE_FAULT_MEMORY prior to
> > finalizing a guest, since it'll be easier to lift that restriction later
> > versus discovering some other sort of edge case and needing to
> > retroactively place restrictions.
> > [...]
>
> From the TDX side it wouldn't be horrible to not have to worry about
> userspace mucking around with the mirrored page tables in unexpected ways
> during the special period. TDX already has its own "finalized" state in
> kvm_tdx, is there something similar on the SEV side we could unify with?

Unfortunately there isn't currently anything in place like that for SNP,
but if we had a common 'finalized' field somewhere it could easily be set
in snp_launch_finish() as well.

> I looked at moving from kvm_arch_vcpu_pre_fault_memory() to directly
> calling kvm_tdp_map_page(), so we could potentially put whatever check in
> kvm_arch_vcpu_pre_fault_memory(). It required a couple exports:

Thanks. I'll give this a spin for SNP as well.

-Mike

> [...]
On Mon, Jul 15, 2024 at 10:57:35PM +0000, Edgecombe, Rick P wrote:
> On Mon, 2024-07-15 at 16:47 -0500, Michael Roth wrote:
> > [...]
> >
> > So we have this constraint that KVM_PRE_FAULT_MEMORY can't be issued
> > before SNP_LAUNCH_START. So it makes me wonder if we should just broaden
> > that constraint and for now just disallow KVM_PRE_FAULT_MEMORY prior to
> > finalizing a guest, since it'll be easier to lift that restriction later
> > versus discovering some other sort of edge case and needing to
> > retroactively place restrictions.
> > [...]
>
> From the TDX side it wouldn't be horrible to not have to worry about
> userspace mucking around with the mirrored page tables in unexpected ways
> during the special period. TDX already has its own "finalized" state in
> kvm_tdx, is there something similar on the SEV side we could unify with?

While trying to do pre-mapping in sev_gmem_post_populate() like you've
done below for TDX, I hit another issue that I think would be avoided by
enforcing finalization (or otherwise needs some alternative fix within
this series).

I'd already mentioned the issue with gmem_prepare() putting pages in an
unexpected RMP state if called prior to initializing the same GPA range
in gmem_populate(). This gets handled gracefully, however, when the
firmware call is issued to encrypt/measure the pages, since it re-checks
the page's RMP state. However, if another thread is issuing
KVM_PRE_FAULT_MEMORY and triggers a gmem_prepare() just after
gmem_populate() places the pages in a private RMP state, then
gmem_prepare()->kvm_gmem_get_pfn() will trigger the clear_highpage() call
in kvm_gmem_prepare_folio(), because the uptodate flag isn't set until
after sev_gmem_post_populate() returns.

So I ended up just calling kvm_arch_vcpu_pre_fault_memory() on the full
GPA range after the post-populate callback returns:

https://github.com/mdroth/linux/commit/da4e7465ced1a708ff1c5e9ab27c570de7e8974e

...

But a much larger concern is that even without this series, but just
plain kvm/queue, KVM_PRE_FAULT_MEMORY can still race with
sev_gmem_post_populate() in bad ways, so I think for 6.11 we should
consider locking this finalization requirement in and applying something
like the below fix:

https://github.com/mdroth/linux/commit/7095ba6ee8f050af11620daaf6c2219fd0bbb1c3

Or if that's potentially too big a change to decide right now, we could
just make KVM_PRE_FAULT_MEMORY return EOPNOTSUPP for
kvm_arch_has_private_mem() until we've had more time to work out the
missing bits for CoCo there.

-Mike

> I looked at moving from kvm_arch_vcpu_pre_fault_memory() to directly
> calling kvm_tdp_map_page(), so we could potentially put whatever check in
> kvm_arch_vcpu_pre_fault_memory(). It required a couple exports:
>
> [...]
On Thu, Jul 11, 2024 at 06:27:52PM -0400, Paolo Bonzini wrote:
> Do not allow populating the same page twice with startup data. In the
> case of SEV-SNP, for example, the firmware does not allow it anyway,
> since the launch-update operation is only possible on pages that are
> still shared in the RMP.
>
> Even if it worked, kvm_gmem_populate()'s callback is meant to have side
> effects such as updating launch measurements, and updating the same
> page twice is unlikely to have the desired results.
>
> Races between calls to the ioctl are not possible because kvm_gmem_populate()
> holds slots_lock and the VM should not be running. But again, even if
> this worked on other confidential computing technology, it doesn't matter
> to guest_memfd.c whether this is an intentional attempt to do something
> fishy, or missing synchronization in userspace, or even something
> intentional. One of the racers wins, and the page is initialized by

I think one of those "intentional"s was meant to be an "unintentional".

> either kvm_gmem_prepare_folio() or kvm_gmem_populate().
>
> Anyway, out of paranoia, adjust sev_gmem_post_populate() anyway to use
> the same errno that kvm_gmem_populate() is using.
>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Assuming based on the discussion here that there will be some logic added
to enforce that KVM_PRE_FAULT_MEMORY can only happen after finalization, I
think the new checks make sense.

Reviewed-by: Michael Roth <michael.roth@amd.com>

> [...]
Do not allow populating the same page twice with startup data. In the
case of SEV-SNP, for example, the firmware does not allow it anyway,
since the launch-update operation is only possible on pages that are
still shared in the RMP.

Even if it worked, kvm_gmem_populate()'s callback is meant to have side
effects such as updating launch measurements, and updating the same
page twice is unlikely to have the desired results.

Races between calls to the ioctl are not possible because
kvm_gmem_populate() holds slots_lock and the VM should not be running.
But again, even if this worked on other confidential computing
technology, it doesn't matter to guest_memfd.c whether this is an
intentional attempt to do something fishy, or missing synchronization in
userspace, or even something intentional. One of the racers wins, and the
page is initialized by either kvm_gmem_prepare_folio() or
kvm_gmem_populate().

Anyway, out of paranoia, adjust sev_gmem_post_populate() anyway to use
the same errno that kvm_gmem_populate() is using.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/svm/sev.c | 2 +-
 virt/kvm/guest_memfd.c | 7 +++++++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index df8818759698..397ef9e70182 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2213,7 +2213,7 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf
 		if (ret || assigned) {
 			pr_debug("%s: Failed to ensure GFN 0x%llx RMP entry is initial shared state, ret: %d assigned: %d\n",
 				 __func__, gfn, ret, assigned);
-			ret = -EINVAL;
+			ret = ret ? -EINVAL : -EEXIST;
 			goto err;
 		}
 
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 509360eefea5..266810bb91c9 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -650,6 +650,13 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
 			break;
 		}
 
+		if (folio_test_uptodate(folio)) {
+			folio_unlock(folio);
+			folio_put(folio);
+			ret = -EEXIST;
+			break;
+		}
+
 		folio_unlock(folio);
 		if (!IS_ALIGNED(gfn, (1 << max_order)) ||
 		    (npages - i) < (1 << max_order))
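Net effect for callers of kvm_gmem_populate(), as a hedged demo (the wrapper below is hypothetical): a second population attempt on the same range now fails fast with -EEXIST, either from the new uptodate check in guest_memfd.c or, on SEV-SNP, from the adjusted errno in sev_gmem_post_populate():

/* Demonstration only: populate the same range twice. */
static long populate_twice_demo(struct kvm *kvm, gfn_t gfn, void __user *src,
				long npages, kvm_gmem_populate_cb cb,
				void *opaque)
{
	long ret;

	/* First call: folios are not yet uptodate, so they are populated
	 * via the callback and then marked uptodate. */
	ret = kvm_gmem_populate(kvm, gfn, src, npages, cb, opaque);
	if (ret < 0)
		return ret;

	/* Second call: folio_test_uptodate() is now true, so this fails
	 * with -EEXIST before the callback runs. */
	return kvm_gmem_populate(kvm, gfn, src, npages, cb, opaque);
}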