Message ID | 20231016115028.996656-5-michael.roth@amd.com (mailing list archive) |
---|---|
State | New |
Series | KVM: gmem hooks/changes needed for x86 (other archs?) |
On 16/10/2023 12:50, Michael Roth wrote:
> In some cases, like with SEV-SNP, guest memory needs to be updated in a
> platform-specific manner before it can be safely freed back to the host.
> Wire up arch-defined hooks to the .free_folio kvm_gmem_aops callback to
> allow for special handling of this sort when freeing memory in response
> to FALLOC_FL_PUNCH_HOLE operations and when releasing the inode, and go
> ahead and define an arch-specific hook for x86 since it will be needed
> for handling memory used for SEV-SNP guests.

Hi all,

Arm CCA has a similar need to prepare/unprepare memory (granule
delegate/undelegate using our terminology) before it is used for
protected memory.

However I see a problem with the current gmem implementation that the
"invalidations" are not precise enough for our RMI API. When punching a
hole in the memfd the code currently hits the same path (ending in
kvm_unmap_gfn_range()) as if a VMA is modified in the same range (for
the shared version).

The Arm CCA architecture doesn't allow the protected memory to be removed and
refaulted without the permission of the guest (the memory contents would be
wiped in this case).

One option that I've considered is to implement a separate CCA ioctl to
notify KVM whether the memory should be mapped protected. The
invalidations would then be ignored on ranges that are currently
protected for this guest.

This 'solves' the problem nicely except for the case where the VMM
deliberately punches holes in memory which the guest is using.

The issue in this case is that there's no way of failing the punch hole
operation - we can detect that the memory is in use and shouldn't be
freed, but this callback doesn't give the opportunity to actually block
the freeing of the memory.

Sadly there's no easy way to map from a physical page in a gmem back to
which VM (and where in the VM) the page is mapped. So actually ripping
the page out of the appropriate VM isn't really possible in this case.

How is this situation handled on x86? Is it possible to invalidate and
then refault a protected page without affecting the memory contents? My
guess is yes and that this is a CCA-specific problem - is my understanding
correct?

My current thoughts for CCA are one of three options:

1. Represent shared and protected memory as two separate memslots. This
matches the underlying architecture more closely (the top address bit is
repurposed as a 'shared' flag), but I don't like it because it's a
deviation from other CoCo architectures (notably pKVM).

2. Allow punch-hole to fail on CCA if the memory is mapped into the
guest's protected space. Again, this is CCA being different and also
creates nasty corner cases where the gmem descriptor could have to
outlive the VMM - so looks like a potential source of memory leaks.

3. 'Fix' the invalidation to provide more precise semantics. I haven't
yet prototyped it but it might be possible to simply provide a flag from
kvm_gmem_invalidate_begin specifying that the invalidation is for the
protected memory. KVM would then only unmap the protected memory when
this flag is set (avoiding issues with VMA updates causing spurious unmaps).

Fairly obviously (3) is my preferred option, but it relies on the
guarantee that the "invalidation" is actually a precise set of
addresses where the memory is actually being freed.

Comments, thoughts, objections welcome!

Steve

> Signed-off-by: Michael Roth <michael.roth@amd.com>
> ---
>  arch/x86/include/asm/kvm-x86-ops.h |  1 +
>  arch/x86/include/asm/kvm_host.h    |  1 +
>  arch/x86/kvm/x86.c                 |  7 +++++++
>  include/linux/kvm_host.h           |  4 ++++
>  virt/kvm/Kconfig                   |  4 ++++
>  virt/kvm/guest_memfd.c             | 14 ++++++++++++++
>  6 files changed, 31 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 0c113f42d5c7..f1505a5fa781 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -135,6 +135,7 @@ KVM_X86_OP(complete_emulated_msr)
>  KVM_X86_OP(vcpu_deliver_sipi_vector)
>  KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
>  KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
> +KVM_X86_OP_OPTIONAL(gmem_invalidate)
>
>  #undef KVM_X86_OP
>  #undef KVM_X86_OP_OPTIONAL
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 66fc89d1858f..dbec74783f48 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1754,6 +1754,7 @@ struct kvm_x86_ops {
>  	unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
>
>  	int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
> +	void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
>  };
>
>  struct kvm_x86_nested_ops {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 33a4cc33d86d..0e95c3a95e59 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13308,6 +13308,13 @@ int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord
>  }
>  #endif
>
> +#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
> +void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
> +{
> +	static_call_cond(kvm_x86_gmem_invalidate)(start, end);
> +}
> +#endif
> +
>  int kvm_spec_ctrl_test_value(u64 value)
>  {
>  	/*
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index c7f82c2f1bcf..840a5be5962a 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2429,4 +2429,8 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
>  int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order);
>  #endif
>
> +#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
> +void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end);
> +#endif
> +
>  #endif
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 992cf6ed86ef..7fd1362a7ebe 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -113,3 +113,7 @@ config KVM_GENERIC_PRIVATE_MEM
>  config HAVE_KVM_GMEM_PREPARE
>  	bool
>  	depends on KVM_PRIVATE_MEM
> +
> +config HAVE_KVM_GMEM_INVALIDATE
> +	bool
> +	depends on KVM_PRIVATE_MEM
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 72ff8b7b31d5..b4c4df259fb8 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -369,12 +369,26 @@ static int kvm_gmem_error_page(struct address_space *mapping, struct page *page)
>  	return MF_DELAYED;
>  }
>
> +#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
> +static void kvm_gmem_free_folio(struct folio *folio)
> +{
> +	struct page *page = folio_page(folio, 0);
> +	kvm_pfn_t pfn = page_to_pfn(page);
> +	int order = folio_order(folio);
> +
> +	kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
> +}
> +#endif
> +
>  static const struct address_space_operations kvm_gmem_aops = {
>  	.dirty_folio = noop_dirty_folio,
>  #ifdef CONFIG_MIGRATION
>  	.migrate_folio = kvm_gmem_migrate_folio,
>  #endif
>  	.error_remove_page = kvm_gmem_error_page,
> +#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
> +	.free_folio = kvm_gmem_free_folio,
> +#endif
>  };
>
>  static int kvm_gmem_getattr(struct mnt_idmap *idmap, const struct path *path,

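For readers following the quoted patch, the range handed to kvm_arch_gmem_invalidate() is just the folio's first PFN plus 2^order base pages, with an exclusive end. A minimal userspace sketch of that arithmetic follows; `kvm_pfn_t` here is a plain typedef standing in for the kernel type, and `struct pfn_range` and `folio_pfn_range()` are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Stand-in for the kernel's kvm_pfn_t; an assumption for this sketch. */
typedef uint64_t kvm_pfn_t;

/* Hypothetical helper type: a half-open PFN range [start, end). */
struct pfn_range {
	kvm_pfn_t start;
	kvm_pfn_t end;
};

/*
 * Same arithmetic as the quoted kvm_gmem_free_folio(): a folio of the
 * given order covers (1 << order) base pages starting at first_pfn.
 */
struct pfn_range folio_pfn_range(kvm_pfn_t first_pfn, unsigned int order)
{
	struct pfn_range r = { first_pfn, first_pfn + (1ull << order) };

	return r;
}
```

For an order-9 folio (2 MiB with 4 KiB base pages) starting at PFN 0x1000, this yields the range [0x1000, 0x1200).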
On Fri, Feb 09, 2024, Steven Price wrote:
> On 16/10/2023 12:50, Michael Roth wrote:
> > In some cases, like with SEV-SNP, guest memory needs to be updated in a
> > platform-specific manner before it can be safely freed back to the host.
> > Wire up arch-defined hooks to the .free_folio kvm_gmem_aops callback to
> > allow for special handling of this sort when freeing memory in response
> > to FALLOC_FL_PUNCH_HOLE operations and when releasing the inode, and go
> > ahead and define an arch-specific hook for x86 since it will be needed
> > for handling memory used for SEV-SNP guests.
>
> Hi all,
>
> Arm CCA has a similar need to prepare/unprepare memory (granule
> delegate/undelegate using our terminology) before it is used for
> protected memory.
>
> However I see a problem with the current gmem implementation that the
> "invalidations" are not precise enough for our RMI API. When punching a
> hole in the memfd the code currently hits the same path (ending in
> kvm_unmap_gfn_range()) as if a VMA is modified in the same range (for
> the shared version).
>
> The Arm CCA architecture doesn't allow the protected memory to be removed and
> refaulted without the permission of the guest (the memory contents would be
> wiped in this case).

TDX behaves almost exactly like CCA. Well, that's not technically true, strictly
speaking, as there are TDX APIs that do allow for *temporarily* marking mappings
!PRESENT, but those aren't in play for invalidation events like this.

SNP does allow zapping page table mappings, but fully removing a page, as PUNCH_HOLE
would do, is destructive, so SNP also behaves the same way for all intents and
purposes.

> One option that I've considered is to implement a separate CCA ioctl to
> notify KVM whether the memory should be mapped protected.

That's what KVM_SET_MEMORY_ATTRIBUTES+KVM_MEMORY_ATTRIBUTE_PRIVATE is for, no?

> The invalidations would then be ignored on ranges that are currently
> protected for this guest.

That's backwards. Invalidations on a guest_memfd should affect only *protected*
mappings. And for that, the plan/proposal is to plumb only_{shared,private} flags
into "struct kvm_gfn_range"[1] so that guest_memfd invalidations don't zap shared
mappings, and mmu_notifier invalidations don't zap private mappings. Sample usage
in the TDX context[2] (disclaimer, I'm pretty sure I didn't write most of that
patch; I only provided a rough sketch).

[1] https://lore.kernel.org/all/20231027182217.3615211-13-seanjc@google.com
[2] https://lore.kernel.org/all/0b308fb6dd52bafe7153086c7f54bfad03da74b1.1705965635.git.isaku.yamahata@intel.com

> This 'solves' the problem nicely except for the case where the VMM
> deliberately punches holes in memory which the guest is using.

I don't see what problem there is to solve in this case. PUNCH_HOLE is destructive,
so don't do that.

> The issue in this case is that there's no way of failing the punch hole
> operation - we can detect that the memory is in use and shouldn't be
> freed, but this callback doesn't give the opportunity to actually block
> the freeing of the memory.

Why is this KVM's problem? E.g. the same exact thing happens without guest_memfd
if userspace munmap()s memory the guest is using.

> Sadly there's no easy way to map from a physical page in a gmem back to
> which VM (and where in the VM) the page is mapped. So actually ripping
> the page out of the appropriate VM isn't really possible in this case.

I don't follow. guest_memfd has a 1:1 binding with a VM *and* a gfn, how can you
not know what exactly needs to be invalidated?

> How is this situation handled on x86? Is it possible to invalidate and
> then refault a protected page without affecting the memory contents? My
> guess is yes and that is a CCA specific problem - is my understanding
> correct?
>
> My current thoughts for CCA are one of three options:
>
> 1. Represent shared and protected memory as two separate memslots. This
> matches the underlying architecture more closely (the top address bit is
> repurposed as a 'shared' flag), but I don't like it because it's a
> deviation from other CoCo architectures (notably pKVM).
>
> 2. Allow punch-hole to fail on CCA if the memory is mapped into the
> guest's protected space. Again, this is CCA being different and also
> creates nasty corner cases where the gmem descriptor could have to
> outlive the VMM - so looks like a potential source of memory leaks.
>
> 3. 'Fix' the invalidation to provide more precise semantics. I haven't
> yet prototyped it but it might be possible to simply provide a flag from
> kvm_gmem_invalidate_begin specifying that the invalidation is for the
> protected memory. KVM would then only unmap the protected memory when
> this flag is set (avoiding issues with VMA updates causing spurious unmaps).
>
> Fairly obviously (3) is my preferred option, but it relies on the
> guarantees that the "invalidation" is actually a precise set of
> addresses where the memory is actually being freed.

#3 is what we are planning for x86, and except for the only_{shared,private} flags,
the requisite functionality should already be in Linus' tree, though it does need
to be wired up for ARM.

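To make the only_{shared,private} idea above concrete, here is a small userspace model of the filtering decision. It is a sketch under stated assumptions, not the kernel's code: `struct gfn_range_model`, `enum mapping_kind` and `should_zap()` are hypothetical names, and only the two flag names are taken from the discussion.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t; /* stand-in for the kernel type */

/* Hypothetical model of a range descriptor carrying the proposed flags. */
struct gfn_range_model {
	gfn_t start;
	gfn_t end;         /* exclusive */
	bool only_private; /* assumed set for guest_memfd invalidations */
	bool only_shared;  /* assumed set for mmu_notifier invalidations */
};

enum mapping_kind { MAPPING_SHARED, MAPPING_PRIVATE };

/* Decide whether a mapping of the given kind at gfn should be zapped. */
bool should_zap(const struct gfn_range_model *r, gfn_t gfn,
		enum mapping_kind kind)
{
	if (gfn < r->start || gfn >= r->end)
		return false;
	if (r->only_private && kind != MAPPING_PRIVATE)
		return false;
	if (r->only_shared && kind != MAPPING_SHARED)
		return false;
	return true;
}
```

With this shape, a hole punch in guest_memfd leaves shared mappings in the same gfn range alone, and a VMA change leaves private mappings alone, which is the property CCA needs.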
Hi Sean,

Thanks for the reply.

On 09/02/2024 14:28, Sean Christopherson wrote:
> On Fri, Feb 09, 2024, Steven Price wrote:
>> On 16/10/2023 12:50, Michael Roth wrote:
>>> In some cases, like with SEV-SNP, guest memory needs to be updated in a
>>> platform-specific manner before it can be safely freed back to the host.
>>> Wire up arch-defined hooks to the .free_folio kvm_gmem_aops callback to
>>> allow for special handling of this sort when freeing memory in response
>>> to FALLOC_FL_PUNCH_HOLE operations and when releasing the inode, and go
>>> ahead and define an arch-specific hook for x86 since it will be needed
>>> for handling memory used for SEV-SNP guests.
>>
>> Hi all,
>>
>> Arm CCA has a similar need to prepare/unprepare memory (granule
>> delegate/undelegate using our terminology) before it is used for
>> protected memory.
>>
>> However I see a problem with the current gmem implementation that the
>> "invalidations" are not precise enough for our RMI API. When punching a
>> hole in the memfd the code currently hits the same path (ending in
>> kvm_unmap_gfn_range()) as if a VMA is modified in the same range (for
>> the shared version).
>>
>> The Arm CCA architecture doesn't allow the protected memory to be removed and
>> refaulted without the permission of the guest (the memory contents would be
>> wiped in this case).
>
> TDX behaves almost exactly like CCA. Well, that's not technically true, strictly
> speaking, as there are TDX APIs that do allow for *temporarily* marking mappings
> !PRESENT, but those aren't in play for invalidation events like this.

Ok, great, I was under the impression they were similar.

> SNP does allow zapping page table mappings, but fully removing a page, as PUNCH_HOLE
> would do, is destructive, so SNP also behaves the same way for all intents and
> purposes.

Zapping page table mappings is what the invalidate calls imply. This is
something CCA can't do. Obviously fully removing the page would be
destructive.

>> One option that I've considered is to implement a separate CCA ioctl to
>> notify KVM whether the memory should be mapped protected.
>
> That's what KVM_SET_MEMORY_ATTRIBUTES+KVM_MEMORY_ATTRIBUTE_PRIVATE is for, no?

Sorry, I really didn't explain that well. Yes, effectively this is the
attribute flag, but there are corner cases for destruction of the VM. My
thought was that if the VMM wanted to tear down part of the protected
range (without making it shared) then a separate ioctl would be needed
to notify KVM of the unmap.

>> The invalidations would then be ignored on ranges that are currently
>> protected for this guest.
>
> That's backwards. Invalidations on a guest_memfd should affect only *protected*
> mappings. And for that, the plan/proposal is to plumb only_{shared,private} flags
> into "struct kvm_gfn_range"[1] so that guest_memfd invalidations don't zap shared
> mappings, and mmu_notifier invalidations don't zap private mappings. Sample usage
> in the TDX context[2] (disclaimer, I'm pretty sure I didn't write most of that
> patch; I only provided a rough sketch).

Aha, this sounds much like my option 3 below - a way to tell if the
invalidate comes from guest_memfd as opposed to VMA changes.

> [1] https://lore.kernel.org/all/20231027182217.3615211-13-seanjc@google.com
> [2] https://lore.kernel.org/all/0b308fb6dd52bafe7153086c7f54bfad03da74b1.1705965635.git.isaku.yamahata@intel.com
>
>> This 'solves' the problem nicely except for the case where the VMM
>> deliberately punches holes in memory which the guest is using.
>
> I don't see what problem there is to solve in this case. PUNCH_HOLE is destructive,
> so don't do that.

A well behaving VMM wouldn't PUNCH_HOLE when the guest is using it, but
my concern here is a VMM which is trying to break the host. In this case
either the PUNCH_HOLE needs to fail, or we actually need to recover the
memory from the guest (effectively killing the guest in the process).

>> The issue in this case is that there's no way of failing the punch hole
>> operation - we can detect that the memory is in use and shouldn't be
>> freed, but this callback doesn't give the opportunity to actually block
>> the freeing of the memory.
>
> Why is this KVM's problem? E.g. the same exact thing happens without guest_memfd
> if userspace munmap()s memory the guest is using.

Indeed. The difference here is that for a normal non-realm guest the
pages can be removed from the page-table and refaulted on a later
access. Indeed there's nothing stopping the VMM from freeing the
pages and reallocating them later.

For a realm guest if the memory is pulled from the guest then the guest
is effectively dead (at least until migration is implemented but even
then there's going to be a specific controlled mechanism).

>> Sadly there's no easy way to map from a physical page in a gmem back to
>> which VM (and where in the VM) the page is mapped. So actually ripping
>> the page out of the appropriate VM isn't really possible in this case.
>
> I don't follow. guest_memfd has a 1:1 binding with a VM *and* a gfn, how can you
> not know what exactly needs to be invalidated?

At the point that gmem calls kvm_mmu_unmap_gfn_range() the fact that the
range is a gmem is lost.

>> How is this situation handled on x86? Is it possible to invalidate and
>> then refault a protected page without affecting the memory contents? My
>> guess is yes and that is a CCA specific problem - is my understanding
>> correct?
>>
>> My current thoughts for CCA are one of three options:
>>
>> 1. Represent shared and protected memory as two separate memslots. This
>> matches the underlying architecture more closely (the top address bit is
>> repurposed as a 'shared' flag), but I don't like it because it's a
>> deviation from other CoCo architectures (notably pKVM).
>>
>> 2. Allow punch-hole to fail on CCA if the memory is mapped into the
>> guest's protected space. Again, this is CCA being different and also
>> creates nasty corner cases where the gmem descriptor could have to
>> outlive the VMM - so looks like a potential source of memory leaks.
>>
>> 3. 'Fix' the invalidation to provide more precise semantics. I haven't
>> yet prototyped it but it might be possible to simply provide a flag from
>> kvm_gmem_invalidate_begin specifying that the invalidation is for the
>> protected memory. KVM would then only unmap the protected memory when
>> this flag is set (avoiding issues with VMA updates causing spurious unmaps).
>>
>> Fairly obviously (3) is my preferred option, but it relies on the
>> guarantees that the "invalidation" is actually a precise set of
>> addresses where the memory is actually being freed.
>
> #3 is what we are planning for x86, and except for the only_{shared,private} flags,
> the requisite functionality should already be in Linus' tree, though it does need
> to be wired up for ARM.

Thanks, looks like the only_{shared,private} flags should do it. My only
worry about that solution was that it implicitly changes the
"invalidation" when only_private==1 to a precise list of pages that are
to be unmapped. Whereas for a normal guest it's only a performance issue
if a larger region is invalidated, for a CoCo guest it would be fatal to
the guest.

I'll cherry-pick the "KVM: Add new members to struct kvm_gfn_range to
operate on" patch from the TDX tree as I think this should do the trick.
I have hacked up something similar and it looks like it should work.

Thanks,

Steve

On Fri, Feb 09, 2024, Steven Price wrote:
> >> One option that I've considered is to implement a separate CCA ioctl to
> >> notify KVM whether the memory should be mapped protected.
> >
> > That's what KVM_SET_MEMORY_ATTRIBUTES+KVM_MEMORY_ATTRIBUTE_PRIVATE is for, no?
>
> Sorry, I really didn't explain that well. Yes effectively this is the
> attribute flag, but there's corner cases for destruction of the VM. My
> thought was that if the VMM wanted to tear down part of the protected
> range (without making it shared) then a separate ioctl would be needed
> to notify KVM of the unmap.

No new uAPI should be needed, because the only time a benign VMM should
do this is if the guest also knows the memory is being removed, in which case
PUNCH_HOLE will suffice.

> >> This 'solves' the problem nicely except for the case where the VMM
> >> deliberately punches holes in memory which the guest is using.
> >
> > I don't see what problem there is to solve in this case. PUNCH_HOLE is destructive,
> > so don't do that.
>
> A well behaving VMM wouldn't PUNCH_HOLE when the guest is using it, but
> my concern here is a VMM which is trying to break the host. In this case
> either the PUNCH_HOLE needs to fail, or we actually need to recover the
> memory from the guest (effectively killing the guest in the process).

The latter. IIRC, we talked about this exact case somewhere in the hour-long
rambling discussion on guest_memfd at PUCK[1]. And we've definitely discussed
this multiple times on-list, though I don't know that there is a single thread
that captures the entire plan.

The TL;DR is that gmem will invoke an arch hook for every "struct kvm_gmem"
instance that's attached to a given guest_memfd inode when a page is being fully
removed, i.e. when a page is being freed back to the normal memory pool. Something
like this proposed SNP patch[2].

Mike, do you have WIP patches you can share?

[1] https://drive.google.com/corp/drive/folders/116YTH1h9yBZmjqeJc03cV4_AhSe-VBkc?resourcekey=0-sOGeFEUi60-znJJmZBsTHQ
[2] https://lore.kernel.org/all/20231230172351.574091-30-michael.roth@amd.com

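As a reminder of what "PUNCH_HOLE" looks like from the VMM side, the sketch below punches a hole with fallocate(2) and checks that the range reads back as zeroes, i.e. that the operation is destructive. It uses an ordinary memfd from memfd_create() as a stand-in, which is an assumption made so the example runs without KVM; a real guest_memfd is created through KVM and cannot be read back like this. `punch_hole_demo()` and `DEMO_LEN` are hypothetical names.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define DEMO_LEN 4096 /* illustrative length, not tied to the page size */

/*
 * Fill the start of a memfd with a pattern, punch a hole over it, and
 * verify the range now reads as zeroes. Returns 0 on success, -1 on
 * any failure.
 */
int punch_hole_demo(void)
{
	unsigned char buf[DEMO_LEN];
	int ret = -1;
	size_t i;
	int fd = memfd_create("punch-hole-demo", 0);

	if (fd < 0)
		return -1;

	memset(buf, 0xAB, sizeof(buf));
	if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
		goto out;

	/* FALLOC_FL_PUNCH_HOLE must be combined with FALLOC_FL_KEEP_SIZE. */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      0, DEMO_LEN))
		goto out;

	if (pread(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
		goto out;

	ret = 0;
	for (i = 0; i < sizeof(buf); i++)
		if (buf[i] != 0)
			ret = -1; /* old contents survived the punch */
out:
	close(fd);
	return ret;
}
```

Note that fallocate() reports success here regardless of who was using the data, which mirrors the point in the thread that the free path has no way to veto the operation.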
On Fri, Feb 09, 2024 at 07:13:13AM -0800, Sean Christopherson wrote:
> On Fri, Feb 09, 2024, Steven Price wrote:
> > >> One option that I've considered is to implement a separate CCA ioctl to
> > >> notify KVM whether the memory should be mapped protected.
> > >
> > > That's what KVM_SET_MEMORY_ATTRIBUTES+KVM_MEMORY_ATTRIBUTE_PRIVATE is for, no?
> >
> > Sorry, I really didn't explain that well. Yes effectively this is the
> > attribute flag, but there's corner cases for destruction of the VM. My
> > thought was that if the VMM wanted to tear down part of the protected
> > range (without making it shared) then a separate ioctl would be needed
> > to notify KVM of the unmap.
>
> No new uAPI should be needed, because the only time a benign VMM should
> do this is if the guest also knows the memory is being removed, in which case
> PUNCH_HOLE will suffice.
>
> > >> This 'solves' the problem nicely except for the case where the VMM
> > >> deliberately punches holes in memory which the guest is using.
> > >
> > > I don't see what problem there is to solve in this case. PUNCH_HOLE is destructive,
> > > so don't do that.
> >
> > A well behaving VMM wouldn't PUNCH_HOLE when the guest is using it, but
> > my concern here is a VMM which is trying to break the host. In this case
> > either the PUNCH_HOLE needs to fail, or we actually need to recover the
> > memory from the guest (effectively killing the guest in the process).
>
> The latter. IIRC, we talked about this exact case somewhere in the hour-long
> rambling discussion on guest_memfd at PUCK[1]. And we've definitely discussed
> this multiple times on-list, though I don't know that there is a single thread
> that captures the entire plan.
>
> The TL;DR is that gmem will invoke an arch hook for every "struct kvm_gmem"
> instance that's attached to a given guest_memfd inode when a page is being fully
> removed, i.e. when a page is being freed back to the normal memory pool. Something
> like this proposed SNP patch[2].
>
> Mike, do you have WIP patches you can share?

Sorry, I missed this query earlier. I'm a bit confused though, I thought
the kvm_arch_gmem_invalidate() hook provided in this patch was what we
ended up agreeing on during the PUCK call in question.

There was an open question about what to do if a use-case came along
where we needed to pass additional parameters to
kvm_arch_gmem_invalidate() other than just the start/end PFN range for
the pages being freed, but we'd determined that SNP and TDX did not
currently need this, so I didn't have any changes planned in this
regard.

If we now have such a need, what we had proposed was to modify
__filemap_remove_folio()/page_cache_delete() to defer setting
folio->mapping to NULL, so that kvm_gmem_free_folio() can still use
mapping->i_private_list to get the list of gmem/KVM instances and pass
them on via kvm_arch_gmem_invalidate().

So that's doable, but it's not clear from this discussion that that's
needed. If the idea is to block/kill the guest if the VMM tries to hole-punch,
and ARM CCA already has plans to wire up the shared/private flags in
kvm_unmap_gfn_range(), wouldn't that have all the information needed to
kill that guest? At that point, kvm_gmem_free_folio() can handle
additional per-page cleanup (with additional gmem/KVM info plumbed in
if necessary).

-Mike

[1] https://lore.kernel.org/kvm/20240202230611.351544-1-seanjc@google.com/T/

>
> [1] https://drive.google.com/corp/drive/folders/116YTH1h9yBZmjqeJc03cV4_AhSe-VBkc?resourcekey=0-sOGeFEUi60-znJJmZBsTHQ
> [2] https://lore.kernel.org/all/20231230172351.574091-30-michael.roth@amd.com

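The deferred folio->mapping idea above can be pictured with a small userspace model: if the mapping pointer is still set when the free hook runs, the hook can walk the attached instances and pass each one to the arch callback; if it has already been cleared, those instances are unreachable. Everything below is a hypothetical stand-in (`gmem_binding`, `mapping_model`, `folio_model`, `free_folio_model()`, `count_pages()`); none of it is the kernel's actual data layout.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t kvm_pfn_t; /* stand-in for the kernel type */

/* Hypothetical models of a gmem instance list hanging off a mapping. */
struct gmem_binding {
	int vm_id;
	struct gmem_binding *next;
};

struct mapping_model {
	struct gmem_binding *bindings;
};

struct folio_model {
	kvm_pfn_t pfn;
	unsigned int order;
	struct mapping_model *mapping; /* NULL once "removed" */
};

typedef void (*arch_invalidate_fn)(int vm_id, kvm_pfn_t start,
				   kvm_pfn_t end, void *ctx);

/*
 * Model of a free hook that needs per-instance context. Returns how
 * many instances were notified; 0 if the mapping was already cleared.
 */
int free_folio_model(const struct folio_model *folio,
		     arch_invalidate_fn fn, void *ctx)
{
	kvm_pfn_t end = folio->pfn + (1ull << folio->order);
	const struct gmem_binding *b;
	int notified = 0;

	if (!folio->mapping)
		return 0;

	for (b = folio->mapping->bindings; b; b = b->next) {
		fn(b->vm_id, folio->pfn, end, ctx);
		notified++;
	}
	return notified;
}

/* Example callback: accumulate the number of pages invalidated. */
void count_pages(int vm_id, kvm_pfn_t start, kvm_pfn_t end, void *ctx)
{
	(void)vm_id;
	*(uint64_t *)ctx += end - start;
}
```

The hook in the patch under discussion takes only the PFN range, so this extra plumbing would matter only if an architecture turned out to need the per-VM context.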
On Mon, Mar 11, 2024, Michael Roth wrote:
> On Fri, Feb 09, 2024 at 07:13:13AM -0800, Sean Christopherson wrote:
> > On Fri, Feb 09, 2024, Steven Price wrote:
> > > >> One option that I've considered is to implement a separate CCA ioctl to
> > > >> notify KVM whether the memory should be mapped protected.
> > > >
> > > > That's what KVM_SET_MEMORY_ATTRIBUTES+KVM_MEMORY_ATTRIBUTE_PRIVATE is for, no?
> > >
> > > Sorry, I really didn't explain that well. Yes effectively this is the
> > > attribute flag, but there's corner cases for destruction of the VM. My
> > > thought was that if the VMM wanted to tear down part of the protected
> > > range (without making it shared) then a separate ioctl would be needed
> > > to notify KVM of the unmap.
> >
> > No new uAPI should be needed, because the only time a benign VMM should
> > do this is if the guest also knows the memory is being removed, in which case
> > PUNCH_HOLE will suffice.
> >
> > > >> This 'solves' the problem nicely except for the case where the VMM
> > > >> deliberately punches holes in memory which the guest is using.
> > > >
> > > > I don't see what problem there is to solve in this case. PUNCH_HOLE is destructive,
> > > > so don't do that.
> > >
> > > A well behaving VMM wouldn't PUNCH_HOLE when the guest is using it, but
> > > my concern here is a VMM which is trying to break the host. In this case
> > > either the PUNCH_HOLE needs to fail, or we actually need to recover the
> > > memory from the guest (effectively killing the guest in the process).
> >
> > The latter. IIRC, we talked about this exact case somewhere in the hour-long
> > rambling discussion on guest_memfd at PUCK[1]. And we've definitely discussed
> > this multiple times on-list, though I don't know that there is a single thread
> > that captures the entire plan.
> >
> > The TL;DR is that gmem will invoke an arch hook for every "struct kvm_gmem"
> > instance that's attached to a given guest_memfd inode when a page is being fully
> > removed, i.e. when a page is being freed back to the normal memory pool. Something
> > like this proposed SNP patch[2].
> >
> > Mike, do you have WIP patches you can share?
>
> Sorry, I missed this query earlier. I'm a bit confused though, I thought
> the kvm_arch_gmem_invalidate() hook provided in this patch was what we
> ended up agreeing on during the PUCK call in question.

Heh, I trust your memory of things far more than I trust mine. I'm just proving
Cunningham's Law. :-)

> There was an open question about what to do if a use-case came along
> where we needed to pass additional parameters to
> kvm_arch_gmem_invalidate() other than just the start/end PFN range for
> the pages being freed, but we'd determined that SNP and TDX did not
> currently need this, so I didn't have any changes planned in this
> regard.
>
> If we now have such a need, what we had proposed was to modify
> __filemap_remove_folio()/page_cache_delete() to defer setting
> folio->mapping to NULL so that we could still access it in
> kvm_gmem_free_folio() so that we can still access mapping->i_private_list
> to get the list of gmem/KVM instances and pass them on via
> kvm_arch_gmem_invalidate().

Yeah, this is what I was remembering. I obviously forgot that we didn't have a
need to iterate over all bindings at this time.

> So that's doable, but it's not clear from this discussion that that's
> needed.

Same here. And even if it is needed, it's not your problem to solve. The above
blurb about needing to preserve folio->mapping until free_folio() is sufficient
to get the ARM code moving in the right direction.

Thanks!

> If the idea to block/kill the guest if VMM tries to hole-punch,
> and ARM CCA already has plans to wire up the shared/private flags in
> kvm_unmap_gfn_range(), wouldn't that have all the information needed to
> kill that guest? At that point, kvm_gmem_free_folio() can handle
> additional per-page cleanup (with additional gmem/KVM info plumbed in
> if necessary).

On 12/03/2024 20:26, Sean Christopherson wrote: > On Mon, Mar 11, 2024, Michael Roth wrote: >> On Fri, Feb 09, 2024 at 07:13:13AM -0800, Sean Christopherson wrote: >>> On Fri, Feb 09, 2024, Steven Price wrote: >>>>>> One option that I've considered is to implement a seperate CCA ioctl to >>>>>> notify KVM whether the memory should be mapped protected. >>>>> >>>>> That's what KVM_SET_MEMORY_ATTRIBUTES+KVM_MEMORY_ATTRIBUTE_PRIVATE is for, no? >>>> >>>> Sorry, I really didn't explain that well. Yes effectively this is the >>>> attribute flag, but there's corner cases for destruction of the VM. My >>>> thought was that if the VMM wanted to tear down part of the protected >>>> range (without making it shared) then a separate ioctl would be needed >>>> to notify KVM of the unmap. >>> >>> No new uAPI should be needed, because the only scenario time a benign VMM should >>> do this is if the guest also knows the memory is being removed, in which case >>> PUNCH_HOLE will suffice. >>> >>>>>> This 'solves' the problem nicely except for the case where the VMM >>>>>> deliberately punches holes in memory which the guest is using. >>>>> >>>>> I don't see what problem there is to solve in this case. PUNCH_HOLE is destructive, >>>>> so don't do that. >>>> >>>> A well behaving VMM wouldn't PUNCH_HOLE when the guest is using it, but >>>> my concern here is a VMM which is trying to break the host. In this case >>>> either the PUNCH_HOLE needs to fail, or we actually need to recover the >>>> memory from the guest (effectively killing the guest in the process). >>> >>> The latter. IIRC, we talked about this exact case somewhere in the hour-long >>> rambling discussion on guest_memfd at PUCK[1]. And we've definitely discussed >>> this multiple times on-list, though I don't know that there is a single thread >>> that captures the entire plan. 
>>> >>> The TL;DR is that gmem will invoke an arch hook for every "struct kvm_gmem" >>> instance that's attached to a given guest_memfd inode when a page is being fully >>> removed, i.e. when a page is being freed back to the normal memory pool. Something >>> like this proposed SNP patch[2]. >>> >>> Mike, do have WIP patches you can share? >> >> Sorry, I missed this query earlier. I'm a bit confused though, I thought >> the kvm_arch_gmem_invalidate() hook provided in this patch was what we >> ended up agreeing on during the PUCK call in question. > > Heh, I trust your memory of things far more than I trust mine. I'm just proving > Cunningham's Law. :-) > >> There was an open question about what to do if a use-case came along >> where we needed to pass additional parameters to >> kvm_arch_gmem_invalidate() other than just the start/end PFN range for >> the pages being freed, but we'd determined that SNP and TDX did not >> currently need this, so I didn't have any changes planned in this >> regard. >> >> If we now have such a need, what we had proposed was to modify >> __filemap_remove_folio()/page_cache_delete() to defer setting >> folio->mapping to NULL so that we could still access it in >> kvm_gmem_free_folio() so that we can still access mapping->i_private_list >> to get the list of gmem/KVM instances and pass them on via >> kvm_arch_gmem_invalidate(). > > Yeah, this is what I was remembering. I obviously forgot that we didn't have a > need to iterate over all bindings at this time. > >> So that's doable, but it's not clear from this discussion that that's >> needed. > > Same here. And even if it is needed, it's not your problem to solve. The above > blurb about needing to preserve folio->mapping being free_folio() is sufficient > to get the ARM code moving in the right direction. > > Thanks! 
>
>> If the idea to block/kill the guest if VMM tries to hole-punch,
>> and ARM CCA already has plans to wire up the shared/private flags in
>> kvm_unmap_gfn_range(), wouldn't that have all the information needed to
>> kill that guest? At that point, kvm_gmem_free_folio() can handle
>> additional per-page cleanup (with additional gmem/KVM info plumbed in
>> if necessary).

Yes, the missing piece of the puzzle was provided by "KVM: Prepare for
handling only shared mappings in mmu_notifier events"[1] - namely the
"only_shared" flag. We don't need to actually block/kill the guest until
it attempts access to the memory which has been removed from the guest -
at that point the guest cannot continue because the security properties
have been violated (the protected memory contents have been lost) so
attempts to continue the guest will fail.

You can ignore most of my other ramblings - as long as everyone is happy
with that flag then Arm CCA should be fine. I was just looking at other
options.

Thanks,

Steve

[1] https://lore.kernel.org/lkml/20231027182217.3615211-13-seanjc@google.com/
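For readers following the "only_shared" point, here is a minimal user-space sketch of the filtering idea, not kernel code. Only the name "only_shared" comes from the message above; the `range_filter` struct, the `only_private` field and the `invalidation_applies()` helper are hypothetical names invented for illustration. The assumption being modelled is that an invalidation restricted to shared mappings leaves protected pages mapped, so a VMA change on the shared side does not rip protected memory out of the guest.

```c
#include <stdbool.h>

/*
 * Toy model, NOT kernel code. Only the "only_shared" name is taken from the
 * thread; "only_private" and the helper below are illustrative assumptions.
 */
struct range_filter {
	bool only_shared;	/* event targets shared mappings only */
	bool only_private;	/* event targets private mappings only */
};

/* Return true if an invalidation carrying this filter should zap the page. */
static bool invalidation_applies(const struct range_filter *f,
				 bool page_is_private)
{
	if (f->only_shared && page_is_private)
		return false;	/* leave protected memory mapped */
	if (f->only_private && !page_is_private)
		return false;	/* leave shared memory mapped */
	return true;		/* no filter set, or filter matches the page */
}
```

With `only_shared` set, a protected page is skipped and a shared page is zapped; with neither flag set, both are zapped.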
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 0c113f42d5c7..f1505a5fa781 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -135,6 +135,7 @@ KVM_X86_OP(complete_emulated_msr)
 KVM_X86_OP(vcpu_deliver_sipi_vector)
 KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
 KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
+KVM_X86_OP_OPTIONAL(gmem_invalidate)
 
 #undef KVM_X86_OP
 #undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 66fc89d1858f..dbec74783f48 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1754,6 +1754,7 @@ struct kvm_x86_ops {
 	unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
 
 	int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
+	void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 33a4cc33d86d..0e95c3a95e59 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13308,6 +13308,13 @@ int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord
 }
 #endif
 
+#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
+void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
+{
+	static_call_cond(kvm_x86_gmem_invalidate)(start, end);
+}
+#endif
+
 int kvm_spec_ctrl_test_value(u64 value)
 {
 	/*
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c7f82c2f1bcf..840a5be5962a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2429,4 +2429,8 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
 int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order);
 #endif
 
+#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
+void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end);
+#endif
+
 #endif
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 992cf6ed86ef..7fd1362a7ebe 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -113,3 +113,7 @@ config KVM_GENERIC_PRIVATE_MEM
 config HAVE_KVM_GMEM_PREPARE
 	bool
 	depends on KVM_PRIVATE_MEM
+
+config HAVE_KVM_GMEM_INVALIDATE
+	bool
+	depends on KVM_PRIVATE_MEM
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 72ff8b7b31d5..b4c4df259fb8 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -369,12 +369,26 @@ static int kvm_gmem_error_page(struct address_space *mapping, struct page *page)
 	return MF_DELAYED;
 }
 
+#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
+static void kvm_gmem_free_folio(struct folio *folio)
+{
+	struct page *page = folio_page(folio, 0);
+	kvm_pfn_t pfn = page_to_pfn(page);
+	int order = folio_order(folio);
+
+	kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
+}
+#endif
+
 static const struct address_space_operations kvm_gmem_aops = {
 	.dirty_folio = noop_dirty_folio,
 #ifdef CONFIG_MIGRATION
 	.migrate_folio	= kvm_gmem_migrate_folio,
 #endif
 	.error_remove_page = kvm_gmem_error_page,
+#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
+	.free_folio = kvm_gmem_free_folio,
+#endif
 };
 
 static int kvm_gmem_getattr(struct mnt_idmap *idmap, const struct path *path,
In some cases, like with SEV-SNP, guest memory needs to be updated in a
platform-specific manner before it can be safely freed back to the host.
Wire up arch-defined hooks to the .free_folio kvm_gmem_aops callback to
allow for special handling of this sort when freeing memory in response
to FALLOC_FL_PUNCH_HOLE operations and when releasing the inode, and go
ahead and define an arch-specific hook for x86 since it will be needed
for handling memory used for SEV-SNP guests.

Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  1 +
 arch/x86/kvm/x86.c                 |  7 +++++++
 include/linux/kvm_host.h           |  4 ++++
 virt/kvm/Kconfig                   |  4 ++++
 virt/kvm/guest_memfd.c             | 14 ++++++++++++++
 6 files changed, 31 insertions(+)
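
The wiring described in the commit message can be modelled in plain user-space C. This is only a sketch under stated assumptions: `pfn_t`, `struct arch_ops`, `demo_invalidate()`, `arch_gmem_invalidate()` and `free_folio_model()` are hypothetical stand-ins for `kvm_pfn_t`, `struct kvm_x86_ops`, a vendor implementation, `kvm_arch_gmem_invalidate()` and `kvm_gmem_free_folio()`. The NULL check approximates what `static_call_cond()` does for an optional op; the real kernel mechanism is different.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t pfn_t;	/* hypothetical stand-in for kvm_pfn_t */

/* Stand-in for the optional arch op; NULL means "not implemented". */
struct arch_ops {
	void (*gmem_invalidate)(pfn_t start, pfn_t end);
};

static pfn_t last_start, last_end;	/* range recorded by the demo hook */

/* Hypothetical vendor hook: records the range it was asked to clean up. */
static void demo_invalidate(pfn_t start, pfn_t end)
{
	last_start = start;
	last_end = end;
}

/* Approximates static_call_cond(): call the hook only if one is set. */
static void arch_gmem_invalidate(const struct arch_ops *ops,
				 pfn_t start, pfn_t end)
{
	if (ops->gmem_invalidate)
		ops->gmem_invalidate(start, end);
}

/* Models kvm_gmem_free_folio(): an order-N folio covers 2^N pages. */
static void free_folio_model(const struct arch_ops *ops,
			     pfn_t first_pfn, unsigned int order)
{
	arch_gmem_invalidate(ops, first_pfn, first_pfn + ((pfn_t)1 << order));
}
```

For an order-9 folio (2 MiB with 4 KiB pages) starting at PFN 0x1000, the hook receives the range [0x1000, 0x1200); the end PFN is exclusive, matching the `pfn + (1ul << order)` expression in the patch.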