Message ID: 20221202061347.1070246-1-chao.p.peng@linux.intel.com (mailing list archive)
Series: KVM: mm: fd-based approach for supporting KVM
On Fri, Dec 02, 2022, Chao Peng wrote:
> This patch series implements KVM guest private memory for confidential
> computing scenarios like Intel TDX[1]. If a TDX host accesses
> TDX-protected guest memory, a machine check can happen which can further
> crash the running host system; this is terrible for multi-tenant
> configurations. The host accesses include those from KVM userspace like
> QEMU. This series addresses KVM-userspace-induced crashes by introducing
> new mm and KVM interfaces so KVM userspace can still manage guest memory
> via a fd-based approach, but it can never access the guest memory
> content.
>
> The patch series touches both core mm and KVM code. I'd appreciate it if
> Andrew/Hugh and Paolo/Sean could review and pick up these patches. Any other
> reviews are always welcome.
> - 01: mm change, target for mm tree
> - 02-09: KVM change, target for KVM tree

A version with all of my feedback, plus reworked versions of Vishal's selftest,
is available here:

  git@github.com:sean-jc/linux.git x86/upm_base_support

It compiles and passes the selftest, but it's otherwise barely tested. There are
a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
a WIP.

As for next steps, can you (handwaving all of the TDX folks) take a look at what
I pushed and see if there's anything horrifically broken, and that it still works
for TDX?

Fuad (and pKVM folks), same ask for you with respect to pKVM. Absolutely no rush
(and I mean that).

On my side, the two things on my mind are (a) tests and (b) downstream dependencies
(SEV and TDX). For tests, I want to build a list of tests that are required for
merging so that the criteria for merging are clear, and so that if the list is large
(haven't thought much yet), the work of writing and running tests can be distributed.

Regarding downstream dependencies, before this lands, I want to pull in all the
TDX and SNP series and see how everything fits together. Specifically, I want to
make sure that we don't end up with a uAPI that necessitates ugly code, and that we
don't miss an opportunity to make things simpler. The patches in the SNP series to
add "legacy" SEV support for UPM in particular made me slightly rethink some minor
details. Nothing remotely major, but something that needs attention since it'll
be uAPI.

I'm off Monday, so it'll be at least Tuesday before I make any more progress on
my side.

Thanks!
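For anyone jumping into the thread, the expected userspace flow looks roughly
like the sketch below. This is based on the v10 uAPI: the syscall number is
made up, and the struct/field names are approximations that may change in
later revisions.

#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/kvm.h>

/* Patch 01 adds the memfd_restricted() syscall; the number is hypothetical. */
#define __NR_memfd_restricted 451

static int create_private_memslot(int vm_fd, uint64_t gpa, uint64_t size,
                                  void *shared_hva)
{
        /* The restricted fd is never mmap()able, so userspace can manage
         * guest private memory without being able to read or write it. */
        int restricted_fd = syscall(__NR_memfd_restricted, 0);

        struct kvm_userspace_memory_region_ext region = {
                .region = {
                        .slot = 0,
                        .flags = KVM_MEM_PRIVATE,
                        .guest_phys_addr = gpa,
                        .memory_size = size,
                        /* The shared half is still an ordinary mapping. */
                        .userspace_addr = (uint64_t)(uintptr_t)shared_hva,
                },
                .restricted_offset = 0,
                .restricted_fd = restricted_fd,
        };

        return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}

Per-page conversions between shared and private are then requested via the
KVM_SET_MEMORY_ATTRIBUTES ioctl with KVM_MEMORY_ATTRIBUTE_PRIVATE, which is
what the mem_attr_array xarray seen in the diffs below tracks.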
On Sat, Jan 14, 2023 at 12:37:59AM +0000, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > This patch series implements KVM guest private memory for confidential
> > computing scenarios like Intel TDX[1].
> > [...]
>
> A version with all of my feedback, plus reworked versions of Vishal's selftest,
> is available here:
>
>   git@github.com:sean-jc/linux.git x86/upm_base_support
>
> It compiles and passes the selftest, but it's otherwise barely tested. There are
> a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> a WIP.

Minor build fix (the lockdep_assert_held*() macros take a pointer to the lock):

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6eb5336ccc65..4a9e9fa2552a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7211,8 +7211,8 @@ void kvm_arch_set_memory_attributes(struct kvm *kvm,
         int level;
         bool mixed;
 
-        lockdep_assert_held_write(kvm->mmu_lock);
-        lockdep_assert_held(kvm->slots_lock);
+        lockdep_assert_held_write(&kvm->mmu_lock);
+        lockdep_assert_held(&kvm->slots_lock);
 
         /*
          * KVM x86 currently only supports KVM_MEMORY_ATTRIBUTE_PRIVATE, skip
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 467916943c73..4ef60ba7eb1d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2304,7 +2304,7 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
 {
-        lockdep_assert_held(kvm->mmu_lock);
+        lockdep_assert_held(&kvm->mmu_lock);
 
         return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
 }
On Sat, Jan 14, 2023 at 12:37:59AM +0000, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > This patch series implements KVM guest private memory for confidential
> > computing scenarios like Intel TDX[1].
> > [...]
>
> A version with all of my feedback, plus reworked versions of Vishal's selftest,
> is available here:
>
>   git@github.com:sean-jc/linux.git x86/upm_base_support
>
> It compiles and passes the selftest, but it's otherwise barely tested. There are
> a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> a WIP.

Thanks very much for doing this. Almost all of your comments are well received,
except for two cases that need more discussion, which I have replied to
individually.

> As for next steps, can you (handwaving all of the TDX folks) take a look at what
> I pushed and see if there's anything horrifically broken, and that it still works
> for TDX?

I have integrated this into my local TDX repo; with some changes (as I replied
individually), the new code basically still works with TDX. I have also asked
other TDX folks to take a look.

> Fuad (and pKVM folks), same ask for you with respect to pKVM. Absolutely no rush
> (and I mean that).
>
> On my side, the two things on my mind are (a) tests and (b) downstream dependencies
> (SEV and TDX). For tests, I want to build a list of tests that are required for
> merging so that the criteria for merging are clear, and so that if the list is large
> (haven't thought much yet), the work of writing and running tests can be distributed.
>
> Regarding downstream dependencies, before this lands, I want to pull in all the
> TDX and SNP series and see how everything fits together.
> [...]
>
> I'm off Monday, so it'll be at least Tuesday before I make any more progress on
> my side.

Appreciate your effort. As for the next steps, if you see something we can do in
parallel, feel free to let me know.

Thanks,
Chao
Hi Sean,

On Sat, Jan 14, 2023 at 12:38 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > This patch series implements KVM guest private memory for confidential
> > computing scenarios like Intel TDX[1].
> > [...]
>
> A version with all of my feedback, plus reworked versions of Vishal's selftest,
> is available here:
>
>   git@github.com:sean-jc/linux.git x86/upm_base_support
>
> It compiles and passes the selftest, but it's otherwise barely tested. There are
> a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> a WIP.
>
> As for next steps, can you (handwaving all of the TDX folks) take a look at what
> I pushed and see if there's anything horrifically broken, and that it still works
> for TDX?
>
> Fuad (and pKVM folks), same ask for you with respect to pKVM. Absolutely no rush
> (and I mean that).

Thanks for sharing this. I've had a look at the patches, and have ported them to
work with pKVM. At a high level, the new interface seems fine and it works with
the arm64/pKVM port. I have a couple of comments regarding some of the details,
but they can wait until v11 is posted.

Cheers,
/fuad

> [...]
On Sat, Jan 14, 2023 at 12:37:59AM +0000, Sean Christopherson <seanjc@google.com> wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > This patch series implements KVM guest private memory for confidential
> > computing scenarios like Intel TDX[1].
> > [...]
>
> A version with all of my feedback, plus reworked versions of Vishal's selftest,
> is available here:
>
>   git@github.com:sean-jc/linux.git x86/upm_base_support
>
> [...]
>
> As for next steps, can you (handwaving all of the TDX folks) take a look at what
> I pushed and see if there's anything horrifically broken, and that it still works
> for TDX?

Although I'm still debugging TDX KVM, I needed the following change:
kvm_faultin_pfn() is called without mmu_lock held, and the race with
private/shared conversion is handled by mmu_seq. Maybe a dedicated function
just for kvm_faultin_pfn() would be better.

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 02be5e1cba1e..38699ca75ab8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2322,7 +2322,7 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
 {
-        lockdep_assert_held(&kvm->mmu_lock);
+        // lockdep_assert_held(&kvm->mmu_lock);
 
         return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
 }
On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> On Sat, Jan 14, 2023 at 12:37:59AM +0000,
> Sean Christopherson <seanjc@google.com> wrote:
> > [...]
>
> Although I'm still debugging TDX KVM, I needed the following change:
> kvm_faultin_pfn() is called without mmu_lock held, and the race with
> private/shared conversion is handled by mmu_seq. Maybe a dedicated function
> just for kvm_faultin_pfn() would be better.

Gah, you're not on the other thread where this was discussed[*]. Simply deleting
the lockdep assertion is safe: for guest types that rely on the attributes to
define shared vs. private, KVM rechecks the attributes under the protection of
mmu_seq.

I'll get a fixed version pushed out today.

[*] https://lore.kernel.org/all/Y8gpl+LwSuSgBFks@google.com
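To illustrate, the fault path's use of mmu_seq is roughly the pattern below
(a simplified sketch, not the exact code in the branch; in the real code an
attribute mismatch exits to userspace rather than simply retrying):

static int faultin_sketch(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
        /* Snapshot the invalidation count before any lockless checks. */
        fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
        smp_rmb();

        /* Lockless read; may race with KVM_SET_MEMORY_ATTRIBUTES. */
        if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
                return RET_PF_RETRY;

        write_lock(&vcpu->kvm->mmu_lock);
        if (mmu_invalidate_retry(vcpu->kvm, fault->mmu_seq)) {
                /* An attribute change (or other invalidation) raced with
                 * the lockless portion; redo the fault from scratch. */
                write_unlock(&vcpu->kvm->mmu_lock);
                return RET_PF_RETRY;
        }

        /* The snapshot is now stable; safe to install the mapping. */
        write_unlock(&vcpu->kvm->mmu_lock);
        return RET_PF_CONTINUE;
}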
On Thu, Jan 19, 2023 at 03:25:08PM +0000, Sean Christopherson <seanjc@google.com> wrote:
> On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > Although I'm still debugging TDX KVM, I needed the following change:
> > kvm_faultin_pfn() is called without mmu_lock held, and the race with
> > private/shared conversion is handled by mmu_seq. Maybe a dedicated function
> > just for kvm_faultin_pfn() would be better.
>
> Gah, you're not on the other thread where this was discussed[*]. Simply deleting
> the lockdep assertion is safe: for guest types that rely on the attributes to
> define shared vs. private, KVM rechecks the attributes under the protection of
> mmu_seq.
>
> I'll get a fixed version pushed out today.
>
> [*] https://lore.kernel.org/all/Y8gpl+LwSuSgBFks@google.com

Now I have TDX KVM working. I've uploaded it to the following branches,
rebased to v6.2-rc3:

  git@github.com:yamahata/linux.git tdx/upm
  git@github.com:yamahata/qemu.git tdx/upm

kvm_mmu_do_page_fault() needs the following change. kvm_mem_is_private()
queries mem_attr_array, and kvm_faultin_pfn() also uses kvm_mem_is_private(),
so deriving .is_private from kvm_mem_is_private() makes the shared/private
check in kvm_faultin_pfn() meaningless; for TDX the fault type should come
from the GPA instead. This change would belong in the TDX KVM patches, though.

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 72b0da8e27e0..f45ac438bbf4 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -430,7 +430,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
                 .max_level = vcpu->kvm->arch.tdp_max_page_level,
                 .req_level = PG_LEVEL_4K,
                 .goal_level = PG_LEVEL_4K,
-                .is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
+                .is_private = kvm_is_private_gpa(vcpu->kvm, cr2_or_gpa),
         };
         int r;
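For reference, kvm_is_private_gpa() in the TDX patches looks roughly like the
sketch below (approximate; kvm_gfn_shared_mask() is the TDX patches' helper
that returns the GPA "shared" bit as a gfn mask, and returns 0 for non-TDX VMs):

static inline bool kvm_is_private_gpa(const struct kvm *kvm, gpa_t gpa)
{
        /* TDX encodes shared vs. private in the GPA itself (the "shared"
         * bit), so the fault type can be derived from the faulting address
         * without consulting mem_attr_array. */
        gfn_t shared_mask = kvm_gfn_shared_mask(kvm);

        return shared_mask && !(gpa_to_gfn(gpa) & shared_mask);
}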
On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> On Thu, Jan 19, 2023 at 03:25:08PM +0000,
> Sean Christopherson <seanjc@google.com> wrote:
> > [...]
> >
> > Gah, you're not on the other thread where this was discussed[*]. Simply deleting
> > the lockdep assertion is safe: for guest types that rely on the attributes to
> > define shared vs. private, KVM rechecks the attributes under the protection of
> > mmu_seq.
> >
> > I'll get a fixed version pushed out today.
> >
> > [*] https://lore.kernel.org/all/Y8gpl+LwSuSgBFks@google.com
>
> Now I have TDX KVM working. I've uploaded it to the following branches,
> rebased to v6.2-rc3:
>
>   git@github.com:yamahata/linux.git tdx/upm
>   git@github.com:yamahata/qemu.git tdx/upm

And I finally got a working, building version updated and pushed out (again to):

  git@github.com:sean-jc/linux.git x86/upm_base_support

Took longer than expected to get the memslot restrictions sussed out. I'm done
working on the code for now; my plan is to come back to it+TDX+SNP in 2-3 weeks
to resolve any remaining todos (that no one else tackles) and to do the whole
"merge the world" exercise.

> kvm_mmu_do_page_fault() needs the following change. kvm_mem_is_private()
> queries mem_attr_array, and kvm_faultin_pfn() also uses kvm_mem_is_private(),
> so deriving .is_private from kvm_mem_is_private() makes the shared/private
> check in kvm_faultin_pfn() meaningless. This change would belong in the TDX
> KVM patches, though.

Yeah, SNP needs similar treatment. Sorting that out is high up on the todo list.
On 14/01/2023 00:37, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
>> This patch series implements KVM guest private memory for confidential
>> computing scenarios like Intel TDX[1].
>> [...]
>
> A version with all of my feedback, plus reworked versions of Vishal's selftest,
> is available here:
>
>   git@github.com:sean-jc/linux.git x86/upm_base_support
>
> It compiles and passes the selftest, but it's otherwise barely tested. There are
> a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> a WIP.

When running LTP (https://github.com/linux-test-project/ltp) on the v10 bits
(and also with Sean's branch above) I encounter the following NULL pointer
dereference with testcases/kernel/syscalls/madvise/madvise01 (100% reproducible).

It appears that in restrictedmem_error_page(),
inode->i_mapping->private_data is NULL in the
list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) loop, but I
don't know why.

[ 5365.177168] BUG: kernel NULL pointer dereference, address: 0000000000000028
[ 5365.178881] #PF: supervisor read access in kernel mode
[ 5365.180006] #PF: error_code(0x0000) - not-present page
[ 5365.181322] PGD 8000000109dad067 P4D 8000000109dad067 PUD 107707067 PMD 0
[ 5365.183474] Oops: 0000 [#1] PREEMPT SMP PTI
[ 5365.184792] CPU: 0 PID: 22086 Comm: madvise01 Not tainted 6.1.0-1.el8.seanjcupm.x86_64 #1
[ 5365.186572] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.5.1 06/16/2021
[ 5365.188816] RIP: 0010:restrictedmem_error_page+0xc7/0x1b0
[ 5365.190081] Code: 99 00 48 8b 55 00 48 8b 02 48 8d 8a e8 fe ff ff 48 2d 18 01 00 00 48 39 d5 0f 84 8a 00 00 00 48 8b 51 30 48 8b 92 b8 00 00 00 <48> 8b 4a 28 4c 39 b1 d8 00 00 00 74 22 48 8b 88 18 01 00 00 48 8d
[ 5365.193984] RSP: 0018:ffff9b7343c07d80 EFLAGS: 00010206
[ 5365.195142] RAX: ffff8e5b410cfc70 RBX: 0000000000000001 RCX: ffff8e5b4048e580
[ 5365.196888] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 5365.198399] RBP: ffff8e5b410cfd88 R08: 0000000000000000 R09: 0000000000000000
[ 5365.200200] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 5365.201843] R13: ffff8e5b410cfd80 R14: ffff8e5b47cc7618 R15: ffffd49d44c05080
[ 5365.203472] FS:  00007fc96de9b5c0(0000) GS:ffff8e5deda00000(0000) knlGS:0000000000000000
[ 5365.205485] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5365.206791] CR2: 0000000000000028 CR3: 000000012047e002 CR4: 0000000000170ef0
[ 5365.208131] Call Trace:
[ 5365.208752]  <TASK>
[ 5365.209229]  me_pagecache_clean+0x58/0x100
[ 5365.210196]  identify_page_state+0x84/0xd0
[ 5365.211180]  memory_failure+0x231/0x8b0
[ 5365.212148]  madvise_inject_error.cold+0x8d/0xa4
[ 5365.213317]  do_madvise+0x363/0x3a0
[ 5365.214177]  __x64_sys_madvise+0x2c/0x40
[ 5365.215159]  do_syscall_64+0x3f/0xa0
[ 5365.216016]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 5365.217130] RIP: 0033:0x7fc96d8399ab
[ 5365.217953] Code: 73 01 c3 48 8b 0d dd 54 38 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 1c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ad 54 38 00 f7 d8 64 89 01 48
[ 5365.222323] RSP: 002b:00007fff62a99b18 EFLAGS: 00000206 ORIG_RAX: 000000000000001c
[ 5365.224026] RAX: ffffffffffffffda RBX: 000000000041ce00 RCX: 00007fc96d8399ab
[ 5365.225375] RDX: 0000000000000064 RSI: 000000000000a000 RDI: 00007fc96de8e000
[ 5365.226999] RBP: 00007fc96de9b540 R08: 0000000000000001 R09: 0000000000415c80
[ 5365.228641] R10: 0000000000000001 R11: 0000000000000206 R12: 0000000000000008
[ 5365.230074] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

Regards,
Liam

> [...]
On Tue, Jan 24, 2023, Liam Merwick wrote:
> On 14/01/2023 00:37, Sean Christopherson wrote:
> > [...]
>
> When running LTP (https://github.com/linux-test-project/ltp) on the v10 bits
> (and also with Sean's branch above) I encounter the following NULL pointer
> dereference with testcases/kernel/syscalls/madvise/madvise01 (100% reproducible).
>
> It appears that in restrictedmem_error_page(),
> inode->i_mapping->private_data is NULL in the
> list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) loop, but I
> don't know why.

Kirill, can you take a look? Or pass the buck to someone who can? :-)
On Wed, Jan 25, 2023 at 12:20:26AM +0000, Sean Christopherson wrote:
> On Tue, Jan 24, 2023, Liam Merwick wrote:
> > When running LTP (https://github.com/linux-test-project/ltp) on the v10 bits
> > (and also with Sean's branch above) I encounter the following NULL pointer
> > dereference with testcases/kernel/syscalls/madvise/madvise01 (100% reproducible).
> >
> > It appears that in restrictedmem_error_page(),
> > inode->i_mapping->private_data is NULL in the
> > list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) loop, but I
> > don't know why.
>
> Kirill, can you take a look? Or pass the buck to someone who can? :-)

The patch below should help; it makes the loop skip inodes whose mapping may
not be valid yet (or anymore), in particular inodes that are new or being freed.

diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
index 15c52301eeb9..39ada985c7c0 100644
--- a/mm/restrictedmem.c
+++ b/mm/restrictedmem.c
@@ -307,14 +307,29 @@ void restrictedmem_error_page(struct page *page, struct address_space *mapping)
 
         spin_lock(&sb->s_inode_list_lock);
         list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
-                struct restrictedmem *rm = inode->i_mapping->private_data;
                 struct restrictedmem_notifier *notifier;
-                struct file *memfd = rm->memfd;
+                struct restrictedmem *rm;
                 unsigned long index;
+                struct file *memfd;
 
-                if (memfd->f_mapping != mapping)
+                if (atomic_read(&inode->i_count))
                         continue;
 
+                spin_lock(&inode->i_lock);
+                if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
+                        spin_unlock(&inode->i_lock);
+                        continue;
+                }
+
+                rm = inode->i_mapping->private_data;
+                memfd = rm->memfd;
+
+                if (memfd->f_mapping != mapping) {
+                        spin_unlock(&inode->i_lock);
+                        continue;
+                }
+                spin_unlock(&inode->i_lock);
+
                 xa_for_each_range(&rm->bindings, index, notifier, start, end)
                         notifier->ops->error(notifier, start, end);
                 break;
On 25/01/2023 12:53, Kirill A. Shutemov wrote:
> On Wed, Jan 25, 2023 at 12:20:26AM +0000, Sean Christopherson wrote:
>> On Tue, Jan 24, 2023, Liam Merwick wrote:
>>> On 14/01/2023 00:37, Sean Christopherson wrote:
>>>> On Fri, Dec 02, 2022, Chao Peng wrote:

...

>>> When running LTP (https://github.com/linux-test-project/ltp) on the v10 bits
>>> (and also with Sean's branch above) I encounter the following NULL pointer
>>> dereference with testcases/kernel/syscalls/madvise/madvise01 (100% reproducible).
>>>
>>> It appears that in restrictedmem_error_page(),
>>> inode->i_mapping->private_data is NULL in the
>>> list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) loop, but I
>>> don't know why.
>>
>> Kirill, can you take a look? Or pass the buck to someone who can? :-)
>
> The patch below should help.

Thanks, this works for me.

Regards,
Liam

> [...]
On Tue, Jan 24, 2023 at 01:27:50AM +0000, Sean Christopherson <seanjc@google.com> wrote:
> On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > [...]
> >
> > Now I have TDX KVM working. I've uploaded it to the following branches,
> > rebased to v6.2-rc3:
> >
> >   git@github.com:yamahata/linux.git tdx/upm
> >   git@github.com:yamahata/qemu.git tdx/upm
>
> And I finally got a working, building version updated and pushed out (again to):
>
>   git@github.com:sean-jc/linux.git x86/upm_base_support

Ok, I rebased the TDX part onto the updated branch:

  git@github.com:yamahata/linux.git tdx/upm
  git@github.com:yamahata/qemu.git tdx/upm

Now it's v6.2-rc7 based. QEMU needs more patches to avoid registering a memory
slot for SMM.
On Tue, Jan 24, 2023 at 01:27:50AM +0000, Sean Christopherson wrote:
> On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > [...]
>
> And I finally got a working, building version updated and pushed out (again to):
>
>   git@github.com:sean-jc/linux.git x86/upm_base_support
>
> Took longer than expected to get the memslot restrictions sussed out. I'm done
> working on the code for now; my plan is to come back to it+TDX+SNP in 2-3 weeks
> to resolve any remaining todos (that no one else tackles) and to do the whole
> "merge the world" exercise.
>
> > kvm_mmu_do_page_fault() needs the following change.
> > [...]
>
> Yeah, SNP needs similar treatment. Sorting that out is high up on the todo list.

Hi Sean,

We've rebased the SEV+SNP support onto your updated UPM base support tree and
things seem to be working okay, but we needed some fixups on top of the base
support to get things working, along with one workaround for an issue that
hasn't been root-caused yet:

  https://github.com/mdroth/linux/commits/upmv10b-host-snp-v8-wip

  *stash (upm_base_support): mm: restrictedmem: Kirill's pinning implementation
  *workaround (use_base_support): mm: restrictedmem: loosen exclusivity check
  *fixup (upm_base_support): KVM: use inclusive ranges for restrictedmem binding/unbinding
  *fixup (upm_base_support): mm: restrictedmem: use inclusive ranges for issuing invalidations
  *fixup (upm_base_support): KVM: fix restrictedmem GFN range calculations
  *fixup (upm_base_support): KVM: selftests: CoCo compilation fixes

We plan to post an updated RFC for v8 soon, but also wanted to share the
staging tree in case you end up looking at the UPM integration aspects before
then.

-Mike
Hi,

On Fri, Dec 02, 2022 at 02:13:38PM +0800, Chao Peng wrote:
> This patch series implements KVM guest private memory for confidential
> computing scenarios like Intel TDX[1]. If a TDX host accesses
> TDX-protected guest memory, a machine check can happen which can further
> crash the running host system; this is terrible for multi-tenant
> configurations. The host accesses include those from KVM userspace like
> QEMU. This series addresses KVM-userspace-induced crashes by introducing
> new mm and KVM interfaces so KVM userspace can still manage guest memory
> via a fd-based approach, but it can never access the guest memory
> content.

Sorry for jumping in late.

Unless I'm missing something, hibernation will also cause a machine check when
there is TDX-protected memory in the system. When hibernation creates the
memory snapshot it essentially walks all physical pages and saves their
contents, so for TDX memory this will trigger a machine check, right?

>  Documentation/virt/kvm/api.rst         | 125 ++++++-
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  arch/x86/include/asm/kvm_host.h        |   9 +
>  arch/x86/kvm/Kconfig                   |   3 +
>  arch/x86/kvm/mmu/mmu.c                 | 205 ++++++++++-
>  arch/x86/kvm/mmu/mmu_internal.h        |  14 +-
>  arch/x86/kvm/mmu/mmutrace.h            |   1 +
>  arch/x86/kvm/mmu/tdp_mmu.c             |   2 +-
>  arch/x86/kvm/x86.c                     |  17 +-
>  include/linux/kvm_host.h               | 103 +++++-
>  include/linux/restrictedmem.h          |  71 ++++
>  include/linux/syscalls.h               |   1 +
>  include/uapi/asm-generic/unistd.h      |   5 +-
>  include/uapi/linux/kvm.h               |  53 +++
>  include/uapi/linux/magic.h             |   1 +
>  kernel/sys_ni.c                        |   3 +
>  mm/Kconfig                             |   4 +
>  mm/Makefile                            |   1 +
>  mm/memory-failure.c                    |   3 +
>  mm/restrictedmem.c                     | 318 +++++++++++++++++
>  virt/kvm/Kconfig                       |   6 +
>  virt/kvm/kvm_main.c                    | 469 +++++++++++++++++++++----
>  23 files changed, 1323 insertions(+), 93 deletions(-)
>  create mode 100644 include/linux/restrictedmem.h
>  create mode 100644 mm/restrictedmem.c
On 16.02.23 06:13, Mike Rapoport wrote:
> Hi,
>
> On Fri, Dec 02, 2022 at 02:13:38PM +0800, Chao Peng wrote:
>> This patch series implements KVM guest private memory for confidential
>> computing scenarios like Intel TDX[1].
>> [...]
>
> Sorry for jumping in late.
>
> Unless I'm missing something, hibernation will also cause a machine check when
> there is TDX-protected memory in the system. When hibernation creates the
> memory snapshot it essentially walks all physical pages and saves their
> contents, so for TDX memory this will trigger a machine check, right?

I recall bringing that up in the past (also memory accesses due to kdump and
/proc/kcore) and was told that the main focus for now is preventing
unprivileged users from crashing the system, that is, not mapping such memory
into user space (e.g., QEMU). In the long run, we'll also want to handle such
pages properly in the other events where the kernel might access them.
> Hi Sean,
>
> We've rebased the SEV+SNP support onto your updated UPM base support tree and
> things seem to be working okay, but we needed some fixups on top of the base
> support to get things working, along with one workaround for an issue that
> hasn't been root-caused yet:
>
>   https://github.com/mdroth/linux/commits/upmv10b-host-snp-v8-wip
>
>   *stash (upm_base_support): mm: restrictedmem: Kirill's pinning implementation
>   *workaround (use_base_support): mm: restrictedmem: loosen exclusivity check

What I'm seeing is that Slot#3 gets added first and then deleted. When it gets
added, Slot#0 already has the same range bound to restrictedmem, which triggers
the exclusivity check. This is exactly what the current code is checking for.

>   *fixup (upm_base_support): KVM: use inclusive ranges for restrictedmem binding/unbinding
>   *fixup (upm_base_support): mm: restrictedmem: use inclusive ranges for issuing invalidations

As many kernel APIs treat 'end' as exclusive, I would rather keep using an
exclusive 'end' for these APIs (restrictedmem_bind/restrictedmem_unbind and
the notifier callbacks) but fix it internally in restrictedmem, i.e. pass
'end - 1' everywhere the xarray API expects an inclusive 'last'/'max' (the
xarray stores and looks up inclusive ranges, so an unconverted exclusive 'end'
covers one extra index). See below for the change.

>   *fixup (upm_base_support): KVM: fix restrictedmem GFN range calculations

Subtracting slot->restrictedmem.index from start/end in
restrictedmem_get_gfn_range() is the correct fix.

>   *fixup (upm_base_support): KVM: selftests: CoCo compilation fixes
>
> We plan to post an updated RFC for v8 soon, but also wanted to share the
> staging tree in case you end up looking at the UPM integration aspects before
> then.
>
> -Mike

This is the restrictedmem fix to stop an exclusive 'end' from being stored and
checked in the xarray:

--- a/mm/restrictedmem.c
+++ b/mm/restrictedmem.c
@@ -46,12 +46,12 @@ static long restrictedmem_punch_hole(struct restrictedmem *rm, int mode,
          */
         down_read(&rm->lock);
 
-        xa_for_each_range(&rm->bindings, index, notifier, start, end)
+        xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
                 notifier->ops->invalidate_start(notifier, start, end);
 
         ret = memfd->f_op->fallocate(memfd, mode, offset, len);
 
-        xa_for_each_range(&rm->bindings, index, notifier, start, end)
+        xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
                 notifier->ops->invalidate_end(notifier, start, end);
 
         up_read(&rm->lock);
@@ -224,7 +224,7 @@ static int restricted_error_remove_page(struct address_space *mapping,
                 }
                 spin_unlock(&inode->i_lock);
 
-                xa_for_each_range(&rm->bindings, index, notifier, start, end)
+                xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
                         notifier->ops->error(notifier, start, end);
                 break;
         }
@@ -301,11 +301,12 @@ int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
                 if (exclusive != rm->exclusive)
                         goto out_unlock;
 
-                if (exclusive && xa_find(&rm->bindings, &start, end, XA_PRESENT))
+                if (exclusive &&
+                    xa_find(&rm->bindings, &start, end - 1, XA_PRESENT))
                         goto out_unlock;
         }
 
-        xa_store_range(&rm->bindings, start, end, notifier, GFP_KERNEL);
+        xa_store_range(&rm->bindings, start, end - 1, notifier, GFP_KERNEL);
         rm->exclusive = exclusive;
         ret = 0;
 out_unlock:
@@ -320,7 +321,7 @@ void restrictedmem_unbind(struct file *file, pgoff_t start, pgoff_t end,
         struct restrictedmem *rm = file->f_mapping->private_data;
 
         down_write(&rm->lock);
-        xa_store_range(&rm->bindings, start, end, NULL, GFP_KERNEL);
+        xa_store_range(&rm->bindings, start, end - 1, NULL, GFP_KERNEL);
         synchronize_rcu();
         up_write(&rm->lock);
 }
On Thu, Feb 16, 2023, David Hildenbrand wrote:
> On 16.02.23 06:13, Mike Rapoport wrote:
> > Unless I'm missing something, hibernation will also cause a machine check when
> > there is TDX-protected memory in the system. When hibernation creates the
> > memory snapshot it essentially walks all physical pages and saves their
> > contents, so for TDX memory this will trigger a machine check, right?

For hibernation specifically, I think that should be handled elsewhere, as
hibernation is simply incompatible with TDX, SNP, pKVM, etc. without
paravirtualizing the guest, as none of those technologies support auto-export
a la s390. I suspect the right approach is to disallow hibernation if KVM is
running any protected guests.

> I recall bringing that up in the past (also memory accesses due to kdump and
> /proc/kcore) and was told that the main focus for now is preventing
> unprivileged users from crashing the system, that is, not mapping such memory
> into user space (e.g., QEMU). In the long run, we'll also want to handle such
> pages properly in the other events where the kernel might access them.

Ya, unless someone strongly objects, the plan is to treat "attacks" from
privileged users as out of scope for initial support, and then iterate as
needed to fix/enable more features.

FWIW, read accesses, e.g. kdump, should be okay for TDX and SNP as they both
play nice with "bad" reads. pKVM is a different beast though, as I believe any
access to guest private memory will fault. But my understanding is that this
series would be a big step forward for pKVM, which currently doesn't have any
safeguards.
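One way to wire up such a hibernation block (purely an illustrative sketch,
not code from this series; the protected-VM counter is hypothetical) is a PM
notifier that vetoes the hibernation attempt while protected guests exist:

#include <linux/atomic.h>
#include <linux/notifier.h>
#include <linux/suspend.h>

/* Hypothetical counter, incremented/decremented on protected-VM
 * creation/destruction. */
static atomic_t kvm_nr_protected_vms = ATOMIC_INIT(0);

static int kvm_pm_notifier_call(struct notifier_block *nb,
                                unsigned long action, void *data)
{
        /* Returning NOTIFY_BAD from PM_HIBERNATION_PREPARE aborts the
         * hibernation before any memory snapshotting starts. */
        if (action == PM_HIBERNATION_PREPARE &&
            atomic_read(&kvm_nr_protected_vms))
                return NOTIFY_BAD;

        return NOTIFY_OK;
}

static struct notifier_block kvm_pm_nb = {
        .notifier_call = kvm_pm_notifier_call,
};

/* Registered once at init: register_pm_notifier(&kvm_pm_nb); */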
On Tue, Feb 21, 2023 at 08:11:35PM +0800, Chao Peng wrote: > > Hi Sean, > > > > We've rebased the SEV+SNP support onto your updated UPM base support > > tree and things seem to be working okay, but we needed some fixups on > > top of the base support get things working, along with 1 workaround > > for an issue that hasn't been root-caused yet: > > > > https://github.com/mdroth/linux/commits/upmv10b-host-snp-v8-wip > > > > *stash (upm_base_support): mm: restrictedmem: Kirill's pinning implementation > > *workaround (use_base_support): mm: restrictedmem: loosen exclusivity check > > What I'm seeing is Slot#3 gets added first and then deleted. When it's > gets added, Slot#0 already has the same range bound to restrictedmem so > trigger the exclusive check. This check is exactly the current code for. With the following change in QEMU, we no longer trigger this check: diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c index 20da121374..849b5de469 100644 --- a/hw/pci-host/q35.c +++ b/hw/pci-host/q35.c @@ -588,9 +588,9 @@ static void mch_realize(PCIDevice *d, Error **errp) memory_region_init_alias(&mch->open_high_smram, OBJECT(mch), "smram-open-high", mch->ram_memory, MCH_HOST_BRIDGE_SMRAM_C_BASE, MCH_HOST_BRIDGE_SMRAM_C_SIZE); + memory_region_set_enabled(&mch->open_high_smram, false); memory_region_add_subregion_overlap(mch->system_memory, 0xfeda0000, &mch->open_high_smram, 1); - memory_region_set_enabled(&mch->open_high_smram, false); I'm not sure if QEMU is actually doing something wrong here though or if this check is putting tighter restrictions on userspace than what was expected before. Will look into it more. > > > *fixup (upm_base_support): KVM: use inclusive ranges for restrictedmem binding/unbinding > > *fixup (upm_base_support): mm: restrictedmem: use inclusive ranges for issuing invalidations > > As many kernel APIs treat 'end' as exclusive, I would rather keep using > exclusive 'end' for these APIs(restrictedmem_bind/restrictedmem_unbind > and notifier callbacks) but fix it internally in the restrictedmem. E.g. > all the places where xarray API needs a 'last'/'max' we use 'end - 1'. > See below for the change. Yes I did feel like I was fighting the kernel a bit on that; your suggestion seems like it would be a better fit. > > > *fixup (upm_base_support): KVM: fix restrictedmem GFN range calculations > > Subtracting slot->restrictedmem.index for start/end in > restrictedmem_get_gfn_range() is the correct fix. > > > *fixup (upm_base_support): KVM: selftests: CoCo compilation fixes > > > > We plan to post an updated RFC for v8 soon, but also wanted to share > > the staging tree in case you end up looking at the UPM integration aspects > > before then. > > > > -Mike > > This is the restrictedmem fix to solve 'end' being stored and checked in xarray: Looks good. Thanks! 
-Mike > > --- a/mm/restrictedmem.c > +++ b/mm/restrictedmem.c > @@ -46,12 +46,12 @@ static long restrictedmem_punch_hole(struct restrictedmem *rm, int mode, > */ > down_read(&rm->lock); > > - xa_for_each_range(&rm->bindings, index, notifier, start, end) > + xa_for_each_range(&rm->bindings, index, notifier, start, end - 1) > notifier->ops->invalidate_start(notifier, start, end); > > ret = memfd->f_op->fallocate(memfd, mode, offset, len); > > - xa_for_each_range(&rm->bindings, index, notifier, start, end) > + xa_for_each_range(&rm->bindings, index, notifier, start, end - 1) > notifier->ops->invalidate_end(notifier, start, end); > > up_read(&rm->lock); > @@ -224,7 +224,7 @@ static int restricted_error_remove_page(struct address_space *mapping, > } > spin_unlock(&inode->i_lock); > > - xa_for_each_range(&rm->bindings, index, notifier, start, end) > + xa_for_each_range(&rm->bindings, index, notifier, start, end - 1) > notifier->ops->error(notifier, start, end); > break; > } > @@ -301,11 +301,12 @@ int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end, > if (exclusive != rm->exclusive) > goto out_unlock; > > - if (exclusive && xa_find(&rm->bindings, &start, end, XA_PRESENT)) > + if (exclusive && > + xa_find(&rm->bindings, &start, end - 1, XA_PRESENT)) > goto out_unlock; > } > > - xa_store_range(&rm->bindings, start, end, notifier, GFP_KERNEL); > + xa_store_range(&rm->bindings, start, end - 1, notifier, GFP_KERNEL); > rm->exclusive = exclusive; > ret = 0; > out_unlock: > @@ -320,7 +321,7 @@ void restrictedmem_unbind(struct file *file, pgoff_t start, pgoff_t end, > struct restrictedmem *rm = file->f_mapping->private_data; > > down_write(&rm->lock); > - xa_store_range(&rm->bindings, start, end, NULL, GFP_KERNEL); > + xa_store_range(&rm->bindings, start, end - 1, NULL, GFP_KERNEL); > synchronize_rcu(); > up_write(&rm->lock); > }
On Wed, Mar 22, 2023 at 08:27:37PM -0500, Michael Roth wrote:
> On Tue, Feb 21, 2023 at 08:11:35PM +0800, Chao Peng wrote:
> > > Hi Sean,
> > >
> > > We've rebased the SEV+SNP support onto your updated UPM base support
> > > tree and things seem to be working okay, but we needed some fixups on
> > > top of the base support get things working, along with 1 workaround
> > > for an issue that hasn't been root-caused yet:
> > >
> > > https://github.com/mdroth/linux/commits/upmv10b-host-snp-v8-wip
> > >
> > > *stash (upm_base_support): mm: restrictedmem: Kirill's pinning implementation
> > > *workaround (use_base_support): mm: restrictedmem: loosen exclusivity check
> >
> > What I'm seeing is Slot#3 gets added first and then deleted. When it's
> > gets added, Slot#0 already has the same range bound to restrictedmem so
> > trigger the exclusive check. This check is exactly the current code for.
>
> With the following change in QEMU, we no longer trigger this check:
>
> diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
> index 20da121374..849b5de469 100644
> --- a/hw/pci-host/q35.c
> +++ b/hw/pci-host/q35.c
> @@ -588,9 +588,9 @@ static void mch_realize(PCIDevice *d, Error **errp)
>      memory_region_init_alias(&mch->open_high_smram, OBJECT(mch), "smram-open-high",
>                               mch->ram_memory, MCH_HOST_BRIDGE_SMRAM_C_BASE,
>                               MCH_HOST_BRIDGE_SMRAM_C_SIZE);
> +    memory_region_set_enabled(&mch->open_high_smram, false);
>      memory_region_add_subregion_overlap(mch->system_memory, 0xfeda0000,
>                                          &mch->open_high_smram, 1);
> -    memory_region_set_enabled(&mch->open_high_smram, false);
>
> I'm not sure if QEMU is actually doing something wrong here though or if
> this check is putting tighter restrictions on userspace than what was
> expected before. Will look into it more.

I don't think the above QEMU change is acceptable upstream. It may break
functionality for 'normal' VMs. The UPM check does put a tighter restriction
on userspace: the same fd range can't be bound to more than one memslot.
SMRAM in QEMU, however, violates this restriction. The right 'fix' is
disabling SMM in QEMU for UPM usages rather than trying to work around it.
There is more discussion in the link below:

https://lore.kernel.org/all/Y8bOB7VuVIsxoMcn@google.com/

Chao
>
> > > *fixup (upm_base_support): KVM: use inclusive ranges for restrictedmem binding/unbinding
> > > *fixup (upm_base_support): mm: restrictedmem: use inclusive ranges for issuing invalidations
> >
> > As many kernel APIs treat 'end' as exclusive, I would rather keep using
> > exclusive 'end' for these APIs(restrictedmem_bind/restrictedmem_unbind
> > and notifier callbacks) but fix it internally in the restrictedmem. E.g.
> > all the places where xarray API needs a 'last'/'max' we use 'end - 1'.
> > See below for the change.
>
> Yes I did feel like I was fighting the kernel a bit on that; your
> suggestion seems like it would be a better fit.
>
> > > *fixup (upm_base_support): KVM: fix restrictedmem GFN range calculations
> >
> > Subtracting slot->restrictedmem.index for start/end in
> > restrictedmem_get_gfn_range() is the correct fix.
> >
> > > *fixup (upm_base_support): KVM: selftests: CoCo compilation fixes
> > >
> > > We plan to post an updated RFC for v8 soon, but also wanted to share
> > > the staging tree in case you end up looking at the UPM integration aspects
> > > before then.
> > >
> > > -Mike
> >
> > This is the restrictedmem fix to solve 'end' being stored and checked in xarray:
>
> Looks good.
>
> Thanks!
> > -Mike > > > > > --- a/mm/restrictedmem.c > > +++ b/mm/restrictedmem.c > > @@ -46,12 +46,12 @@ static long restrictedmem_punch_hole(struct restrictedmem *rm, int mode, > > */ > > down_read(&rm->lock); > > > > - xa_for_each_range(&rm->bindings, index, notifier, start, end) > > + xa_for_each_range(&rm->bindings, index, notifier, start, end - 1) > > notifier->ops->invalidate_start(notifier, start, end); > > > > ret = memfd->f_op->fallocate(memfd, mode, offset, len); > > > > - xa_for_each_range(&rm->bindings, index, notifier, start, end) > > + xa_for_each_range(&rm->bindings, index, notifier, start, end - 1) > > notifier->ops->invalidate_end(notifier, start, end); > > > > up_read(&rm->lock); > > @@ -224,7 +224,7 @@ static int restricted_error_remove_page(struct address_space *mapping, > > } > > spin_unlock(&inode->i_lock); > > > > - xa_for_each_range(&rm->bindings, index, notifier, start, end) > > + xa_for_each_range(&rm->bindings, index, notifier, start, end - 1) > > notifier->ops->error(notifier, start, end); > > break; > > } > > @@ -301,11 +301,12 @@ int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end, > > if (exclusive != rm->exclusive) > > goto out_unlock; > > > > - if (exclusive && xa_find(&rm->bindings, &start, end, XA_PRESENT)) > > + if (exclusive && > > + xa_find(&rm->bindings, &start, end - 1, XA_PRESENT)) > > goto out_unlock; > > } > > > > - xa_store_range(&rm->bindings, start, end, notifier, GFP_KERNEL); > > + xa_store_range(&rm->bindings, start, end - 1, notifier, GFP_KERNEL); > > rm->exclusive = exclusive; > > ret = 0; > > out_unlock: > > @@ -320,7 +321,7 @@ void restrictedmem_unbind(struct file *file, pgoff_t start, pgoff_t end, > > struct restrictedmem *rm = file->f_mapping->private_data; > > > > down_write(&rm->lock); > > - xa_store_range(&rm->bindings, start, end, NULL, GFP_KERNEL); > > + xa_store_range(&rm->bindings, start, end - 1, NULL, GFP_KERNEL); > > synchronize_rcu(); > > up_write(&rm->lock); > > }
On Wed, Mar 22, 2023, Michael Roth wrote: > On Tue, Feb 21, 2023 at 08:11:35PM +0800, Chao Peng wrote: > > > *fixup (upm_base_support): KVM: use inclusive ranges for restrictedmem binding/unbinding > > > *fixup (upm_base_support): mm: restrictedmem: use inclusive ranges for issuing invalidations > > > > As many kernel APIs treat 'end' as exclusive, I would rather keep using > > exclusive 'end' for these APIs(restrictedmem_bind/restrictedmem_unbind > > and notifier callbacks) but fix it internally in the restrictedmem. E.g. > > all the places where xarray API needs a 'last'/'max' we use 'end - 1'. > > See below for the change. > > Yes I did feel like I was fighting the kernel a bit on that; your > suggestion seems like it would be a better fit. Comically belated +1, XArray is the odd one here.
On Wed, Jan 25, 2023, Kirill A. Shutemov wrote: > On Wed, Jan 25, 2023 at 12:20:26AM +0000, Sean Christopherson wrote: > > On Tue, Jan 24, 2023, Liam Merwick wrote: > > > On 14/01/2023 00:37, Sean Christopherson wrote: > > > > On Fri, Dec 02, 2022, Chao Peng wrote: > > > > > This patch series implements KVM guest private memory for confidential > > > > > computing scenarios like Intel TDX[1]. If a TDX host accesses > > > > > TDX-protected guest memory, machine check can happen which can further > > > > > crash the running host system, this is terrible for multi-tenant > > > > > configurations. The host accesses include those from KVM userspace like > > > > > QEMU. This series addresses KVM userspace induced crash by introducing > > > > > new mm and KVM interfaces so KVM userspace can still manage guest memory > > > > > via a fd-based approach, but it can never access the guest memory > > > > > content. > > > > > > > > > > The patch series touches both core mm and KVM code. I appreciate > > > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other > > > > > reviews are always welcome. > > > > > - 01: mm change, target for mm tree > > > > > - 02-09: KVM change, target for KVM tree > > > > > > > > A version with all of my feedback, plus reworked versions of Vishal's selftest, > > > > is available here: > > > > > > > > git@github.com:sean-jc/linux.git x86/upm_base_support > > > > > > > > It compiles and passes the selftest, but it's otherwise barely tested. There are > > > > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still > > > > a WIP. > > > > > > > > > > When running LTP (https://github.com/linux-test-project/ltp) on the v10 > > > bits (and also with Sean's branch above) I encounter the following NULL > > > pointer dereference with testcases/kernel/syscalls/madvise/madvise01 > > > (100% reproducible). > > > > > > It appears that in restrictedmem_error_page() > > > inode->i_mapping->private_data is NULL in the > > > list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) but I > > > don't know why. > > > > Kirill, can you take a look? Or pass the buck to someone who can? :-) > > The patch below should help. > > diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c > index 15c52301eeb9..39ada985c7c0 100644 > --- a/mm/restrictedmem.c > +++ b/mm/restrictedmem.c > @@ -307,14 +307,29 @@ void restrictedmem_error_page(struct page *page, struct address_space *mapping) > > spin_lock(&sb->s_inode_list_lock); > list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) { > - struct restrictedmem *rm = inode->i_mapping->private_data; > struct restrictedmem_notifier *notifier; > - struct file *memfd = rm->memfd; > + struct restrictedmem *rm; > unsigned long index; > + struct file *memfd; > > - if (memfd->f_mapping != mapping) > + if (atomic_read(&inode->i_count)) Kirill, should this be if (!atomic_read(&inode->i_count)) continue; i.e. skip unreferenced inodes, not skip referenced inodes?
On Wed, Apr 12, 2023 at 06:07:28PM -0700, Sean Christopherson wrote: > On Wed, Jan 25, 2023, Kirill A. Shutemov wrote: > > On Wed, Jan 25, 2023 at 12:20:26AM +0000, Sean Christopherson wrote: > > > On Tue, Jan 24, 2023, Liam Merwick wrote: > > > > On 14/01/2023 00:37, Sean Christopherson wrote: > > > > > On Fri, Dec 02, 2022, Chao Peng wrote: > > > > > > This patch series implements KVM guest private memory for confidential > > > > > > computing scenarios like Intel TDX[1]. If a TDX host accesses > > > > > > TDX-protected guest memory, machine check can happen which can further > > > > > > crash the running host system, this is terrible for multi-tenant > > > > > > configurations. The host accesses include those from KVM userspace like > > > > > > QEMU. This series addresses KVM userspace induced crash by introducing > > > > > > new mm and KVM interfaces so KVM userspace can still manage guest memory > > > > > > via a fd-based approach, but it can never access the guest memory > > > > > > content. > > > > > > > > > > > > The patch series touches both core mm and KVM code. I appreciate > > > > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other > > > > > > reviews are always welcome. > > > > > > - 01: mm change, target for mm tree > > > > > > - 02-09: KVM change, target for KVM tree > > > > > > > > > > A version with all of my feedback, plus reworked versions of Vishal's selftest, > > > > > is available here: > > > > > > > > > > git@github.com:sean-jc/linux.git x86/upm_base_support > > > > > > > > > > It compiles and passes the selftest, but it's otherwise barely tested. There are > > > > > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still > > > > > a WIP. > > > > > > > > > > > > > When running LTP (https://github.com/linux-test-project/ltp) on the v10 > > > > bits (and also with Sean's branch above) I encounter the following NULL > > > > pointer dereference with testcases/kernel/syscalls/madvise/madvise01 > > > > (100% reproducible). > > > > > > > > It appears that in restrictedmem_error_page() > > > > inode->i_mapping->private_data is NULL in the > > > > list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) but I > > > > don't know why. > > > > > > Kirill, can you take a look? Or pass the buck to someone who can? :-) > > > > The patch below should help. > > > > diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c > > index 15c52301eeb9..39ada985c7c0 100644 > > --- a/mm/restrictedmem.c > > +++ b/mm/restrictedmem.c > > @@ -307,14 +307,29 @@ void restrictedmem_error_page(struct page *page, struct address_space *mapping) > > > > spin_lock(&sb->s_inode_list_lock); > > list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) { > > - struct restrictedmem *rm = inode->i_mapping->private_data; > > struct restrictedmem_notifier *notifier; > > - struct file *memfd = rm->memfd; > > + struct restrictedmem *rm; > > unsigned long index; > > + struct file *memfd; > > > > - if (memfd->f_mapping != mapping) > > + if (atomic_read(&inode->i_count)) > > Kirill, should this be > > if (!atomic_read(&inode->i_count)) > continue; > > i.e. skip unreferenced inodes, not skip referenced inodes? Ouch. Yes. But looking at other instances of s_inodes usage, I think we can drop the check altogether. inode cannot be completely free until it is removed from s_inodes list. While there, replace list_for_each_entry_safe() with list_for_each_entry() as we don't remove anything from the list. 
diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c index 55e99e6c09a1..8e8a4420d3d1 100644 --- a/mm/restrictedmem.c +++ b/mm/restrictedmem.c @@ -194,22 +194,19 @@ static int restricted_error_remove_page(struct address_space *mapping, struct page *page) { struct super_block *sb = restrictedmem_mnt->mnt_sb; - struct inode *inode, *next; + struct inode *inode; pgoff_t start, end; start = page->index; end = start + thp_nr_pages(page); spin_lock(&sb->s_inode_list_lock); - list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) { + list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { struct restrictedmem_notifier *notifier; struct restrictedmem *rm; unsigned long index; struct file *memfd; - if (atomic_read(&inode->i_count)) - continue; - spin_lock(&inode->i_lock); if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) { spin_unlock(&inode->i_lock);
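[For context, the i_state check that survives the cleanup is part of the standard pattern for walking sb->s_inodes. When the per-inode work can't be done under the list lock, the usual shape (as in, e.g., drop_pagecache_sb()) pins each inode before dropping the locks; a sketch of that general pattern:]

	struct inode *inode, *toput_inode = NULL;

	spin_lock(&sb->s_inode_list_lock);
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		spin_lock(&inode->i_lock);
		/* Inodes being set up or torn down are still on the list; skip them. */
		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
			spin_unlock(&inode->i_lock);
			continue;
		}
		__iget(inode);
		spin_unlock(&inode->i_lock);
		spin_unlock(&sb->s_inode_list_lock);

		/* ... operate on the inode without the list lock held ... */

		iput(toput_inode);	/* drop the previous inode's reference ... */
		toput_inode = inode;	/* ... but keep this one to hold our list position */
		spin_lock(&sb->s_inode_list_lock);
	}
	spin_unlock(&sb->s_inode_list_lock);
	iput(toput_inode);

[restrictedmem_error_page() appears to sidestep the pinning dance by doing its (brief) work with s_inode_list_lock held throughout, which is why the i_state check alone suffices there.]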
On Tue, Jan 24, 2023 at 01:27:50AM +0000, Sean Christopherson wrote: > On Thu, Jan 19, 2023, Isaku Yamahata wrote: > > On Thu, Jan 19, 2023 at 03:25:08PM +0000, > > Sean Christopherson <seanjc@google.com> wrote: > > > > > On Thu, Jan 19, 2023, Isaku Yamahata wrote: > > > > On Sat, Jan 14, 2023 at 12:37:59AM +0000, > > > > Sean Christopherson <seanjc@google.com> wrote: > > > > > > > > > On Fri, Dec 02, 2022, Chao Peng wrote: > > > > > > This patch series implements KVM guest private memory for confidential > > > > > > computing scenarios like Intel TDX[1]. If a TDX host accesses > > > > > > TDX-protected guest memory, machine check can happen which can further > > > > > > crash the running host system, this is terrible for multi-tenant > > > > > > configurations. The host accesses include those from KVM userspace like > > > > > > QEMU. This series addresses KVM userspace induced crash by introducing > > > > > > new mm and KVM interfaces so KVM userspace can still manage guest memory > > > > > > via a fd-based approach, but it can never access the guest memory > > > > > > content. > > > > > > > > > > > > The patch series touches both core mm and KVM code. I appreciate > > > > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other > > > > > > reviews are always welcome. > > > > > > - 01: mm change, target for mm tree > > > > > > - 02-09: KVM change, target for KVM tree > > > > > > > > > > A version with all of my feedback, plus reworked versions of Vishal's selftest, > > > > > is available here: > > > > > > > > > > git@github.com:sean-jc/linux.git x86/upm_base_support > > > > > > > > > > It compiles and passes the selftest, but it's otherwise barely tested. There are > > > > > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still > > > > > a WIP. > > > > > > > > > > As for next steps, can you (handwaving all of the TDX folks) take a look at what > > > > > I pushed and see if there's anything horrifically broken, and that it still works > > > > > for TDX? > > > > > > > > > > Fuad (and pKVM folks) same ask for you with respect to pKVM. Absolutely no rush > > > > > (and I mean that). > > > > > > > > > > On my side, the two things on my mind are (a) tests and (b) downstream dependencies > > > > > (SEV and TDX). For tests, I want to build a lists of tests that are required for > > > > > merging so that the criteria for merging are clear, and so that if the list is large > > > > > (haven't thought much yet), the work of writing and running tests can be distributed. > > > > > > > > > > Regarding downstream dependencies, before this lands, I want to pull in all the > > > > > TDX and SNP series and see how everything fits together. Specifically, I want to > > > > > make sure that we don't end up with a uAPI that necessitates ugly code, and that we > > > > > don't miss an opportunity to make things simpler. The patches in the SNP series to > > > > > add "legacy" SEV support for UPM in particular made me slightly rethink some minor > > > > > details. Nothing remotely major, but something that needs attention since it'll > > > > > be uAPI. > > > > > > > > Although I'm still debuging with TDX KVM, I needed the following. > > > > kvm_faultin_pfn() is called without mmu_lock held. the race to change > > > > private/shared is handled by mmu_seq. Maybe dedicated function only for > > > > kvm_faultin_pfn(). > > > > > > Gah, you're not on the other thread where this was discussed[*]. 
Simply deleting
> > > the lockdep assertion is safe, for guest types that rely on the attributes to
> > > define shared vs. private, KVM rechecks the attributes under the protection of
> > > mmu_seq.
> > >
> > > I'll get a fixed version pushed out today.
> > >
> > > [*] https://lore.kernel.org/all/Y8gpl+LwSuSgBFks@google.com
> >
> > Now I have tdx kvm working. I've uploaded at the followings.
> > It's rebased to v6.2-rc3.
> > git@github.com:yamahata/linux.git tdx/upm
> > git@github.com:yamahata/qemu.git tdx/upm
>
> And I finally got a working, building version updated and pushed out (again to):
>
> git@github.com:sean-jc/linux.git x86/upm_base_support
>
> Took longer than expected to get the memslot restrictions sussed out. I'm done
> working on the code for now, my plan is to come back to it+TDX+SNP in 2-3 weeks
> to resolves any remaining todos (that no one else tackles) and to do the whole
> "merge the world" excersise.

Hi Sean,

In case you started working on the code again, I have a branch [1]
originally planned as the v11 candidate, in which I believe I addressed
all the discussions we had for v10 except the very latest one [2], and
integrated all the newly added selftests from Ackerley and myself. The
branch was based on your original upm_base_support and then rebased to
your kvm-x86/mmu head. Feel free to take anything you think is useful
(most of the changes are trivial things, but there are also some fixes
for bugs).

[1] https://github.com/chao-p/linux/commits/privmem-v11.6
[2] https://lore.kernel.org/all/20230413160405.h6ov2yl6l3i7mvsj@box.shutemov.name/

Chao

> > kvm_mmu_do_page_fault() needs the following change.
> > kvm_mem_is_private() queries mem_attr_array. kvm_faultin_pfn() also uses
> > kvm_mem_is_private(). So the shared-private check in kvm_faultin_pfn() doesn't
> > make sense. This change would belong to TDX KVM patches, though.
>
> Yeah, SNP needs similar treatment. Sorting that out is high up on the todo list.
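[For readers not steeped in KVM MMU lore: the mmu_seq protection referred to above is KVM's standard invalidation/retry pattern. Schematically, and simplified, using the real mmu_invalidate_retry() helper; the actual x86 logic lives in kvm_faultin_pfn() and is_page_fault_stale() in arch/x86/kvm/mmu/mmu.c:]

	/* Snapshot the invalidation sequence before doing any lockless work. */
	mmu_seq = kvm->mmu_invalidate_seq;
	smp_rmb();	/* pairs with the barrier on the invalidation side */

	/* Lockless: resolve the pfn, query private vs. shared attributes, ... */

	write_lock(&kvm->mmu_lock);
	if (mmu_invalidate_retry(kvm, mmu_seq)) {
		/* An invalidation (or attribute change) raced with us; redo the fault. */
		write_unlock(&kvm->mmu_lock);
		goto retry;
	}
	/* ... safe to install the mapping ... */
	write_unlock(&kvm->mmu_lock);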
On Mon, Apr 17, 2023, Chao Peng wrote: > In case you started working on the code again, I have a branch [1] > originally planned as v11 candidate which I believe I addressed all the > discussions we had for v10 except the very latest one [2] and integrated > all the newly added selftests from Ackerley and myself. The branch was > based on your original upm_base_support and then rebased to your > kvm-x86/mmu head. Feel free to take anything you think useful( most of > them are trivial things but also some fixes for bugs). Nice! I am going to work on splicing together the various series this week, I'll make sure to grab your work. Thanks much!
What do y'all think about renaming "restrictedmem" to "guardedmem"? I want to start referring to the code/patches by its syscall/implementation name instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort and not just the non-KVM code, and (c) will likely be confusing for future reviewers since there's nothing in the code that mentions "UPM" in any way. But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is already used to refer to "reserved memory". Renaming the syscall to "guardedmem"... 1. Allows for a shorthand and namespace, "gmem", that isn't already in use by the kernel (see "reserved memory above"). 2. Provides a stronger hint as to its purpose. "Restricted" conveys that the allocated memory is limited in some way, but doesn't capture how the memory is restricted, e.g. "restricted" could just as easily mean that the allocation can be restricted to certain types of backing stores or something. "Guarded" on the other hand captures that the memory has extra defenses of some form. 3. Is shorter to type and speak. Work smart, not hard :-) 4. Isn't totally wrong for the KVM use case if someone assumes the "g" means "guest" when reading mail and whatnot. P.S. I trimmed the Cc/To substantially for this particular discussion to avoid spamming folks that don't (yet) care about this stuff with another potentially lengthy thread. Feel free to add (back) any people/lists.
On 17.04.23 17:40, Sean Christopherson wrote: > What do y'all think about renaming "restrictedmem" to "guardedmem"? Yeay, let's add more confusion :D If we're at renaming, I'd appreciate if we could find a terminology that does look/sound less horrible. > > I want to start referring to the code/patches by its syscall/implementation name > instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort > and not just the non-KVM code, and (c) will likely be confusing for future reviewers > since there's nothing in the code that mentions "UPM" in any way. > > But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is > already used to refer to "reserved memory". > > Renaming the syscall to "guardedmem"... restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ...
On Mon, Apr 17, 2023, David Hildenbrand wrote: > On 17.04.23 17:40, Sean Christopherson wrote: > > I want to start referring to the code/patches by its syscall/implementation name > > instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort > > and not just the non-KVM code, and (c) will likely be confusing for future reviewers > > since there's nothing in the code that mentions "UPM" in any way. > > > > But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is > > already used to refer to "reserved memory". > > > > Renaming the syscall to "guardedmem"... > > restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ... I'm definitely open to other suggestions, but I suspect it's going to be difficult to be more precise than something like "guarded". E.g. we discussed "unmappable" at one point, but the memory can still be mapped, just not via mmap(). And it's not just about mappings, e.g. read() and its many variants are all disallowed too, despite the kernel direct map still being live (modulo SNP requirements).
On 17.04.23 18:40, Sean Christopherson wrote:
> On Mon, Apr 17, 2023, David Hildenbrand wrote:
>> On 17.04.23 17:40, Sean Christopherson wrote:
>>> I want to start referring to the code/patches by its syscall/implementation name
>>> instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
>>> and not just the non-KVM code, and (c) will likely be confusing for future reviewers
>>> since there's nothing in the code that mentions "UPM" in any way.
>>>
>>> But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
>>> already used to refer to "reserved memory".
>>>
>>> Renaming the syscall to "guardedmem"...
>>
>> restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ...
>
> I'm definitely open to other suggestions, but I suspect it's going to be difficult
> to be more precise than something like "guarded".

Guardedmem is just as bad as restrictedmem IMHO, sorry.


Restricted: what's restricted? how does the restriction manifest? secretmem
also has its restrictions/limitations (pinning), why does that one not fall
under the same category?

Make a stranger guess what "restrictedmem" is and I can guarantee that it
has nothing to do with the concept we're introducing here.


Guarded: what's guarded? From whom? For which purpose? How does the
"guarding" manifest?

Again, make a stranger guess what "guardedmem" is and I can guarantee that
it has nothing to do with the concept we're introducing here.

If, at all, the guess might be "guarded storage" [1] on s390x, which, of
course, has nothing to do with the concept here.

(storage on s390x is just the dinosaur slang for memory)


Often, if we fail to find a good name, the concept is either unclear or not
well defined.

So what are the characteristics we want to generalize under that new name?
We want to have an fd that

(a) cannot be mapped into user space (mmap)
(b) cannot be accessed using ordinary system calls (read/write)
(c) can still be managed like other fds (fallocate, future NUMA
    policies?)
(d) can be consumed by some special entities that are allowed to
    read/write/map.

So the fd content is inaccessible using the ordinary POSIX syscalls. It's
only accessible by special entities (e.g., KVM).

Most probably I am forgetting something. But maybe that will help to find a
more expressive name. Maybe :)

> E.g. we discussed "unmappable" at one point, but the memory can still be mapped,
> just not via mmap(). And it's not just about mappings, e.g. read() and its many
> variants are all disallowed too, despite the kernel direct map still being live
> (modulo SNP requirements).

[1] https://man.archlinux.org/man/s390_guarded_storage.2.en
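[Setting the name aside, the consumer-facing shape implied by the diffs earlier in the thread is roughly the following. This is a sketch reconstructed from the callbacks and signatures visible above, not the authoritative header; 'end' is exclusive, per the convention settled earlier in the thread:]

struct restrictedmem_notifier;

struct restrictedmem_notifier_ops {
	/* bracket a truncate/hole-punch of the given range */
	void (*invalidate_start)(struct restrictedmem_notifier *notifier,
				 pgoff_t start, pgoff_t end);
	void (*invalidate_end)(struct restrictedmem_notifier *notifier,
			       pgoff_t start, pgoff_t end);
	/* memory failure (poison) hit the given range */
	void (*error)(struct restrictedmem_notifier *notifier,
		      pgoff_t start, pgoff_t end);
};

struct restrictedmem_notifier {
	const struct restrictedmem_notifier_ops *ops;
};

int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
		       struct restrictedmem_notifier *notifier, bool exclusive);
void restrictedmem_unbind(struct file *file, pgoff_t start, pgoff_t end,
			  struct restrictedmem_notifier *notifier);

[This covers David's (a)-(d): the fd itself exposes no mmap/read/write, while a special consumer (KVM) binds a notifier to a range and gets called back around invalidations and poisoning.]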
On Mon, Apr 17, 2023, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
>
> > On Mon, Apr 17, 2023, David Hildenbrand wrote:
> > > On 17.04.23 17:40, Sean Christopherson wrote:
> > > > I want to start referring to the code/patches by its
> > > syscall/implementation name
> > > > instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to
> > > the broader effort
> > > > and not just the non-KVM code, and (c) will likely be confusing
> > > for future reviewers
> > > > since there's nothing in the code that mentions "UPM" in any way.
> > > >
> > > > But typing out restrictedmem is quite tedious, and git grep shows
> > > that "rmem" is

Your mail client appears to be wrapping too aggressively and mangling quotes.
I'm guessing gmail is to blame?

> > > > already used to refer to "reserved memory".
> > > >
> > > > Renaming the syscall to "guardedmem"...
>
> > > restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask
> > > me ...
>
> > I'm definitely open to other suggestions, but I suspect it's going to be
> > difficult
> > to be more precise than something like "guarded".
>
> > E.g. we discussed "unmappable" at one point, but the memory can still be
> > mapped,
> > just not via mmap(). And it's not just about mappings, e.g. read() and
> > its many
> > variants are all disallowed too, despite the kernel direct map still
> > being live
> > (modulo SNP requirements).
>
> I'm for renaming the concept because restrictedmem is quite a
> mouthful. :)
>
> How about "concealedmem" or "obscuredmem" to highlight the idea of this
> memory being hidden/unreadable/unmappable from userspace?

I'm hesitant to use something like "concealed" because it's too close to
secretmem, e.g. might be misconstrued as concealing the memory from anything
_but_ the process that creates the file. Obscured has similar problems, and
obscure often suggests that something is unclear, as opposed to outright
unreachable.

The other aspect of hidden/concealed/etc. is that the memory isn't necessarily
concealed from the user. Long term, I hope to get to the point where even
"normal" VMs use restricted/guarded/??? memory, e.g. to guard (heh) against
_unintentional_ access from userspace. In that use case, the memory isn't truly
concealed, especially if the user is both the "admin" and the consumer. Though
by that argument, "guarded" is also a poor choice. And now I'm remembering how
we ended up with "restricted"...

> Guarded is better than restricted but doesn't really highlight how/in
> what way it is being guarded.

Ya, though in practice I think it's infeasible for us to get a name that is
both precise and succinct.
On Mon, Apr 17, 2023, David Hildenbrand wrote:
> On 17.04.23 18:40, Sean Christopherson wrote:
> > On Mon, Apr 17, 2023, David Hildenbrand wrote:
> > > On 17.04.23 17:40, Sean Christopherson wrote:
> > > > I want to start referring to the code/patches by its syscall/implementation name
> > > > instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
> > > > and not just the non-KVM code, and (c) will likely be confusing for future reviewers
> > > > since there's nothing in the code that mentions "UPM" in any way.
> > > >
> > > > But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
> > > > already used to refer to "reserved memory".
> > > >
> > > > Renaming the syscall to "guardedmem"...
> > >
> > > restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ...
> >
> > I'm definitely open to other suggestions, but I suspect it's going to be difficult
> > to be more precise than something like "guarded".
>
> Guardedmem is just as bad as restrictedmem IMHO, sorry.
>
>
> Restricted: what's restricted? how does the restriction manifest? secretmem
> also has it's restrictions/limitations (pinning), why does that one not fall
> under the same category?
>
> Make a stranger guess what "restrictedmem" is and I can guarantee that it
> has nothing to do with the concept we're introducing here.
>
>
> Guarded: what's guarded? From whom? For which purpose? How does the
> "guarding" manifest?

I completely agree that "guarded" lacks precision, but as above, I've pretty much
given up hope of being super precise. I actually like "restricted", I just don't
like that I can't shorten the name.

Hmm, maybe that won't be a huge problem in practice. I can't say I've ever heard
anyone use "rmem" in verbal or written communication, it's primarily just "rmem"
in code that we can't use, and I don't mind having to use restrictedmem for the
namespace. So maybe we can use "rmem", just not in code?

Or, we could pretend we're pirates and call it arrrmem!, which is definitely going
to be how I refer to it in my internal dialogue if we keep "restricted" :-)

> Again, make a stranger guess what "guardedmem" is and I can guarantee that
> it has nothing to do with the concept we're introducing here.
>
> If, at all, the guess might be "guarded storage" [1] on s390x, which, of
> course, has nothing to do with the concept here.

Oof, and guarded storage is even documented in Documentation/virt/kvm/api.rst.

> (storage on s390x is just the dinosaur slang for memory)
>
>
> Often, if we fail to find a good name, the concept is either unclear or not
> well defined.
>
> So what are the characteristics we want to generalize under that new name?
> We want to have an fd, that
>
> (a) cannot be mapped into user space (mmap)
> (b) cannot be accessed using ordinary system calls (read/write)
> (c) can still be managed like other fds (fallocate, future NUMA
>     policies?)
> (d) can be consumed by some special entities that are allowed to
>     read/write/map.
>
> So the fd content is inaccessible using the ordinary POSIX syscalls. It's
> only accessible by special entities (e.g., KVM).
>
> Most probably I am forgetting something. But maybe that will help to find a
> more expressive name. Maybe :)

Hidden/Concealed/etc - Too close to secretmem, suffers the "hidden from whom" problem,
and depending on the use case, the memory may not actually be concealed from the
user that controls the VMM.

Restricted - "rmem" collides with "reserved memory" in code.
Guarded - Conflicts with s390's "guarded storage", has the "from whom" problem. Inaccessible - Many of the same problems as "hidden". Unmappable - Doesn't cover things like read/write, and is wrong in the sense that the memory is still mappable, just not via mmap(). Secured - I'm not getting anywhere near this one :-)
On Mon, Apr 17, 2023 at 8:16 PM Sean Christopherson <seanjc@google.com> wrote: .... > > So the fd content is inaccessible using the ordinary POSIX syscalls. It's > > only accessible by special entities (e.g., KVM). > > > > Most probably I am forgetting something. But maybe that will help to find a > > more expressive name. Maybe :) > > Hidden/Concealed/etc - Too close to secretmem, suffers the "hidden from whom" problem, > and depending on the use case, the memory may not actually be concealed from the > user that controls the VMM. > > Restricted - "rmem" collides with "reserved memory" in code. > > Guarded - Conflicts with s390's "guarded storage", has the "from whom" problem. > > Inaccessible - Many of the same problems as "hidden". > > Unmappable - Doesn't cover things like read/write, and is wrong in the sense that > the memory is still mappable, just not via mmap(). > > Secured - I'm not getting anywhere near this one :-) How about "protected" ;)? _ducks_ To me the name doesn't matter much, but fwiw I have developed a liking to "restricted", more than the previous "private", since of all of the one-word suggestions I think it captures most of what it's trying to do. Cheers, /fuad
On 17.04.23 21:16, Sean Christopherson wrote: > On Mon, Apr 17, 2023, David Hildenbrand wrote: >> On 17.04.23 18:40, Sean Christopherson wrote: >>> On Mon, Apr 17, 2023, David Hildenbrand wrote: >>>> On 17.04.23 17:40, Sean Christopherson wrote: >>>>> I want to start referring to the code/patches by its syscall/implementation name >>>>> instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort >>>>> and not just the non-KVM code, and (c) will likely be confusing for future reviewers >>>>> since there's nothing in the code that mentions "UPM" in any way. >>>>> >>>>> But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is >>>>> already used to refer to "reserved memory". >>>>> >>>>> Renaming the syscall to "guardedmem"... >>>> >>>> restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ... >>> >>> I'm definitely open to other suggestions, but I suspect it's going to be difficult >>> to be more precise than something like "guarded". >> >> Guardedmem is just as bad as restrictedmem IMHO, sorry. >> >> >> Restricted: what's restricted? how does the restriction manifest? secretmem >> also has it's restrictions/limitations (pinning), why does that one not fall >> under the same category? >> >> Make a stranger guess what "restrictedmem" is and I can guarantee that it >> has nothing to do with the concept we're introducing here. >> >> >> Guarded: what's guarded? From whom? For which purpose? How does the >> "guarding" manifest? > > I completely agree that "guarded" lacks precision, but as above, I've pretty much > given up hope of being super precise. I actually like "restricted", I just don't > like that I can't shorten the name. > > Hmm, maybe that won't be a huge problem in practice. I can't say I've ever heard > any use "rmem" in verbale or written communication, it's primarily just "rmem" in > code that we can't use, and I don't mind having to use restrictedmem for the namespace. > So maybe we can use "rmem", just not in code? > > Or, we could pretend we're pirates and call it arrrmem!, which is definitely going > to be how I refer to it in my internal dialogue if we keep "restricted" :-) :) > >> Again, make a stranger guess what "guardedmem" is and I can guarantee that >> it has nothing to do with the concept we're introducing here. >> >> If, at all, the guess might be "guarded storage" [1] on s390x, which, of >> course, has nothing to do with the concept here. > > Oof, and guarded storage is even documented in Documentation/virt/kvm/api.rst. > >> (storage on s390x is just the dinosaur slang for memory) >> >> >> Often, if we fail to find a good name, the concept is either unclear or not >> well defined. >> >> So what are the characteristics we want to generalize under that new name? >> We want to have an fd, that >> >> (a) cannot be mapped into user space (mmap) >> (b) cannot be accessed using ordinary system calls (read/write) >> (c) can still be managed like other fds (fallocate, future NUMA >> policies?) >> (d) can be consumed by some special entities that are allowed to >> read/write/map. >> >> So the fd content is inaccessible using the ordinary POSIX syscalls. It's >> only accessible by special entities (e.g., KVM). >> >> Most probably I am forgetting something. But maybe that will help to find a >> more expressive name. 
>> more expressive name. Maybe :)
>
> Hidden/Concealed/etc - Too close to secretmem, suffers the "hidden from whom" problem,
> and depending on the use case, the memory may not actually be concealed from the
> user that controls the VMM.
>
> Restricted - "rmem" collides with "reserved memory" in code.
>
> Guarded - Conflicts with s390's "guarded storage", has the "from whom" problem.
>
> Inaccessible - Many of the same problems as "hidden".
>
> Unmappable - Doesn't cover things like read/write, and is wrong in the sense that
> the memory is still mappable, just not via mmap().
>
> Secured - I'm not getting anywhere near this one :-)

The thing about "secretmem" that I kind-of like (a little) is that it
explains what it's used for: storing secrets. We don't call it "unmapped"
memory because we unmap it from the directmap, or "unpinnable" memory, or
"inaccessible" memory ... or even "restricted" because it has restrictions
... how the secrets are protected is kind of an implementation detail.

So instead of describing *why*/*how* restrictedmem is the weird kid
(restricted/guarded/hidden/restricted/inaccessible/ ...), maybe rather
describe what it's used for?

I know, I know, "there are other use cases where it will be used outside of
VM context". I really don't care.

"memfd_vm" / "vm_mem" would be sooo (feel free to add some more o's here)
much easier to get. It's a special fd to be used to back VM memory. Depending
on the VM type (encrypted/protected/whatever), restrictions might apply (not
able to mmap, not able to read/write ...). For example, there really is no
need to disallow mmap/read/write when using that memory to back a simple VM
where all we want to do is avoid user-space page tables.
On Tue, Apr 18, 2023, David Hildenbrand wrote: > On 17.04.23 21:16, Sean Christopherson wrote: > > Hidden/Concealed/etc - Too close to secretmem, suffers the "hidden from whom" problem, > > and depending on the use case, the memory may not actually be concealed from the > > user that controls the VMM. > > > > Restricted - "rmem" collides with "reserved memory" in code. > > > > Guarded - Conflicts with s390's "guarded storage", has the "from whom" problem. > > > > Inaccessible - Many of the same problems as "hidden". > > > > Unmappable - Doesn't cover things like read/write, and is wrong in the sense that > > the memory is still mappable, just not via mmap(). > > > > Secured - I'm not getting anywhere near this one :-) > > The think about "secretmem" that I kind-of like (a little) is that it > explains what it's used for: storing secrets. We don't call it "unmapped" > memory because we unmap it from the directmap or "unpinnable" memory or > "inaccessible" memory ... or even "restricted" because it has restrictions > ... how the secrets are protected is kind of an implementation detail. > > So instead of describing *why*/*how* restrictedmem is the weird kid > (restricted/guarded/hidden/restricted/inaccessible/ ...), maybe rather > describe what it's used for? > > I know, I know, "there are other use cases where it will be used outside of > VM context". I really don't care. Heh, we originally proposed F_SEAL_GUEST, but that was also sub-optimal[1] ;-) > "memfd_vm" / "vm_mem" would be sooo (feel free to add some more o's here) > much easier to get. It's a special fd to be used to back VM memory. Depending > on the VM type (encrypted/protected/whatever), restrictions might apply (not > able to mmap, not able to read/write ...). For example, there really is no > need to disallow mmap/read/write when using that memory to back a simple VM > where all we want to do is avoid user-space page tables. In seriousness, I do agree with Jason's very explicit objection[2] against naming a non-KVM uAPI "guest", or any variation thereof. An alternative that we haven't considered since the very early days is making the uAPI a KVM ioctl() instead of a memfd() flag or dedicated syscall. Looking at the code for "pure shim" implementation[3], that's actually not that crazy of an idea. I don't know that I love the idea of burying this in KVM, but there are benefits to coupling restrictedmem to KVM (aside from getting out from behind this bikeshed that I created). The big benefit is that the layer of indirection goes away. That simplifies things like enhancing restrictedmem to allow host userspace access for debug purposes, batching TLB flushes if a PUNCH_HOLE spans multiple memslots, enforcing exclusive access, likely the whole "share with a device" story if/when we get there, etc. The obvious downsides are that (a) maintenance falls under the KVM umbrella, but that's likely to be true in practice regardless of where the code lands, and (b) if another use case comes along, e.g. the Gunyah hypervisor[4][5], we risk someone reinventing a similar solution. If we can get Gunyah on board and they don't need substantial changes to the restrictedmem implementation, then I'm all for continuing on the path we're on. But if Gunyah wants to do their own thing, and the lightweight shim approach is viable, then it's awfully tempting to put this all behind a KVM ioctl(). 
[1] https://lore.kernel.org/all/df11d753-6242-8f7c-cb04-c095f68b41fa@redhat.com [2] https://lore.kernel.org/all/20211123171723.GD5112@ziepe.ca [3] https://lore.kernel.org/all/ZDiCG%2F7OgDI0SwMR@google.com [4] https://lore.kernel.org/all/Y%2FkI66qQFJJ6bkTq@google.com [5] https://lore.kernel.org/all/20230304010632.2127470-13-quic_eberman@quicinc.com
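[To make the ioctl() option concrete: the uAPI could plausibly be an fd-returning VM ioctl along the lines below. This is an entirely hypothetical sketch; the name, struct layout, and ioctl number are illustrative only, nothing here is from the series:]

/* Hypothetical: create a guest-private memory fd bound to this VM. */
struct kvm_create_guest_memfd {
	__u64 size;
	__u64 flags;
	__u64 reserved[6];
};

#define KVM_CREATE_GUEST_MEMFD	_IOWR(KVMIO, 0xd4, struct kvm_create_guest_memfd)

[Userspace would then do something like:]

	int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);
	struct kvm_create_guest_memfd gmem = { .size = mem_size };
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

[Binding the fd to the VM at creation time is what would buy KVM the extra control discussed below: immutable size, per-VM exclusivity, KVM-defined hugepage flags, and so on.]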
On 19.04.23 02:47, Sean Christopherson wrote: > On Tue, Apr 18, 2023, David Hildenbrand wrote: >> On 17.04.23 21:16, Sean Christopherson wrote: >>> Hidden/Concealed/etc - Too close to secretmem, suffers the "hidden from whom" problem, >>> and depending on the use case, the memory may not actually be concealed from the >>> user that controls the VMM. >>> >>> Restricted - "rmem" collides with "reserved memory" in code. >>> >>> Guarded - Conflicts with s390's "guarded storage", has the "from whom" problem. >>> >>> Inaccessible - Many of the same problems as "hidden". >>> >>> Unmappable - Doesn't cover things like read/write, and is wrong in the sense that >>> the memory is still mappable, just not via mmap(). >>> >>> Secured - I'm not getting anywhere near this one :-) >> >> The think about "secretmem" that I kind-of like (a little) is that it >> explains what it's used for: storing secrets. We don't call it "unmapped" >> memory because we unmap it from the directmap or "unpinnable" memory or >> "inaccessible" memory ... or even "restricted" because it has restrictions >> ... how the secrets are protected is kind of an implementation detail. >> >> So instead of describing *why*/*how* restrictedmem is the weird kid >> (restricted/guarded/hidden/restricted/inaccessible/ ...), maybe rather >> describe what it's used for? >> >> I know, I know, "there are other use cases where it will be used outside of >> VM context". I really don't care. > > Heh, we originally proposed F_SEAL_GUEST, but that was also sub-optimal[1] ;-) > >> "memfd_vm" / "vm_mem" would be sooo (feel free to add some more o's here) >> much easier to get. It's a special fd to be used to back VM memory. Depending >> on the VM type (encrypted/protected/whatever), restrictions might apply (not >> able to mmap, not able to read/write ...). For example, there really is no >> need to disallow mmap/read/write when using that memory to back a simple VM >> where all we want to do is avoid user-space page tables. > > In seriousness, I do agree with Jason's very explicit objection[2] against naming > a non-KVM uAPI "guest", or any variation thereof. While I agree, it's all better than the naming we use right now ... Let me throw "tee_mem" / "memfd_tee" into the picture. That could eventually catch what we want to have. Or "coco_mem" / "memfd_coco". Of course, both expect that people know the terminology (just like what "vm" stands for), but it's IMHO significantly better than restricted/guarded/opaque/whatsoever. Again, expresses what it's used for, not why it behaves in weird ways. > > An alternative that we haven't considered since the very early days is making the > uAPI a KVM ioctl() instead of a memfd() flag or dedicated syscall. Looking at the > code for "pure shim" implementation[3], that's actually not that crazy of an idea. Yes. > > I don't know that I love the idea of burying this in KVM, but there are benefits > to coupling restrictedmem to KVM (aside from getting out from behind this bikeshed > that I created). Yes, it's all better than jumping through hoops to come up with a bad name like "restrictedmem". > > The big benefit is that the layer of indirection goes away. That simplifies things > like enhancing restrictedmem to allow host userspace access for debug purposes, > batching TLB flushes if a PUNCH_HOLE spans multiple memslots, enforcing exclusive > access, likely the whole "share with a device" story if/when we get there, etc. 
> > The obvious downsides are that (a) maintenance falls under the KVM umbrella, but > that's likely to be true in practice regardless of where the code lands, and Yes. > (b) if another use case comes along, e.g. the Gunyah hypervisor[4][5], we risk > someone reinventing a similar solution. I agree. But if it's as simple as providing an ioctl for that hypervisor that simply wires up the existing implementation, it's not too bad. > > If we can get Gunyah on board and they don't need substantial changes to the > restrictedmem implementation, then I'm all for continuing on the path we're on. > But if Gunyah wants to do their own thing, and the lightweight shim approach is > viable, then it's awfully tempting to put this all behind a KVM ioctl(). Right. Or we still succeed in finding a name that's not as bad as what we had so far.
On Wed, Apr 19, 2023, David Hildenbrand wrote: > On 19.04.23 02:47, Sean Christopherson wrote: > > On Tue, Apr 18, 2023, David Hildenbrand wrote: > > > "memfd_vm" / "vm_mem" would be sooo (feel free to add some more o's here) > > > much easier to get. It's a special fd to be used to back VM memory. Depending > > > on the VM type (encrypted/protected/whatever), restrictions might apply (not > > > able to mmap, not able to read/write ...). For example, there really is no > > > need to disallow mmap/read/write when using that memory to back a simple VM > > > where all we want to do is avoid user-space page tables. > > > > In seriousness, I do agree with Jason's very explicit objection[2] against naming > > a non-KVM uAPI "guest", or any variation thereof. > > While I agree, it's all better than the naming we use right now ... > > > Let me throw "tee_mem" / "memfd_tee" into the picture. That could eventually > catch what we want to have. > > Or "coco_mem" / "memfd_coco". > > Of course, both expect that people know the terminology (just like what "vm" > stands for), but it's IMHO significantly better than > restricted/guarded/opaque/whatsoever. > > Again, expresses what it's used for, not why it behaves in weird ways. I don't want to explicitly tie this to trusted execution or confidential compute, as there is value in backing "normal" guests with memory that cannot be accessed by the host userspace without jumping through a few extra hoops, e.g. to add a layer of protection against data corruption due to host userspace bugs. > > (b) if another use case comes along, e.g. the Gunyah hypervisor[4][5], we risk > > someone reinventing a similar solution. > > I agree. But if it's as simple as providing an ioctl for that hypervisor > that simply wires up the existing implementation, it's not too bad. Yeah, my mind was wandering in this direction too. The absolute worst case scenario seems to be that we do end up creating a generic syscall that is a superset of KVM's functionality, in which case KVM would end up with an ioctl() that is just a redirect/wrapper.
On 19.04.23 17:17, Sean Christopherson wrote: > On Wed, Apr 19, 2023, David Hildenbrand wrote: >> On 19.04.23 02:47, Sean Christopherson wrote: >>> On Tue, Apr 18, 2023, David Hildenbrand wrote: >>>> "memfd_vm" / "vm_mem" would be sooo (feel free to add some more o's here) >>>> much easier to get. It's a special fd to be used to back VM memory. Depending >>>> on the VM type (encrypted/protected/whatever), restrictions might apply (not >>>> able to mmap, not able to read/write ...). For example, there really is no >>>> need to disallow mmap/read/write when using that memory to back a simple VM >>>> where all we want to do is avoid user-space page tables. >>> >>> In seriousness, I do agree with Jason's very explicit objection[2] against naming >>> a non-KVM uAPI "guest", or any variation thereof. >> >> While I agree, it's all better than the naming we use right now ... >> >> >> Let me throw "tee_mem" / "memfd_tee" into the picture. That could eventually >> catch what we want to have. >> >> Or "coco_mem" / "memfd_coco". >> >> Of course, both expect that people know the terminology (just like what "vm" >> stands for), but it's IMHO significantly better than >> restricted/guarded/opaque/whatsoever. >> >> Again, expresses what it's used for, not why it behaves in weird ways. > > I don't want to explicitly tie this to trusted execution or confidential compute, > as there is value in backing "normal" guests with memory that cannot be accessed > by the host userspace without jumping through a few extra hoops, e.g. to add a > layer of protection against data corruption due to host userspace bugs. Nothing speaks against using tee_mem for the same purpose I guess. I like the sound of it after all. :)
+Christian and Hugh On Wed, Apr 19, 2023, David Hildenbrand wrote: > On 19.04.23 02:47, Sean Christopherson wrote: > > An alternative that we haven't considered since the very early days is making the > > uAPI a KVM ioctl() instead of a memfd() flag or dedicated syscall. Looking at the > > code for "pure shim" implementation[3], that's actually not that crazy of an idea. > > Yes. > > > > > I don't know that I love the idea of burying this in KVM, but there are benefits > > to coupling restrictedmem to KVM (aside from getting out from behind this bikeshed > > that I created). > > Yes, it's all better than jumping through hoops to come up with a bad name > like "restrictedmem". > > > > > The big benefit is that the layer of indirection goes away. That simplifies things > > like enhancing restrictedmem to allow host userspace access for debug purposes, > > batching TLB flushes if a PUNCH_HOLE spans multiple memslots, enforcing exclusive > > access, likely the whole "share with a device" story if/when we get there, etc. > > > > The obvious downsides are that (a) maintenance falls under the KVM umbrella, but > > that's likely to be true in practice regardless of where the code lands, and > > Yes. > > > (b) if another use case comes along, e.g. the Gunyah hypervisor[4][5], we risk > > someone reinventing a similar solution. > > I agree. But if it's as simple as providing an ioctl for that hypervisor > that simply wires up the existing implementation, it's not too bad. > > > > > If we can get Gunyah on board and they don't need substantial changes to the > > restrictedmem implementation, then I'm all for continuing on the path we're on. > > But if Gunyah wants to do their own thing, and the lightweight shim approach is > > viable, then it's awfully tempting to put this all behind a KVM ioctl(). > > Right. Or we still succeed in finding a name that's not as bad as what we > had so far. Okie dokie. So as not to bury the lede: I think we should provide a KVM ioctl() and have KVM provide its own low-level implementation, i.e. not wrap shmem. As much as I don't want to have KVM serving up memory to userspace, by trying to keep KVM out of the memory management business, we're actually making things *more* complex and harder to maintain (and merge). Hugh said something to this effect quite early on[1], it just unfortunately took me a long time to understand the benefits. : QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps : the fallocate syscall interface itself) to allocate and free the memory, : ioctl for initializing some of it too. KVM in control of whether that : fd can be read or written or mmap'ed or whatever, no need to prevent it : in shmem.c, no need for flags, seals, notifications to and fro because : KVM is already in control and knows the history. If shmem actually has : value, call into it underneath - somewhat like SysV SHM, and /dev/zero : mmap, and i915/gem make use of it underneath. If shmem has nothing to : add, just allocate and free kernel memory directly, recorded in your : own xarray. Christian also suggested that we stop trying to be lazy and create a proper API[2]. : I find this hard to navigate tbh and it feels like taking a shortcut to : avoid building a proper api. If you only care about a specific set of : operations specific to memfd restricte that needs to be available to : in-kernel consumers, I wonder if you shouldn't just go one step further : then your proposal below and build a dedicated minimal ops api. 
Idk, : sketching like a madman on a drawning board here with no claim to : feasibility from a mm perspective whatsoever The key point from Hugh is that, except for a few minor things that are trivially easy to replicate, the things that make shmem "shmem" don't provide any value for the KVM use case: - We have no plans to support swap, and migration support is dubious at best. Swap in particular brings in a lot of complexity for no benefit (to us). That complexity doesn't mean depending on shmem is inherently bad, but it does mean rolling our own implementation is highly unlikely to result in reinventing shmem's wheel. - Sharing a backing file between arbitrary processes is _unwanted_ for the initial use cases. There may come a time when mutually trusted VMs can share "private" data, but (a) that's a distant future problem and (b) it'll likely require even more KVM control over the memory. - All of the interfaces for read/write/mmap/etc. are dead weight from our perspective. Even worse, we have to actively avoid those interfaces. That can kinda sorta be done without polluting shmem code by using a shim, but that has problems of its own (see below). And Christian pointed out several flaws with wrapping shmem: - Implementing a partial overlay filesystem leads to inconsistencies because only some of the ops are changed, e.g. poking at the inode_operations or super_operations will show shmem stuff, whereas address_space_operations and file_operations will show restrictedmem stuff. - Usurping just f_ops and a_ops without creating a full blown filesystem avoids the partial overlay issues, but partially overriding ops isn't really supported (because it's weird and brittle), e.g. blindly forwarding a fallocate() after restrictedmem does its thing "works", but only because we very carefully and deliberately designed restrictedmem on top of shmem. On top of the points raised by Hugh and Christian, wrapping shmem isn't really any simpler, just different. E.g. there's a very subtle bug in the shim variant: by passing SGP_WRITE, shmem skips zeroing the page because restrictedmem is telling shmem that it (restrictedmem) will immediately write the page. For TDX and SNP, that's a "feature" because in those cases trusted firmware will zero (or init) memory when it's assigned to the guest, but it's a nasty flaw for other use cases. I'm not saying that we'll magically avoid such bugs by avoiding shmem, just pointing out that using shmem requires understanding exactly how shmem works, i.e. using shmem isn't necessarily any easier than building directly on filemap and/or folio APIs. And I gotta imagine it will be a similar story if/when we want to add hugetlbfs support. Building on filemap/folio will definitely have its own challenges, but after prototyping an implementation I can confidently say that none of the problems will be insurmountable. KVM has _way_ more complex code in its memslot and MMU implementations. And another benefit of building on filemap and/or folio APIs is that, because we would be reinventing the wheel to some extent, when we do inevitably run into problems, it will be easier to get help solving those problems because (a) we won't be doing weird things no one wants to deal with and (b) because the problems will likely be things others have already dealt with.
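To make the filemap option concrete, below is a rough sketch of what grabbing a zeroed page via the filemap APIs could look like. To be clear, this is illustrative only, not code lifted from the POC branch, and kvm_gmem_get_folio() is just a convenient name; the point is that the explicit zeroing the shmem shim lost to SGP_WRITE is a two-line loop here.

  static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
  {
          struct folio *folio;

          /*
           * Allocates a new folio and adds it to the page cache if nothing
           * is present at @index; the folio is returned locked.
           */
          folio = filemap_grab_folio(inode->i_mapping, index);
          if (!folio)
                  return NULL;

          /*
           * Unlike the shmem shim with SGP_WRITE, zero new pages explicitly
           * so that non-TDX/SNP use cases never see stale data.
           */
          if (!folio_test_uptodate(folio)) {
                  unsigned long nr_pages = folio_nr_pages(folio);
                  unsigned long i;

                  for (i = 0; i < nr_pages; i++)
                          clear_highpage(folio_page(folio, i));

                  folio_mark_uptodate(folio);
          }

          return folio;
  }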
The other angle I've been looking at is whether or not having KVM provide its own implementation will lead to maintenance problems in the future, specifically if we get to the point where we want to support "fancy" things like swap and migration. For migration, I'm quite confident that a dedicated KVM ioctl() versus wrapping shmem would be at worst a wash, and more than likely simpler if KVM owns everything. E.g. migrating pages under TDX and SNP requires invoking magic instructions, and so we'd be overriding ->migrate_folio() no matter what. As for swap, I think we should put a stake in the ground and say that KVM will never support swap for memory allocated via KVM's ioctl(). Sooo much of the infrastructure around swap/reclaim is tied to userspace mappings, e.g. knowing which pages are LRU and/or cold. I poked around a bit to see how we could avoid reinventing all of that infrastructure for fd-only memory, and the best idea I could come up with is basically a rehash of Kirill's very original "KVM protected memory" RFC[3], i.e. allow "mapping" fd-only memory, but ensure that memory is never actually present from hardware's perspective. But on top of the various problems with that approach, the only use cases I can think of for using fd-only to back non-confidential VMs are to guard against spurious writes/reads to guest memory and/or to avoid memory overhead for mapping guest memory into the user page tables. Avoiding memory overhead is completely defeated if the page tables are populated PROT_NONE, which just leaves the "harden against guest data corruption" use case. And for that specific use case, _if_ swap is desirable, it would be far, far easier to teach the kernel to allow KVM to follow PROT_NONE (as Kirill's series did), as opposed to trying to teach the kernel and/or KVM how to swap fd-only memory. In other words, fd-only memory is purely for slice-of-hardware VMs. If someone wants to overcommit VMs, then they use KVM's traditional API for mapping memory into the guest. Regarding putting this in KVM, as mentioned (way) above in the quote, the worst case scenario of making this a KVM ioctl() seems to be that KVM will end up with an ioctl() that redirects to common kernel code. On the flip side, implementing this in KVM gives us a much clearer path to getting all of this merged. There will be a few small non-KVM patches, but they will either be trivial, e.g. exporting APIs (that aren't contentious), or not strictly required, e.g. adding a flag to mark an address space as completely unmigratable (for improved performance). I.e. the non-KVM patches will be small and be very actionable for their maintainers, and other than that we control our own destiny. And on the topic of control, making this a KVM ioctl() and implementation gives KVM a _lot_ of control. E.g. we can make the file size immutable to simplify the implementation, bind the fd to a single VM at creation time, add KVM-defined flags for controlling hugepage behavior, etc. Last but not least, I think forcing us KVM folks to get our hands at least a little dirty with MM and FS stuff would be a *good* thing. KVM has had more than a few bugs that we missed in no small part because most of the people that work on KVM have almost zero knowledge of MM and FS, and especially at the boundaries between those two. Note, I implemented the POC on top of the filemap APIs because that was much faster and less error prone than re-implementing xarray management myself. We may or may not want to actually make the kernel at large aware of these allocations, i.e.
it may be better to follow Hugh's suggestion and use the folio APIs directly instead of piggybacking filemap+folios. E.g. I _think_ migration becomes a complete non-issue if the pages are "raw" allocations and never get marked as LRU-friendly. What I'm not so sure about is if there is anything substantial that we'd lose out on by not reusing the filemap stuff. The POC also doesn't play nice with Ackerley's patches to allocate files on a user-defined mount. AIUI, that's largely a non-issue as something like fbind() would provide a superset of that functionality and more than one maintainer has expressed (handwavy) support for a generic fbind(). Code is available here if folks want to take a look before any kind of formal posting: https://github.com/sean-jc/linux.git x86/kvm_gmem_solo [1] https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner [3] https://lore.kernel.org/linux-mm/20200522125214.31348-1-kirill.shutemov@linux.intel.com
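P.S. For anyone trying to visualize Hugh's "allocate and free kernel memory directly, recorded in your own xarray" option, a minimal sketch (hypothetical names, no locking, not from the branch above) would be raw folios that never touch the page cache or the LRU:

  static struct folio *gmem_alloc_folio(struct xarray *pages, pgoff_t index)
  {
          /* Reuse any folio previously allocated for @index. */
          struct folio *folio = xa_load(pages, index);

          if (folio)
                  return folio;

          /*
           * A raw allocation: the folio is never added to the page cache
           * and never marked LRU, so reclaim/migration never see it.
           * Serialization against concurrent callers is elided here.
           */
          folio = folio_alloc(GFP_KERNEL_ACCOUNT | __GFP_ZERO, 0);
          if (!folio)
                  return NULL;

          if (xa_err(xa_store(pages, index, folio, GFP_KERNEL_ACCOUNT))) {
                  folio_put(folio);
                  return NULL;
          }

          return folio;
  }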
On Mon Apr 17, 2023 at 6:40 PM EEST, Sean Christopherson wrote: > What do y'all think about renaming "restrictedmem" to "guardedmem"? > > I want to start referring to the code/patches by its syscall/implementation name > instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort > and not just the non-KVM code, and (c) will likely be confusing for future reviewers > since there's nothing in the code that mentions "UPM" in any way. > > But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is > already used to refer to "reserved memory". > > Renaming the syscall to "guardedmem"... > > 1. Allows for a shorthand and namespace, "gmem", that isn't already in use by > the kernel (see "reserved memory above"). > > 2. Provides a stronger hint as to its purpose. "Restricted" conveys that the > allocated memory is limited in some way, but doesn't capture how the memory > is restricted, e.g. "restricted" could just as easily mean that the allocation > can be restricted to certain types of backing stores or something. "Guarded" > on the other hand captures that the memory has extra defenses of some form. > > 3. Is shorter to type and speak. Work smart, not hard :-) > > 4. Isn't totally wrong for the KVM use case if someone assumes the "g" means > "guest" when reading mail and whatnot. > > > P.S. I trimmed the Cc/To substantially for this particular discussion to avoid > spamming folks that don't (yet) care about this stuff with another potentially > lengthy thread. Feel free to add (back) any people/lists. I guess 'guarded' could be a good name in the sense that it does not easily get mixed up with anything pre-existing, and it does give an idea of the purpose. BR, Jarkko
On Mon Apr 17, 2023 at 6:48 PM EEST, David Hildenbrand wrote: > On 17.04.23 17:40, Sean Christopherson wrote: > > What do y'all think about renaming "restrictedmem" to "guardedmem"? > > Yeay, let's add more confusion :D > > If we're at renaming, I'd appreciate if we could find a terminology that > does look/sound less horrible. > > > > > I want to start referring to the code/patches by its syscall/implementation name > > instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort > > and not just the non-KVM code, and (c) will likely be confusing for future reviewers > > since there's nothing in the code that mentions "UPM" in any way. > > > > But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is > > already used to refer to "reserved memory". > > > > Renaming the syscall to "guardedmem"... > > restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ... In the world of TEE's and confidential computing it is fairly common to call memory areas enclaves, even outside SGX context. So in that sense enclave memory would be the most correct terminology. BR, Jarkko
On 23.04.23 15:28, Jarkko Sakkinen wrote: > On Mon Apr 17, 2023 at 6:48 PM EEST, David Hildenbrand wrote: >> On 17.04.23 17:40, Sean Christopherson wrote: >>> What do y'all think about renaming "restrictedmem" to "guardedmem"? >> >> Yeay, let's add more confusion :D >> >> If we're at renaming, I'd appreciate if we could find a terminology that >> does look/sound less horrible. >> >>> >>> I want to start referring to the code/patches by its syscall/implementation name >>> instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort >>> and not just the non-KVM code, and (c) will likely be confusing for future reviewers >>> since there's nothing in the code that mentions "UPM" in any way. >>> >>> But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is >>> already used to refer to "reserved memory". >>> >>> Renaming the syscall to "guardedmem"... >> >> restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ... > > In the world of TEE's and confidential computing it is fairly common to > call memory areas enclaves, even outside SGX context. So in that sense > enclave memory would be the most correct terminology. I was also thinking along the lines of isolated_mem or imem ... essentially, isolated from (unprivileged) user space. ... if we still want to have a common syscall for it.
On Fri, May 05, 2023, Ackerley Tng wrote: > > Hi Sean, > > Thanks for implementing this POC! > > I’ve started porting the selftests (both Chao’s and those I added [1]). > > guest mem seems to cover the use cases that have been discussed and > proposed so far, but I still need to figure out how gmem can work with > > + hugetlbfs > + specification of/storing memory policy (for NUMA node bindings) > + memory accounting - we may need to account for memory used separately, > so that guest mem shows up separately on /proc/meminfo and similar > places. > > One issue I’ve found so far is that the pointer to kvm (gmem->kvm) is > not cleaned up, and hence it is possible to crash the host kernel in the > following way > > 1. Create a KVM VM > 2. Create a guest mem fd on that VM > 3. Create a memslot with the guest mem fd (hence binding the fd to the > VM) > 4. Close/destroy the KVM VM > 5. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm > when it tries to do invalidation. > > I then tried to clean up the gmem->kvm pointer during unbinding when the > KVM VM is destroyed. > > That works, but then I realized there’s a simpler way to use the pointer > after freeing: > > 1. Create a KVM VM > 2. Create a guest mem fd on that VM > 3. Close/destroy the KVM VM > 4. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm > when it tries to do invalidation. > > Perhaps binding should mean setting the gmem->kvm pointer in addition to > gmem->bindings. This makes binding and unbinding symmetric and avoids > the use-after-frees described above. Hrm, that would work, though it's a bit convoluted, e.g. would require detecting when the last binding is being removed. A similar (also ugly) solution would be to nullify gmem->kvm when KVM dies. I don't love either approach because it means a file created in the context of a VM can outlive the VM itself, and then userspace ends up with a file descriptor that it can't do anything with except close(). I doubt that matters in practice though, e.g. when the VM dies, all memory can be freed so that the file ends up being little more than a shell. And if we go that route, there's no need to grab a reference to the file during bind; KVM can just grab a longterm reference when the file is initially created and then drop it when KVM dies (and nullifies gmem->kvm). Blech, another wart is that I believe gmem would need to do __module_get() during file creation to prevent kvm.ko from being unloaded after the last VM dies. Ah, but that'd also be true if we went with a system-scoped KVM ioctl(), so I suppose it's not _that_ ugly. Exchanging references (at binding or at creation) doesn't work, because that creates a circular dependency, i.e. gmem and KVM would pin each other. A "proper" refcounting approach, where the file pins KVM and not vice versa, gets nasty because of how KVM's memslots work. The least awful approach I can think of would be to delete the associated memslot(s) when the file is released, possibly via deferred work to avoid deadlock issues. Not the prettiest thing ever and in some ways that'd yield an even worse ABI. Side topic, there's a second bug (and probably more lurking): kvm_swap_active_memslots()'s call to synchronize_srcu_expedited() is done _before_ the call to kvm_gmem_unbind(), i.e. doesn't wait for readers in kvm_gmem_invalidate_begin() to go away. The easy solution for that one is to add another synchronize_srcu_expedited() after unbinding.
> This also means that creating a guest mem fd is no longer dependent on > the VM. Perhaps we can make creating a gmem fd a system ioctl (like > KVM_GET_API_VERSION and KVM_CREATE_VM) instead of a vm ioctl? My preference is to make it a VM-scoped ioctl(), if it ends up being a KVM ioctl() and not a common syscall. If the file isn't tightly coupled to a single VM, then punching a hole is further complicated by needing to deal with invalidating multiple regions that are bound to different @kvm instances. It's not super complex, but AFAICT having the ioctl() be system-scoped doesn't add value, e.g. I don't think having one VM own the memory will complicate things even if/when we get to the point where VMs can share "private" memory, and the gmem code would still need to deal with grabbing a module reference.
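For the sake of discussion, the uAPI for a VM-scoped creation ioctl() could be as simple as the below. Caveat: this is a sketch only; every name and the ioctl number are invented here for illustration, none of this is in the POC branch.

  /* Hypothetical uAPI, names and number for illustration only. */
  struct kvm_create_guest_memfd {
          __u64 size;          /* file size in bytes, immutable after creation */
          __u64 flags;         /* e.g. hugepage policy hints */
          __u64 reserved[6];   /* must be zero */
  };

  #define KVM_CREATE_GUEST_MEMFD  _IOWR(KVMIO, 0xd4, struct kvm_create_guest_memfd)

Userspace would then do something like (guest_mem_bytes being however much memory the VMM wants to back with gmem):

  int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD,
                      &(struct kvm_create_guest_memfd){ .size = guest_mem_bytes });

i.e. the returned fd is tied to the VM for its lifetime, which sidesteps the multi-VM invalidation question entirely.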
On Fri, May 5, 2023 at 5:55 PM Sean Christopherson <seanjc@google.com> wrote: > > ... > My preference is to make it a VM-scoped ioctl(), if it ends up being a KVM ioctl() > and not a common syscall. If the file isn't tightly coupled to a single VM, then > punching a hole is further complicated by needing to deal with invalidating multiple > regions that are bound to different @kvm instances. It's not super complex, but > AFAICT having the ioctl() be system-scoped doesn't add value, e.g. I don't think > having one VM own the memory will complicate even if/when we get to the point where > VMs can share "private" memory, and the gmem code would still need to deal with > grabbing a module reference. Copyless migration would be a scenario where "private" memory may need to be shared between source and target VMs depending on how migration support is implemented. Regards, Vishal
On 5/5/23 22:00, David Hildenbrand wrote: > On 23.04.23 15:28, Jarkko Sakkinen wrote: >> On Mon Apr 17, 2023 at 6:48 PM EEST, David Hildenbrand wrote: >>> On 17.04.23 17:40, Sean Christopherson wrote: >>>> What do y'all think about renaming "restrictedmem" to "guardedmem"? >>> >>> Yeay, let's add more confusion :D >>> >>> If we're at renaming, I'd appreciate if we could find a terminology that >>> does look/sound less horrible. >>> >>>> >>>> I want to start referring to the code/patches by its syscall/implementation name >>>> instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort >>>> and not just the non-KVM code, and (c) will likely be confusing for future reviewers >>>> since there's nothing in the code that mentions "UPM" in any way. >>>> >>>> But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is >>>> already used to refer to "reserved memory". >>>> >>>> Renaming the syscall to "guardedmem"... >>> >>> restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ... >> >> In the world of TEE's and confidential computing it is fairly common to >> call memory areas enclaves, even outside SGX context. So in that sense >> enclave memory would be the most correct terminology. > > I was also thinking along the lines of isolated_mem or imem ... > essentially, isolated from (unprivileged) user space. > > ... if we still want to have a common syscall for it. I'm a fan of the ioctl, if it has a chance of working out.
On 06.05.23 09:44, Vlastimil Babka wrote: > On 5/5/23 22:00, David Hildenbrand wrote: >> On 23.04.23 15:28, Jarkko Sakkinen wrote: >>> On Mon Apr 17, 2023 at 6:48 PM EEST, David Hildenbrand wrote: >>>> On 17.04.23 17:40, Sean Christopherson wrote: >>>>> What do y'all think about renaming "restrictedmem" to "guardedmem"? >>>> >>>> Yeay, let's add more confusion :D >>>> >>>> If we're at renaming, I'd appreciate if we could find a terminology that >>>> does look/sound less horrible. >>>> >>>>> >>>>> I want to start referring to the code/patches by its syscall/implementation name >>>>> instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort >>>>> and not just the non-KVM code, and (c) will likely be confusing for future reviewers >>>>> since there's nothing in the code that mentions "UPM" in any way. >>>>> >>>>> But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is >>>>> already used to refer to "reserved memory". >>>>> >>>>> Renaming the syscall to "guardedmem"... >>>> >>>> restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ... >>> >>> In the world of TEE's and confidential computing it is fairly common to >>> call memory areas enclaves, even outside SGX context. So in that sense >>> enclave memory would be the most correct terminology. >> >> I was also thinking along the lines of isolated_mem or imem ... >> essentially, isolated from (unprivileged) user space. >> >> ... if we still want to have a common syscall for it. > > I'm fan of the ioctl, if it has a chance of working out. Yes, me too.
On Fri, May 05, 2023 at 07:39:36PM +0000, Ackerley Tng wrote: > > Hi Sean, > > Thanks for implementing this POC! > > I’ve started porting the selftests (both Chao’s and those I added [1]). Hi Sean/Ackerley, Thanks for doing that. Overall making gmem a KVM ioctl() looks good to me and it should also play nice with Intel TDX. Besides what Ackerley mentioned below, I think we haven't discussed device assignment, which will be supported in the not-too-distant future. The current VFIO_IOMMU_MAP_DMA consumes virtual addresses, so that needs to be fixed for fd-based memory anyway, and the fix looks unrelated to whether this is a syscall() or a KVM ioctl(). There will be some initialization sequence dependency, e.g. if gmem is finally a VM-scoped ioctl() then the VM needs to be created before fd-based memory can be mapped in VFIO, but that doesn't sound like an issue at all. I also see Vlastimil/David expressed their preference on ioctl. So maybe we can move forward on your current PoC. Do you already have a plan to post a formal version? Chao > > guest mem seems to cover the use cases that have been discussed and > proposed so far, but I still need to figure out how gmem can work with > > + hugetlbfs > + specification of/storing memory policy (for NUMA node bindings) > + memory accounting - we may need to account for memory used separately, > so that guest mem shows up separately on /proc/meminfo and similar > places. > > One issue I’ve found so far is that the pointer to kvm (gmem->kvm) is > not cleaned up, and hence it is possible to crash the host kernel in the > following way > > 1. Create a KVM VM > 2. Create a guest mem fd on that VM > 3. Create a memslot with the guest mem fd (hence binding the fd to the > VM) > 4. Close/destroy the KVM VM > 5. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm > when it tries to do invalidation. > > I then tried to clean up the gmem->kvm pointer during unbinding when the > KVM VM is destroyed. > > That works, but then I realized there’s a simpler way to use the pointer > after freeing: > > 1. Create a KVM VM > 2. Create a guest mem fd on that VM > 3. Close/destroy the KVM VM > 4. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm > when it tries to do invalidation. > > Perhaps binding should mean setting the gmem->kvm pointer in addition to > gmem->bindings. This makes binding and unbinding symmetric and avoids > the use-after-frees described above. > > This also means that creating a guest mem fd is no longer dependent on > the VM. Perhaps we can make creating a gmem fd a system ioctl (like > KVM_GET_API_VERSION and KVM_CREATE_VM) instead of a vm ioctl? > > [1] > https://lore.kernel.org/all/cover.1678926164.git.ackerleytng@google.com/T/ > > Ackerley
On Fri, Apr 21, 2023 at 6:33 PM Sean Christopherson <seanjc@google.com> wrote: > > ... > cold. I poked around a bit to see how we could avoid reinventing all of that > infrastructure for fd-only memory, and the best idea I could come up with is > basically a rehash of Kirill's very original "KVM protected memory" RFC[3], i.e. > allow "mapping" fd-only memory, but ensure that memory is never actually present > from hardware's perspective. > I am most likely missing a lot of context here and possibly venturing into an infeasible/already shot down direction here. But I would still like to get this discussed here before we move on. I am wondering if it would make sense to implement a restricted_mem/guest_mem file to expose both private and shared memory regions, in line with Kirill's original proposal now that the file implementation is controlled by KVM. Thinking from a userspace perspective: 1) Userspace creates guest mem files and is able to mmap them but all accesses to these files result in faults as no memory is allowed to be mapped into userspace VMM pagetables. 2) Userspace registers mmaped HVA ranges with KVM with additional KVM_MEM_PRIVATE flag 3) Userspace converts memory attributes and this memory conversion allows userspace to access shared ranges of the file because those are allowed to be faulted in from guest_mem. Shared to private conversion unmaps the file ranges from userspace VMM pagetables. 4) Granularity of userspace pagetable mappings for shared ranges will have to be dictated by the KVM guest_mem file implementation. Caveat here is that once private pages are mapped into userspace view. Benefits here: 1) Userspace view remains consistent while still being able to use HVA ranges 2) It would be possible to use HVA based APIs from userspace to do things like binding. 3) Double allocation wouldn't be a concern since hva ranges and gpa ranges possibly map to the same HPA ranges. > > Code is available here if folks want to take a look before any kind of formal > posting: > > https://github.com/sean-jc/linux.git x86/kvm_gmem_solo > > [1] https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com > [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner > [3] https://lore.kernel.org/linux-mm/20200522125214.31348-1-kirill.shutemov@linux.intel.com
On Wed, May 10, 2023 at 10:26 AM Vishal Annapurve <vannapurve@google.com> wrote: > > On Fri, Apr 21, 2023 at 6:33 PM Sean Christopherson <seanjc@google.com> wrote: > > > > ... > > cold. I poked around a bit to see how we could avoid reinventing all of that > > infrastructure for fd-only memory, and the best idea I could come up with is > > basically a rehash of Kirill's very original "KVM protected memory" RFC[3], i.e. > > allow "mapping" fd-only memory, but ensure that memory is never actually present > > from hardware's perspective. > > > > I am most likely missing a lot of context here and possibly venturing > into an infeasible/already shot down direction here. But I would still > like to get this discussed here before we move on. > > I am wondering if it would make sense to implement > restricted_mem/guest_mem file to expose both private and shared memory > regions, inline with Kirill's original proposal now that the file > implementation is controlled by KVM. > > Thinking from userspace perspective: > 1) Userspace creates guest mem files and is able to mmap them but all > accesses to these files result into faults as no memory is allowed to > be mapped into userspace VMM pagetables. > 2) Userspace registers mmaped HVA ranges with KVM with additional > KVM_MEM_PRIVATE flag > 3) Userspace converts memory attributes and this memory conversion > allows userspace to access shared ranges of the file because those are > allowed to be faulted in from guest_mem. Shared to private conversion > unmaps the file ranges from userspace VMM pagetables. > 4) Granularity of userspace pagetable mappings for shared ranges will > have to be dictated by KVM guest_mem file implementation. > > Caveat here is that once private pages are mapped into userspace view. > > Benefits here: > 1) Userspace view remains consistent while still being able to use HVA ranges > 2) It would be possible to use HVA based APIs from userspace to do > things like binding. > 3) Double allocation wouldn't be a concern since hva ranges and gpa > ranges possibly map to the same HPA ranges. > I realize now that VFIO IOMMU mappings are not associated with userspace MMU in any way. So this approach does leave the gap of not being able to handle modifications of IOMMU mappings when HVA mappings are modified in userspace page tables. I guess this could be a good enough reason to give up on this option. > > > > Code is available here if folks want to take a look before any kind of formal > > posting: > > > > https://github.com/sean-jc/linux.git x86/kvm_gmem_solo > > > > [1] https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com > > [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner > > [3] https://lore.kernel.org/linux-mm/20200522125214.31348-1-kirill.shutemov@linux.intel.com
On Wed, May 10, 2023, Vishal Annapurve wrote: > On Fri, Apr 21, 2023 at 6:33 PM Sean Christopherson <seanjc@google.com> wrote: > > > > ... > > cold. I poked around a bit to see how we could avoid reinventing all of that > > infrastructure for fd-only memory, and the best idea I could come up with is > > basically a rehash of Kirill's very original "KVM protected memory" RFC[3], i.e. > > allow "mapping" fd-only memory, but ensure that memory is never actually present > > from hardware's perspective. > > > > I am most likely missing a lot of context here and possibly venturing > into an infeasible/already shot down direction here. Both :-) > But I would still like to get this discussed here before we move on. > > I am wondering if it would make sense to implement > restricted_mem/guest_mem file to expose both private and shared memory > regions, inline with Kirill's original proposal now that the file > implementation is controlled by KVM. > > Thinking from userspace perspective: > 1) Userspace creates guest mem files and is able to mmap them but all > accesses to these files result into faults as no memory is allowed to > be mapped into userspace VMM pagetables. Never mapping anything into the userspace page table is infeasible. Technically it's doable, but it'd effectively require all of the work of an fd-based approach (and probably significantly more), _and_ it'd require touching core mm code. VMAs don't provide hva=>pfn information; they're the kernel's way of implementing the abstraction provided to userspace by mmap(), mprotect() etc. Among many other things, a VMA describes properties of what is mapped, e.g. hugetlbfs versus anonymous, where memory is mapped (virtual address), how memory is mapped, e.g. RWX protections, etc. But a VMA doesn't track the physical address; that info is all managed through the userspace page tables. To make it possible to allow userspace to mmap() but not access memory (without redoing how the kernel fundamentally manages virtual=>physical mappings), the simplest approach is to install PTEs into userspace page tables, but never mark them Present in hardware, i.e. prevent actually accessing the backing memory. This is exactly what Kirill's series in link [3] below implemented. Issues that led to us abandoning the "map with special !Present PTEs" approach: - Using page tables, i.e. hardware defined structures, to track gfn=>pfn mappings is inefficient and inflexible compared to software defined structures, especially for the expected use cases for CoCo guests. - The kernel wouldn't _easily_ be able to enforce a 1:1 page:guest association, let alone a 1:1 pfn:gfn mapping. - Does not work for memory that isn't backed by 'struct page', e.g. if devices gain support for exposing encrypted memory regions to guests. - Poking into the VMAs to convert memory would likely be less performant due to using infrastructure that is much "heavier", e.g. would require taking mmap_lock for write. In short, shoehorning this into mmap() requires fighting how the kernel works at pretty much every step, and in the end, adding e.g. fbind() is a lot easier. > 2) Userspace registers mmaped HVA ranges with KVM with additional > KVM_MEM_PRIVATE flag > 3) Userspace converts memory attributes and this memory conversion > allows userspace to access shared ranges of the file because those are > allowed to be faulted in from guest_mem. Shared to private conversion > unmaps the file ranges from userspace VMM pagetables.
> 4) Granularity of userspace pagetable mappings for shared ranges will > have to be dictated by KVM guest_mem file implementation. > > Caveat here is that once private pages are mapped into userspace view. > > Benefits here: > 1) Userspace view remains consistent while still being able to use HVA ranges > 2) It would be possible to use HVA based APIs from userspace to do > things like binding. > 3) Double allocation wouldn't be a concern since hva ranges and gpa > ranges possibly map to the same HPA ranges. #3 isn't entirely correct. If a different process (call it "B") maps shared memory, and then the guest converts that memory from shared to private, the backing pages for the previously shared mapping will still be mapped by process B unless userspace ensures process B also unmaps on conversion. #3 is also a limiter. E.g. if a guest is primarily backed by 1GiB pages, keeping the 1GiB mapping is desirable if the guest converts a few KiB of memory to shared, and possibly even if the guest converts a few MiB of memory. > > Code is available here if folks want to take a look before any kind of formal > > posting: > > > > https://github.com/sean-jc/linux.git x86/kvm_gmem_solo > > > > [1] https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com > > [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner > > [3] https://lore.kernel.org/linux-mm/20200522125214.31348-1-kirill.shutemov@linux.intel.com
On Wed, May 10, 2023 at 2:39 PM Sean Christopherson <seanjc@google.com> wrote: > > On Wed, May 10, 2023, Vishal Annapurve wrote: > > On Fri, Apr 21, 2023 at 6:33 PM Sean Christopherson <seanjc@google.com> wrote: > > > > > > ... > > > cold. I poked around a bit to see how we could avoid reinventing all of that > > > infrastructure for fd-only memory, and the best idea I could come up with is > > > basically a rehash of Kirill's very original "KVM protected memory" RFC[3], i.e. > > > allow "mapping" fd-only memory, but ensure that memory is never actually present > > > from hardware's perspective. > > > > > > > I am most likely missing a lot of context here and possibly venturing > > into an infeasible/already shot down direction here. > > Both :-) > > > But I would still like to get this discussed here before we move on. > > > > I am wondering if it would make sense to implement > > restricted_mem/guest_mem file to expose both private and shared memory > > regions, inline with Kirill's original proposal now that the file > > implementation is controlled by KVM. > > > > Thinking from userspace perspective: > > 1) Userspace creates guest mem files and is able to mmap them but all > > accesses to these files result into faults as no memory is allowed to > > be mapped into userspace VMM pagetables. > > Never mapping anything into the userspace page table is infeasible. Technically > it's doable, but it'd effectively require all of the work of an fd-based approach > (and probably significantly more), _and_ it'd require touching core mm code. > > VMAs don't provide hva=>pfn information, they're the kernel's way of implementing > the abstraction provided to userspace by mmap(), mprotect() etc. Among many other > things, a VMA describes properties of what is mapped, e.g. hugetblfs versus > anonymous, where memory is mapped (virtual address), how memory is mapped, e.g. > RWX protections, etc. But a VMA doesn't track the physical address, that info > is all managed through the userspace page tables. > > To make it possible to allow userspace to mmap() but not access memory (without > redoing how the kernel fundamentally manages virtual=>physical mappings), the > simplest approach is to install PTEs into userspace page tables, but never mark > them Present in hardware, i.e. prevent actually accessing the backing memory. > This is is exactly what Kirill's series in link [3] below implemented. > Maybe it's simpler to do when mmaped regions are backed with files. I see that shmem has fault handlers for accesses to VMA regions associated with the files, In theory a file implementation can always choose to not allocate physical pages for such faults (similar to F_SEAL_FAULT_AUTOALLOCATE that was discussed earlier). > Issues that led to us abandoning the "map with special !Present PTEs" approach: > > - Using page tables, i.e. hardware defined structures, to track gfn=>pfn mappings > is inefficient and inflexible compared to software defined structures, especially > for the expected use cases for CoCo guests. > > - The kernel wouldn't _easily_ be able to enforce a 1:1 page:guest association, > let alone a 1:1 pfn:gfn mapping. Maybe KVM can ensure that each page of the guest_mem file is associated with a single memslot. HVAs when they are registered can be associated with offsets into guest_mem files. > > - Does not work for memory that isn't backed by 'struct page', e.g. if devices > gain support for exposing encrypted memory regions to guests. 
> > - Poking into the VMAs to convert memory would be likely be less performant due > to using infrastructure that is much "heavier", e.g. would require taking > mmap_lock for write. Converting memory doesn't necessarily need to poke holes into VMAs, but rather just unmap page tables, just like what would happen when mmapped files are punched to free the backing file offsets. > > In short, shoehorning this into mmap() requires fighting how the kernel works at > pretty much every step, and in the end, adding e.g. fbind() is a lot easier. > > > 2) Userspace registers mmaped HVA ranges with KVM with additional > > KVM_MEM_PRIVATE flag > > 3) Userspace converts memory attributes and this memory conversion > > allows userspace to access shared ranges of the file because those are > > allowed to be faulted in from guest_mem. Shared to private conversion > > unmaps the file ranges from userspace VMM pagetables. > > 4) Granularity of userspace pagetable mappings for shared ranges will > > have to be dictated by KVM guest_mem file implementation. > > > > Caveat here is that once private pages are mapped into userspace view. > > > > Benefits here: > > 1) Userspace view remains consistent while still being able to use HVA ranges > > 2) It would be possible to use HVA based APIs from userspace to do > > things like binding. > > 3) Double allocation wouldn't be a concern since hva ranges and gpa > > ranges possibly map to the same HPA ranges. > > #3 isn't entirely correct. If a different process (call it "B") maps shared memory, > and then the guest converts that memory from shared to private, the backing pages > for the previously shared mapping will still be mapped by process B unless userspace > also ensures process B also unmaps on conversion. > This should ideally be handled by something like unmap_mapping_range() > #3 is also a limiter. E.g. if a guest is primarly backed by 1GiB pages, keeping > the 1GiB mapping is desirable if the guest converts a few KiB of memory to shared, > and possibly even if the guest converts a few MiB of memory. This caveat can maybe be lived with, as shared ranges most likely will not be backed by 1G pages anyway, possibly causing IO performance to take a hit. This possibly needs more discussion about the conversion granularity used by guests. > > > > Code is available here if folks want to take a look before any kind of formal > > > posting: > > > > > > https://github.com/sean-jc/linux.git x86/kvm_gmem_solo > > > > > > [1] https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com > > > [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner > > > [3] https://lore.kernel.org/linux-mm/20200522125214.31348-1-kirill.shutemov@linux.intel.com
On Wed, May 10, 2023, Vishal Annapurve wrote: > On Wed, May 10, 2023 at 2:39 PM Sean Christopherson <seanjc@google.com> wrote: > > > But I would still like to get this discussed here before we move on. > > > > > > I am wondering if it would make sense to implement > > > restricted_mem/guest_mem file to expose both private and shared memory > > > regions, inline with Kirill's original proposal now that the file > > > implementation is controlled by KVM. > > > > > > Thinking from userspace perspective: > > > 1) Userspace creates guest mem files and is able to mmap them but all > > > accesses to these files result into faults as no memory is allowed to > > > be mapped into userspace VMM pagetables. > > > > Never mapping anything into the userspace page table is infeasible. Technically > > it's doable, but it'd effectively require all of the work of an fd-based approach > > (and probably significantly more), _and_ it'd require touching core mm code. > > > > VMAs don't provide hva=>pfn information, they're the kernel's way of implementing > > the abstraction provided to userspace by mmap(), mprotect() etc. Among many other > > things, a VMA describes properties of what is mapped, e.g. hugetblfs versus > > anonymous, where memory is mapped (virtual address), how memory is mapped, e.g. > > RWX protections, etc. But a VMA doesn't track the physical address, that info > > is all managed through the userspace page tables. > > > > To make it possible to allow userspace to mmap() but not access memory (without > > redoing how the kernel fundamentally manages virtual=>physical mappings), the > > simplest approach is to install PTEs into userspace page tables, but never mark > > them Present in hardware, i.e. prevent actually accessing the backing memory. > > This is is exactly what Kirill's series in link [3] below implemented. > > > > Maybe it's simpler to do when mmaped regions are backed with files. > > I see that shmem has fault handlers for accesses to VMA regions > associated with the files, In theory a file implementation can always > choose to not allocate physical pages for such faults (similar to > F_SEAL_FAULT_AUTOALLOCATE that was discussed earlier). Ah, you're effectively suggesting a hybrid model where the file is the single source of truth for what's private versus shared, and KVM gets pfns through direct communication with the backing store via the file descriptor, but userspace can still control things via mmap() and friends. If you're not suggesting a backdoor, i.e. KVM still gets private pfns via hvas, then we're back at Kirill's series, because otherwise there's no easy way for KVM to retrieve the pfn. A form of this was also discussed, though I don't know how much of the discussion happened on-list. KVM actually does something like this for s390's Ultravisor (UV), which is quite a bit like TDX (UV is a trusted intermediary) except that it handles faults much, much more gracefully. Specifically, when the untrusted host attempts to access a secure page, a fault occurs and the kernel responds by telling UV to export the page. The fault is gracefully handled even for kernel accesses (see do_secure_storage_access()). The kernel does BUG() if the export fails when handling a fault from kernel context, but my understanding is that export can fail if and only if there's a fatal error elsewhere, i.e. the UV essentially _ensures_ success, and goes straight to BUG()/panic() if something goes wrong.
On the guest side, accesses to exported (swapped) secure pages generate intercepts and KVM faults in the page. To do so, KVM freezes the page/folio refcount, tells the UV to import the page, and then unfreezes the page/folio. But very crucially, when _anything_ in the untrusted host attempts to access the secure page, the above fault handling for untrusted host accesses kicks in. In other words, the guest can cause thrash, but can't bring down the host. TDX on the other hand silently poisons memory, i.e. doesn't even generate a synchronous fault. Thus the kernel needs to be 100% perfect on preventing _any_ accesses to private memory from the host, and doing that is non-trivial and invasive. SNP does synchronously fault, but automatically converting in the #PF handler got NAK'd[*] for good reasons, e.g. SNP doesn't guarantee conversion success as the guest can trigger concurrent RMP modifications. So the end result ends up being the same as TDX: host accesses need to be completely prevented. Again, this is all doable, but costly. And IMO, provides very little value. Allowing things like mbind() is nice-to-have at best, as implementing fbind() isn't straightforward, and fbind() is arguably valuable to have irrespective of this discussion, e.g. to allow userspace to say "use this policy regardless of what process maps the file". Using a common memory pool (same physical page is used for both shared and private) is a similar story. There are plenty of existing controls to limit userspace/guest memory usage and to deal with OOM scenarios, so barring egregious host accounting and/or limiting bugs, which would affect _all_ VM types, the worst case scenario is that a VM is terminated because host userspace is buggy. On the flip side, using a common pool brings complexity into the kernel, as backing stores would need to be taught to deny access to a subset of pages in their mappings, and in multiple paths, e.g. faults, read()/write() and similar, page migration, swap, etc. [*] https://lore.kernel.org/linux-mm/8a244d34-2b10-4cf8-894a-1bf12b59cf92@www.fastmail.com > > Issues that led to us abandoning the "map with special !Present PTEs" approach: > > > > - Using page tables, i.e. hardware defined structures, to track gfn=>pfn mappings > > is inefficient and inflexible compared to software defined structures, especially > > for the expected use cases for CoCo guests. > > > > - The kernel wouldn't _easily_ be able to enforce a 1:1 page:guest association, > > let alone a 1:1 pfn:gfn mapping. > > Maybe KVM can ensure that each page of the guest_mem file is > associated with a single memslot. This is a hard NAK. Guest physical address space is guaranteed to have holes and/or be discontiguous, for the PCI hole at the top of lower memory. Allowing only a single binding would prevent userspace from backing all (or large chunks) of guest memory with a single file. > HVAs when they are registered can be associated with offsets into guest_mem files. Enforcing 1:1 assocations is doable if KVM inserts a shim/interposer, e.g. essentially implements the exclusivity bits of restrictedmem. But that's adding even more complexity. > > - Does not work for memory that isn't backed by 'struct page', e.g. if devices > > gain support for exposing encrypted memory regions to guests. > > > > - Poking into the VMAs to convert memory would be likely be less performant due > > to using infrastructure that is much "heavier", e.g. would require taking > > mmap_lock for write.
> > Converting memory doesn't necessarily need to poke holes into VMA, but > rather just unmap pagetables just like what would happen when mmapped > files are punched to free the backing file offsets. Sorry, bad choice of word on my part. I didn't intend to imply poking holes, in this case I used "poking" to mean "modifying". munmap(), mprotect(), etc. all require modifying VMAs, which means taking mmap_lock for write. > > In short, shoehorning this into mmap() requires fighting how the kernel works at > > pretty much every step, and in the end, adding e.g. fbind() is a lot easier. > > > > > 2) Userspace registers mmaped HVA ranges with KVM with additional > > > KVM_MEM_PRIVATE flag > > > 3) Userspace converts memory attributes and this memory conversion > > > allows userspace to access shared ranges of the file because those are > > > allowed to be faulted in from guest_mem. Shared to private conversion > > > unmaps the file ranges from userspace VMM pagetables. > > > 4) Granularity of userspace pagetable mappings for shared ranges will > > > have to be dictated by KVM guest_mem file implementation. > > > > > > Caveat here is that once private pages are mapped into userspace view. > > > > > > Benefits here: > > > 1) Userspace view remains consistent while still being able to use HVA ranges > > > 2) It would be possible to use HVA based APIs from userspace to do > > > things like binding. > > > 3) Double allocation wouldn't be a concern since hva ranges and gpa > > > ranges possibly map to the same HPA ranges. > > > > #3 isn't entirely correct. If a different process (call it "B") maps shared memory, > > and then the guest converts that memory from shared to private, the backing pages > > for the previously shared mapping will still be mapped by process B unless userspace > > also ensures process B also unmaps on conversion. > > > > This should be ideally handled by something like: unmap_mapping_range() That'd work for the hybrid model (fd backdoor with pseudo mmap() support), but not for a generic VMA-based implementation. If the file isn't the single source of truth, then forcing all mappings to go away simply can't work. > > #3 is also a limiter. E.g. if a guest is primarly backed by 1GiB pages, keeping > > the 1GiB mapping is desirable if the guest converts a few KiB of memory to shared, > > and possibly even if the guest converts a few MiB of memory. > > This caveat maybe can be lived with as shared ranges most likely will > not be backed by 1G pages anyways, possibly causing IO performance to > get hit. This possibly needs more discussion about conversion > granularity used by guests. Yes, it's not the end of the world. My point is that separating shared and private memory provides more flexibility. Maybe that flexibility never ends up being super important, but at the same time we shouldn't willingly paint ourselves into a corner.
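And to be fair to the unmap_mapping_range() suggestion: under the hybrid model, where the file _is_ the single source of truth, zapping userspace on conversion really would be nearly a one-liner, e.g. something like the below (hand-wavy wrapper, name invented here). Note that it zaps the range in _every_ process that maps the file, which is what makes the "process B" case above work.

  static void gmem_zap_shared_range(struct address_space *mapping,
                                    pgoff_t start, pgoff_t nr_pages)
  {
          /*
           * Zap userspace PTEs for the converted range in all VMAs that map
           * the file; even_cows=1 also zaps private COW copies.
           */
          unmap_mapping_range(mapping, (loff_t)start << PAGE_SHIFT,
                              (loff_t)nr_pages << PAGE_SHIFT, 1);
  }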
On Fri, Apr 21, 2023 at 06:33:26PM -0700, Sean Christopherson wrote: > > Code is available here if folks want to take a look before any kind of formal > posting: > > https://github.com/sean-jc/linux.git x86/kvm_gmem_solo Hi Sean, I've been working on getting the SNP patches ported to this but I'm having some trouble working out a reasonable scheme for how to work the RMPUPDATE hooks into the proposed design. One of the main things is kvm_gmem_punch_hole(): this can free pages back to the host whenever userspace feels like it. Pages that are still marked private in the RMP table will blow up the host if they aren't returned to the normal state before handing them back to the kernel. So I'm trying to add a hook, orchestrated by kvm_arch_gmem_invalidate(), to handle that, e.g.:

  static long kvm_gmem_punch_hole(struct file *file, int mode, loff_t offset,
                                  loff_t len)
  {
          struct kvm_gmem *gmem = file->private_data;
          pgoff_t start = offset >> PAGE_SHIFT;
          pgoff_t end = (offset + len) >> PAGE_SHIFT;
          struct kvm *kvm = gmem->kvm;

          /*
           * Bindings must be stable across invalidation to ensure the start+end
           * are balanced.
           */
          filemap_invalidate_lock(file->f_mapping);
          kvm_gmem_invalidate_begin(kvm, gmem, start, end);

          /* Handle arch-specific cleanups before releasing pages */
          kvm_arch_gmem_invalidate(kvm, gmem, start, end);
          truncate_inode_pages_range(file->f_mapping, offset, offset + len);

          kvm_gmem_invalidate_end(kvm, gmem, start, end);
          filemap_invalidate_unlock(file->f_mapping);

          return 0;
  }

But there's another hook, kvm_arch_gmem_set_mem_attributes(), needed to put the page in its intended state in the RMP table prior to mapping it into the guest's NPT. Currently I'm calling that hook via kvm_vm_ioctl_set_mem_attributes(), just after kvm->mem_attr_array is updated based on the ioctl. The reasoning there is that KVM MMU can then rely on the existing mmu_invalidate_seq logic to ensure both the state in the mem_attr_array and the RMP table are in sync and up-to-date once MMU lock is acquired and MMU is ready to map it, or retry #NPF otherwise. But for kvm_gmem_punch_hole(), kvm_vm_ioctl_set_mem_attributes() can potentially result in something like the following sequence if I implement things as above:

  CPU0: kvm_gmem_punch_hole():
          kvm_gmem_invalidate_begin()
          kvm_arch_gmem_invalidate()           // set pages to default/shared state in RMP table before free'ing
  CPU1: kvm_vm_ioctl_set_mem_attributes():
          kvm_arch_gmem_set_mem_attributes()   // maliciously set pages to private in RMP table
  CPU0:   truncate_inode_pages_range()         // HOST BLOWS UP TOUCHING PRIVATE PAGES
          kvm_gmem_invalidate_end()

One quick and lazy solution is to rely on the fact that kvm_vm_ioctl_set_mem_attributes() holds the kvm->slots_lock throughout the entire begin()/end() portion of the invalidation sequence, and to similarly hold the kvm->slots_lock throughout the begin()/end() sequence in kvm_gmem_punch_hole() to prevent any interleaving. But I'd imagine overloading kvm->slots_lock is not the proper approach. Would introducing a similar mutex to keep these operations grouped/atomic be a reasonable approach to you, or should we be doing something else entirely here? Keep in mind that RMP updates can't be done while holding the kvm->mmu_lock spinlock, because we also need to unmap pages from the directmap, which can lead to scheduling-while-atomic BUG()s[1], so that's another constraint we need to work around. Thanks!
-Mike [1] https://lore.kernel.org/linux-coco/20221214194056.161492-7-michael.roth@amd.com/T/#m45a1af063aa5ac0b9314d6a7d46eecb1253bba7a > > [1] https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com > [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner > [3] https://lore.kernel.org/linux-mm/20200522125214.31348-1-kirill.shutemov@linux.intel.com
On Thu, May 11, 2023, Michael Roth wrote: > On Fri, Apr 21, 2023 at 06:33:26PM -0700, Sean Christopherson wrote: > > > > Code is available here if folks want to take a look before any kind of formal > > posting: > > > > https://github.com/sean-jc/linux.git x86/kvm_gmem_solo > > Hi Sean, > > I've been working on getting the SNP patches ported to this but I'm having > some trouble working out a reasonable scheme for how to work the > RMPUPDATE hooks into the proposed design. > > One of the main things is kvm_gmem_punch_hole(): this is can free pages > back to the host whenever userspace feels like it. Pages that are still > marked private in the RMP table will blow up the host if they aren't returned > to the normal state before handing them back to the kernel. So I'm trying to > add a hook, orchestrated by kvm_arch_gmem_invalidate(), to handle that, > e.g.: > > static long kvm_gmem_punch_hole(struct file *file, int mode, loff_t offset, > loff_t len) > { > struct kvm_gmem *gmem = file->private_data; > pgoff_t start = offset >> PAGE_SHIFT; > pgoff_t end = (offset + len) >> PAGE_SHIFT; > struct kvm *kvm = gmem->kvm; > > /* > * Bindings must stable across invalidation to ensure the start+end > * are balanced. > */ > filemap_invalidate_lock(file->f_mapping); > kvm_gmem_invalidate_begin(kvm, gmem, start, end); > > /* Handle arch-specific cleanups before releasing pages */ > kvm_arch_gmem_invalidate(kvm, gmem, start, end); > truncate_inode_pages_range(file->f_mapping, offset, offset + len); > > kvm_gmem_invalidate_end(kvm, gmem, start, end); > filemap_invalidate_unlock(file->f_mapping); > > return 0; > } > > But there's another hook, kvm_arch_gmem_set_mem_attributes(), needed to put > the page in its intended state in the RMP table prior to mapping it into the > guest's NPT. IMO, this approach is wrong. kvm->mem_attr_array is the source of truth for whether userspace wants _guest_ physical pages mapped private vs. shared, but the attributes array has zero insight into the _host_ physical pages. I.e. SNP shouldn't hook kvm_mem_attrs_changed(), because operating on the RMP from that code is fundamentally wrong. A good analogy is moving a memslot (ignoring that AFAIK no VMM actually moves memslots, but it's a good analogy for KVM internals). KVM needs to zap all mappings for the old memslot gfn, but KVM does not create mappings for the new memslot gfn. Same for changing attributes; unmap, but never map. As for the unmapping side of things, kvm_unmap_gfn_range() will unmap all relevant NPT entries, and the elevated mmu_invalidate_in_progress will prevent KVM from establishing a new NPT mapping. And mmu_invalidate_in_progress will reach '0' only after both truncation _and_ kvm_vm_ioctl_set_mem_attributes() complete, i.e. KVM can create new mappings only when both kvm->mem_attr_array and any relevant guest_mem bindings have reached steady state. That leaves the question of when/where to do RMP updates. Off the cuff, I think RMP updates (and I _think_ also TDX page conversions) should _always_ be done in the context of either (a) file truncation (make the pages host owned, a.k.a. TDX reclaim) or (b) allocating a new page/folio in guest_mem, a.k.a. kvm_gmem_get_folio(). Under the hood, even though the gfn is the same, the backing pfn is different, i.e. installing a shared mapping should _never_ need to touch the RMP because pages coming from the normal (non-guest_mem) pool must already be host owned.
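To sketch option (b), with a completely hypothetical kvm_arch_gmem_prepare() hook (zeroing and error handling elided), the arch call would live in the allocation path, not in the attributes path:

  static struct folio *kvm_gmem_get_folio(struct file *file, pgoff_t index)
  {
          struct folio *folio = filemap_grab_folio(file->f_mapping, index);

          if (!folio)
                  return NULL;

          if (!folio_test_uptodate(folio)) {
                  /*
                   * First allocation of this index: let the arch make the
                   * pfn guest owned (e.g. RMPUPDATE for SNP), paired with
                   * kvm_arch_gmem_invalidate() at truncation time.
                   */
                  kvm_arch_gmem_prepare(file, index, folio_pfn(folio));
                  folio_mark_uptodate(folio);
          }

          return folio;
  }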
> Currently I'm calling that hook via kvm_vm_ioctl_set_mem_attributes(), just > after kvm->mem_attr_array is updated based on the ioctl. The reasoning there > is that KVM MMU can then rely on the existing mmu_invalidate_seq logic to > ensure both the state in the mem_attr_array and the RMP table are in sync and > up-to-date once MMU lock is acquired and MMU is ready to map it, or retry > #NPF otherwise. > > But for kvm_gmem_punch_hole(), kvm_vm_ioctl_set_mem_attributes() can potentially > result in something like the following sequence if I implement things as above: > > CPU0: kvm_gmem_punch_hole(): > kvm_gmem_invalidate_begin() > kvm_arch_gmem_invalidate() // set pages to default/shared state in RMP table before free'ing > CPU1: kvm_vm_ioctl_set_mem_attributes(): > kvm_arch_gmem_set_mem_attributes() // maliciously set pages to private in RMP table > CPU0: truncate_inode_pages_range() // HOST BLOWS UP TOUCHING PRIVATE PAGES > kvm_arch_gmem_invalidate_end() > > One quick and lazy solution is to rely on the fact that > kvm_vm_ioctl_set_mem_attributes() holds the kvm->slots_lock throughout the > entire begin()/end() portion of the invalidation sequence, and to similarly > hold the kvm->slots_lock throughout the begin()/end() sequence in > kvm_gmem_punch_hole() to prevent any interleaving. > > But I'd imagine overloading kvm->slots_lock is not the proper approach. But > would introducing a similar mutex to keep these operations grouped/atomic be > a reasonable approach to you, or should we be doing something else entirely > here? > > Keep in mind that RMP updates can't be done while holding KVM->mmu_lock > spinlock, because we also need to unmap pages from the directmap, which can > lead to scheduling-while-atomic BUG()s[1], so that's another constraint we > need to work around. > > Thanks! > > -Mike > > [1] https://lore.kernel.org/linux-coco/20221214194056.161492-7-michael.roth@amd.com/T/#m45a1af063aa5ac0b9314d6a7d46eecb1253bba7a > > > > > [1] https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com > > [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner > > [3] https://lore.kernel.org/linux-mm/20200522125214.31348-1-kirill.shutemov@linux.intel.com
On Fri, May 05, 2023, Sean Christopherson wrote:
> On Fri, May 05, 2023, Ackerley Tng wrote:
> > One issue I’ve found so far is that the pointer to kvm (gmem->kvm) is not cleaned up, and hence it is possible to crash the host kernel in the following way
> >
> > 1. Create a KVM VM
> > 2. Create a guest mem fd on that VM
> > 3. Create a memslot with the guest mem fd (hence binding the fd to the VM)
> > 4. Close/destroy the KVM VM
> > 5. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm when it tries to do invalidation.
> >
> > I then tried to clean up the gmem->kvm pointer during unbinding when the KVM VM is destroyed.
> >
> > That works, but then I realized there’s a simpler way to use the pointer after freeing:
> >
> > 1. Create a KVM VM
> > 2. Create a guest mem fd on that VM
> > 3. Close/destroy the KVM VM
> > 4. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm when it tries to do invalidation.
> >
> > Perhaps binding should mean setting the gmem->kvm pointer in addition to gmem->bindings. This makes binding and unbinding symmetric and avoids the use-after-frees described above.
>
> Hrm, that would work, though it's a bit convoluted, e.g. would require detecting when the last binding is being removed. A similar (also ugly) solution would be to nullify gmem->kvm when KVM dies.
>
> I don't love either idea because it means a file created in the context of a VM can outlive the VM itself, and then userspace ends up with a file descriptor that it can't do anything with except close(). I doubt that matters in practice though, e.g. when the VM dies, all memory can be freed so that the file ends up being little more than a shell. And if we go that route, there's no need to grab a reference to the file during bind, KVM can just grab a longterm reference when the file is initially created and then drop it when KVM dies (and nullifies gmem->kvm).
>
> Blech, another wart is that I believe gmem would need to do __module_get() during file creation to prevent kvm.ko from being unloaded after the last VM dies. Ah, but that'd also be true if we went with a system-scoped KVM ioctl(), so I suppose it's not _that_ ugly.
>
> Exchanging references (at binding or at creation) doesn't work, because that creates a circular dependency, i.e. gmem and KVM would pin each other.
>
> A "proper" refcounting approach, where the file pins KVM and not vice versa, gets nasty because of how KVM's memslots work. The least awful approach I can think of would be to delete the associated memslot(s) when the file is released, possibly via deferred work to avoid deadlock issues. Not the prettiest thing ever and in some ways that'd yield an even worse ABI.

Circling back to this. Pending testing, the "proper" refcounting approach seems to be the way to go. KVM's existing memslots actually work this way, e.g. if userspace does munmap() on an active memslot, KVM zaps any PTEs but the memslot stays alive. A similar approach can be taken here, the only wrinkle is that the association between gmem and the memslot is stronger than between VMAs and memslots, specifically KVM binds the file and not simply the file descriptor. This is necessary because not binding to an exact file would let userspace install a different file at the file descriptor. That's solvable without having to delete memslots though, e.g.
by protecting the file pointer in the memslot with RCU and directly bumping the refcount in the two places where KVM needs to get at gmem (the file) via the memslot (unbind and faulting in a page). E.g.

  static struct file *kvm_gmem_get_file(struct kvm_memory_slot *slot)
  {
          struct file *file;

          rcu_read_lock();

          file = rcu_dereference(slot->gmem.file);
          if (file && !get_file_rcu(file))
                  file = NULL;

          rcu_read_unlock();

          return file;
  }

The gotcha is that ->release could race with memslot deletion, as kvm_gmem_unbind() won't be able to differentiate between "file was deleted" and "file is currently being deleted". That's easy enough to deal with though, kvm_gmem_release() can take slots_lock to prevent the memslot from going away when nullifying and invalidating ranges for the memslot.

> Side topic, there's a second bug (and probably more lurking): kvm_swap_active_memslots()'s call to synchronize_srcu_expedited() is done _before_ the call to kvm_gmem_unbind(), i.e. doesn't wait for readers in kvm_gmem_invalidate_begin() to go away. The easy solution for that one is to add another synchronize_srcu_expedited() after unbinding.

There's a bug here, but not the one I pointed out. Acquiring kvm->srcu doesn't provide any protection, the binding already has a pointer to the memslot, i.e. isn't doing an SRCU-protected lookup in the memslots. The actual protection is provided by the filemap invalidate lock, which prevents unbinding a memslot until all invalidations complete, i.e. acquiring kvm->srcu in the punch hole path is completely unnecessary.
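A rough sketch of the ->release side under that scheme (untested; the bindings xarray and the gmem fields are assumptions extrapolated from the POC, not actual code):

  /*
   * Sketch: serialize ->release against memslot deletion/unbind via
   * slots_lock, and nullify the RCU-protected file pointer so that
   * kvm_gmem_get_file() starts failing before the file is freed.
   */
  static int kvm_gmem_release(struct inode *inode, struct file *file)
  {
          struct kvm_gmem *gmem = file->private_data;
          struct kvm_memory_slot *slot;
          struct kvm *kvm = gmem->kvm;
          unsigned long index;

          mutex_lock(&kvm->slots_lock);

          xa_for_each(&gmem->bindings, index, slot)
                  rcu_assign_pointer(slot->gmem.file, NULL);
          synchronize_rcu();

          /* Zap and invalidate everything bound to this file. */
          kvm_gmem_invalidate_begin(kvm, gmem, 0, -1ul);
          kvm_gmem_invalidate_end(kvm, gmem, 0, -1ul);

          mutex_unlock(&kvm->slots_lock);

          xa_destroy(&gmem->bindings);
          kfree(gmem);
          return 0;
  }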
On Thu, May 11, 2023 at 1:22 PM Sean Christopherson <seanjc@google.com> wrote:
> ...
> Ah, you're effectively suggesting a hybrid model where the file is the single source of truth for what's private versus shared, and KVM gets pfns through direct communication with the backing store via the file descriptor, but userspace can still control things via mmap() and friends.
>
> If you're not suggesting a backdoor, i.e. KVM still gets private pfns via hvas, then we're back at Kirill's series, because otherwise there's no easy way for KVM to retrieve the pfn.

Yeah, I was hinting towards using the backdoor, where KVM still gets private pfns via HVAs.

> A form of this was also discussed, though I don't know how much of the discussion happened on-list.
>
> KVM actually does something like this for s390's Ultravisor (UV), which is quite a bit like TDX (UV is a trusted intermediary) except that it handles faults much, much more gracefully. Specifically, when the untrusted host attempts to access a secure page, a fault occurs and the kernel responds by telling UV to export the page. The fault is gracefully handled even for kernel accesses (see do_secure_storage_access()). The kernel does BUG() if the export fails when handling a fault from kernel context, but my understanding is that export can fail if and only if there's a fatal error elsewhere, i.e. the UV essentially _ensures_ success, and goes straight to BUG()/panic() if something goes wrong.
>
> On the guest side, accesses to exported (swapped) secure pages generate intercepts and KVM faults in the page. To do so, KVM freezes the page/folio refcount, tells the UV to import the page, and then unfreezes the page/folio. But very crucially, when _anything_ in the untrusted host attempts to access the secure page, the above fault handling for untrusted host accesses kicks in. In other words, the guest can cause thrash, but can't bring down the host.

Yeah, this is very similar to what I was trying to propose. Except in this case, the backing store, i.e. guest_mem, will have to leave the fault unhandled for untrusted host accesses to private ranges of the guest_mem file.

> TDX on the other hand silently poisons memory, i.e. doesn't even generate a synchronous fault. Thus the kernel needs to be 100% perfect on preventing _any_ accesses to private memory from the host, and doing that is non-trivial and invasive.
>
> SNP does synchronously fault, but automatically converting in the #PF handler got NAK'd[*] for good reasons, e.g. SNP doesn't guarantee conversion success as the guest can trigger concurrent RMP modifications. So the end result ends up being the same as TDX, host accesses need to be completely prevented.
>
> Again, this is all doable, but costly. And IMO, provides very little value.

With this hybrid approach with the backdoor access to pfns from KVM, do we see a scenario where the host can bypass the guest_mem restrictions and still be able to access the private ranges using HVA ranges? One possibility is that these pages are mapped in the IOMMU (when they are shared) and then get converted to private without getting unmapped from the IOMMU. Maybe KVM can disallow converting ranges which are pinned for DMA (not sure if there is a way to do that).

Few additional benefits here:
1) Possibly handle the pkvm usecase in this series without the need for additional modifications.
2) Handling UPM for normal VMs possibly could get simpler as this hybrid approach can allow preserving the contents across conversions.

> Allowing things like mbind() is nice-to-have at best, as implementing fbind() isn't straightforward and arguably valuable to have irrespective of this discussion, e.g. to allow userspace to say "use this policy regardless of what process maps the file".

Agreed, having mbind supported is not a significant gain given the cost here.

> Using a common memory pool (same physical page is used for both shared and private) is a similar story. There are plenty of existing controls to limit userspace/guest memory usage and to deal with OOM scenarios, so barring egregious host accounting and/or limiting bugs, which would affect _all_ VM types, the worst case scenario is that a VM is terminated because host userspace is buggy. On the flip side, using a common pool brings complexity into the kernel, as backing stores would need to be taught to deny access to a subset of pages in their mappings, and in multiple paths, e.g. faults, read()/write() and similar, page migration, swap, etc.

In this case the backing store that needs to be modified would just be guest_mem though.

> [*] https://lore.kernel.org/linux-mm/8a244d34-2b10-4cf8-894a-1bf12b59cf92@www.fastmail.com
>
> > > Issues that led to us abandoning the "map with special !Present PTEs" approach:
> > >
> > > - Using page tables, i.e. hardware defined structures, to track gfn=>pfn mappings is inefficient and inflexible compared to software defined structures, especially for the expected use cases for CoCo guests.
> > >
> > > - The kernel wouldn't _easily_ be able to enforce a 1:1 page:guest association, let alone a 1:1 pfn:gfn mapping.
> >
> > Maybe KVM can ensure that each page of the guest_mem file is associated with a single memslot.
>
> This is a hard NAK. Guest physical address space is guaranteed to have holes and/or be discontiguous, e.g. the PCI hole at the top of lower memory. Allowing only a single binding would prevent userspace from backing all (or large chunks) of guest memory with a single file.

Poor choice of words from my side. I meant to suggest that KVM can ensure that ANY page of the guest_mem file is associated with at max one memslot.

> ...
> That'd work for the hybrid model (fd backdoor with pseudo mmap() support), but not for a generic VMA-based implementation. If the file isn't the single source of truth, then forcing all mappings to go away simply can't work.
>
> > > #3 is also a limiter. E.g. if a guest is primarily backed by 1GiB pages, keeping the 1GiB mapping is desirable if the guest converts a few KiB of memory to shared, and possibly even if the guest converts a few MiB of memory.
> >
> > This caveat maybe can be lived with as shared ranges most likely will not be backed by 1G pages anyways, possibly causing IO performance to get hit. This possibly needs more discussion about conversion granularity used by guests.
>
> Yes, it's not the end of the world. My point is that separating shared and private memory provides more flexibility. Maybe that flexibility never ends up being super important, but at the same time we shouldn't willingly paint ourselves into a corner.

There are some performance implications with the split approach: this flexibility comes at the cost of having to manage double allocation effectively.
As the granularity of mappings increases for the shared memory, it gets difficult to cap the amount of double allocation. So effectively it comes down to always using smaller granularity for shared memory, and also for the private memory of converted ranges. In general, performance requirements will always push for higher mapping granularities, depending on the scale of usage.

In general, private memory (and also the shared memory on respective conversions) will always need to be hole punched to ensure that double allocation won't happen. And so, even if this is something for the future, using hugetlbfs pages for backing private memory with the split model effectively makes it impossible to cap the double allocation. I am not sure if 1G pages can be handled better with the hybrid model, but maybe it's worth checking.

The split shared/private mem approach also increases the uncertainties around memory management in general, where the same amount of memory that was available earlier is first freed to the system and then allocated back from the system. E.g. even if hugepages were around when private memory was initially allocated, further allocations keep increasing the likelihood of not being able to use a huge page to back the memory, even if the whole huge page is private/shared.
On Fri, May 12, 2023 at 11:01:10AM -0700, Sean Christopherson wrote:
> On Thu, May 11, 2023, Michael Roth wrote:
> > On Fri, Apr 21, 2023 at 06:33:26PM -0700, Sean Christopherson wrote:
> > >
> > > Code is available here if folks want to take a look before any kind of formal posting:
> > >
> > >   https://github.com/sean-jc/linux.git x86/kvm_gmem_solo
> >
> > Hi Sean,
> >
> > I've been working on getting the SNP patches ported to this but I'm having some trouble working out a reasonable scheme for how to work the RMPUPDATE hooks into the proposed design.
> >
> > One of the main things is kvm_gmem_punch_hole(): this can free pages back to the host whenever userspace feels like it. Pages that are still marked private in the RMP table will blow up the host if they aren't returned to the normal state before handing them back to the kernel. So I'm trying to add a hook, orchestrated by kvm_arch_gmem_invalidate(), to handle that, e.g.:
> >
> > static long kvm_gmem_punch_hole(struct file *file, int mode, loff_t offset,
> >                                 loff_t len)
> > {
> >         struct kvm_gmem *gmem = file->private_data;
> >         pgoff_t start = offset >> PAGE_SHIFT;
> >         pgoff_t end = (offset + len) >> PAGE_SHIFT;
> >         struct kvm *kvm = gmem->kvm;
> >
> >         /*
> >          * Bindings must be stable across invalidation to ensure the start+end
> >          * are balanced.
> >          */
> >         filemap_invalidate_lock(file->f_mapping);
> >         kvm_gmem_invalidate_begin(kvm, gmem, start, end);
> >
> >         /* Handle arch-specific cleanups before releasing pages */
> >         kvm_arch_gmem_invalidate(kvm, gmem, start, end);
> >         truncate_inode_pages_range(file->f_mapping, offset, offset + len);
> >
> >         kvm_gmem_invalidate_end(kvm, gmem, start, end);
> >         filemap_invalidate_unlock(file->f_mapping);
> >
> >         return 0;
> > }
> >
> > But there's another hook, kvm_arch_gmem_set_mem_attributes(), needed to put the page in its intended state in the RMP table prior to mapping it into the guest's NPT.
>
> IMO, this approach is wrong. kvm->mem_attr_array is the source of truth for whether userspace wants _guest_ physical pages mapped private vs. shared, but the attributes array has zero insight into the _host_ physical pages. I.e. SNP shouldn't hook kvm_mem_attrs_changed(), because operating on the RMP from that code is fundamentally wrong.
>
> A good analogy is moving a memslot (ignoring that AFAIK no VMM actually moves memslots, but it's a good analogy for KVM internals). KVM needs to zap all mappings for the old memslot gfn, but KVM does not create mappings for the new memslot gfn. Same for changing attributes; unmap, but never map.
>
> As for the unmapping side of things, kvm_unmap_gfn_range() will unmap all relevant NPT entries, and the elevated mmu_invalidate_in_progress will prevent KVM from establishing a new NPT mapping. And mmu_invalidate_in_progress will reach '0' only after both truncation _and_ kvm_vm_ioctl_set_mem_attributes() complete, i.e. KVM can create new mappings only when both kvm->mem_attr_array and any relevant guest_mem bindings have reached steady state.
>
> That leaves the question of when/where to do RMP updates. Off the cuff, I think RMP updates (and I _think_ also TDX page conversions) should _always_ be done in the context of either (a) file truncation (make host owned, a.k.a. TDX reclaim) or (b) allocating a new page/folio in guest_mem, a.k.a. kvm_gmem_get_folio(). Under the hood, even though the gfn is the same, the backing pfn is different, i.e. installing a shared mapping should _never_ need to touch the RMP because pages coming from the normal (non-guest_mem) pool must already be host owned.

Hi Sean, thanks for the suggestions. I reworked things based on this approach and things seem to work out pretty nicely for SNP.

I needed to add the hook to kvm_gmem_get_pfn() instead of kvm_gmem_get_folio() because SNP needs to know the GFN in order to mark the page as private in the RMP table, but otherwise I think things are the same as what you had in mind.

One downside to this approach is since the hook always gets called during kvm_gmem_get_pfn(), we need to do an extra RMP lookup to determine whether or not that page has already been set to private state, vs. being able to assume it's already been put in the expected state, but it's only a memory access so not a huge overhead. Not sure if that would be a concern or not on the TDX side though.

I put together a tree with some fixups that are needed against the kvm_gmem_solo base tree, and a set of hooks to handle invalidations, preparing the initial private state as suggested above, and a platform-configurable mask that the x86 MMU code can use for determining whether a fault is for private vs. shared pages.

  KVM: x86: Determine shared/private faults using a configurable mask
   ^ for TDX we could trivially add an inverted analogue of the mask/logic
  KVM: x86: Use full 64-bit error code for kvm_mmu_do_page_fault
  KVM: x86: Add platform hooks for private memory invalidations
  KVM: x86: Add platform hook for initializing private memory
  *fixup (kvm_gmem_solo): KVM: Fix end range calculation for MMU invalidations
  *fixup (kvm_gmem_solo): KVM: selftests: update kvm_create_guest_memfd struct usage

  https://github.com/mdroth/linux/commits/kvm_gmem_solo_x86

I'm hoping these are similarly usable for TDX, but could use some input from TDX folks on that aspect.

> > Keep in mind that RMP updates can't be done while holding the kvm->mmu_lock spinlock, because we also need to unmap pages from the directmap, which can lead to scheduling-while-atomic BUG()s[1], so that's another constraint we need to work around.

This concern also ends up going away since GFP_RECLAIM also has similar issues when called under kvm->mmu_lock, so having the hook in kvm_gmem_get_pfn() sort of guarantees we wouldn't hit issues with this.

-Mike

> > Thanks!
> >
> > -Mike
> >
> > [1] https://lore.kernel.org/linux-coco/20221214194056.161492-7-michael.roth@amd.com/T/#m45a1af063aa5ac0b9314d6a7d46eecb1253bba7a
> >
> > > [1] https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
> > > [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
> > > [3] https://lore.kernel.org/linux-mm/20200522125214.31348-1-kirill.shutemov@linux.intel.com
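Concretely, the hook placement being described would look something like the following (sketch only; kvm_arch_gmem_prepare() is a placeholder name, and the surrounding details are guesses at the WIP tree rather than its actual code):

  int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
                       gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
  {
          pgoff_t index = gfn - slot->base_gfn + slot->gmem.index;
          struct file *file = slot->gmem.file;
          struct folio *folio;
          int r;

          folio = kvm_gmem_get_folio(file_inode(file), index);
          if (!folio)
                  return -ENOMEM;

          *pfn = folio_pfn(folio) + (index & (folio_nr_pages(folio) - 1));
          *max_order = folio_order(folio);

          /*
           * Arch hook: SNP does an RMP lookup and, if needed, an RMPUPDATE
           * to mark the pfn private for this gfn before it's mapped into
           * the guest.
           */
          r = kvm_arch_gmem_prepare(kvm, gfn, *pfn, *max_order);
          if (r)
                  folio_put(folio);
          return r;
  }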
On Mon, May 22, 2023, Michael Roth wrote:
> On Fri, May 12, 2023 at 11:01:10AM -0700, Sean Christopherson wrote:
> > On Thu, May 11, 2023, Michael Roth wrote:
>
> I put together a tree with some fixups that are needed against the kvm_gmem_solo base tree, and a set of hooks to handle invalidations, preparing the initial private state as suggested above, and a platform-configurable mask that the x86 MMU code can use for determining whether a fault is for private vs. shared pages.
>
>   KVM: x86: Determine shared/private faults using a configurable mask
>    ^ for TDX we could trivially add an inverted analogue of the mask/logic
>   KVM: x86: Use full 64-bit error code for kvm_mmu_do_page_fault
>   KVM: x86: Add platform hooks for private memory invalidations

Hrm, I'd prefer to avoid adding another hook for this case, arch code already has a "hook" in the form of kvm_unmap_gfn_range(). We'd probably just need a kvm_gfn_range.is_private flag to communicate to arch/vendor code that the memory being zapped is private.

That'd leave a gap for the unbind() case because kvm_unmap_gfn_range() is invoked if and only if there's an overlapping memslot. I'll chew on that a bit to see if there's a way to cleanly handle that case without another hook. I think it's worth mapping out exactly what we want unbind() to look like anyways, e.g. right now the code subtly relies on private memslots being immutable.

> KVM: x86: Add platform hook for initializing private memory

This should also be unnecessary. The call to kvm_gmem_get_pfn() is from arch code, KVM just needs to ensure the RMP is converted before acquiring mmu_lock, e.g. KVM has all the necessary info in kvm_tdp_mmu_page_fault().

The only reason to add another arch hook would be if we wanted to convert the RMP when _allocating_, e.g. to preconvert in response to fallocate() instead of waiting until #NPF. But I think I would rather add a generic ioctl() to allow userspace to effectively prefault guest memory, e.g. to set up the RMP before running a vCPU. Such an ioctl() would potentially be useful in other scenarios, e.g. on the dest during live migration to reduce jitter.

> *fixup (kvm_gmem_solo): KVM: Fix end range calculation for MMU invalidations

There was another bug in this path. The math for handling a non-zero offset into the file was wrong. The code now looks like:

	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
		struct kvm_gfn_range gfn_range = {
			.start = slot->base_gfn + start - slot->gmem.index,
			.end = slot->base_gfn + min(end - slot->gmem.index, slot->npages),
			.slot = slot,
			.pte = __pte(0),
			.may_block = true,
		};

		if (WARN_ON_ONCE(start < slot->gmem.index ||
				 end > slot->gmem.index + slot->npages))
			continue;

		kvm_mmu_invalidate_range_add(kvm, gfn_range.start, gfn_range.end);

		flush |= kvm_unmap_gfn_range(kvm, &gfn_range);
	}
On Mon, May 22, 2023 at 10:09:40AM -0700, Sean Christopherson wrote:
> On Mon, May 22, 2023, Michael Roth wrote:
> > On Fri, May 12, 2023 at 11:01:10AM -0700, Sean Christopherson wrote:
> > > On Thu, May 11, 2023, Michael Roth wrote:
> >
> > I put together a tree with some fixups that are needed against the kvm_gmem_solo base tree, and a set of hooks to handle invalidations, preparing the initial private state as suggested above, and a platform-configurable mask that the x86 MMU code can use for determining whether a fault is for private vs. shared pages.
> >
> >   KVM: x86: Determine shared/private faults using a configurable mask
> >    ^ for TDX we could trivially add an inverted analogue of the mask/logic
> >   KVM: x86: Use full 64-bit error code for kvm_mmu_do_page_fault
> >   KVM: x86: Add platform hooks for private memory invalidations
>
> Hrm, I'd prefer to avoid adding another hook for this case, arch code already has a "hook" in the form of kvm_unmap_gfn_range(). We'd probably just need a kvm_gfn_range.is_private flag to communicate to arch/vendor code that the memory being zapped is private.

kvm_unmap_gfn_range() does however get called with kvm->mmu_lock held so it might be tricky to tie RMP updates into that path.

> That'd leave a gap for the unbind() case because kvm_unmap_gfn_range() is invoked if and only if there's an overlapping memslot. I'll chew on that a bit to see if there's a way to cleanly handle that case without another hook. I think it's worth mapping out exactly what we want unbind() to look like anyways, e.g. right now the code subtly relies on private memslots being immutable.

I thought the direction you were sort of driving at was to completely decouple RMP updates for physical pages from the KVM MMU map/unmap paths since the life-cycles of those backing pages and associated RMP state are somewhat separate from the state of the GFNs and kvm->mem_attr_array. It seems to make sense when dealing with things like this unbind() case.

There are also cases like userspaces that opt to not discard memory after conversions because they highly favor performance over memory usage. In those cases it would make sense to defer marking the pages as shared in the RMP until the FALLOC_FL_PUNCH_HOLE, rather than triggering it via the KVM MMU invalidation path after a conversion.

> > KVM: x86: Add platform hook for initializing private memory
>
> This should also be unnecessary. The call to kvm_gmem_get_pfn() is from arch code, KVM just needs to ensure the RMP is converted before acquiring mmu_lock, e.g. KVM has all the necessary info in kvm_tdp_mmu_page_fault().

I think that approach would work fine. The way I was thinking of things is that KVM MMU would necessarily call kvm_gmem_get_pfn() to grab the page before mapping it into the guest, so moving it out into an explicit call should work just as well. That would also drop the need for the __kvm_gmem_get_pfn() stuff I needed to add for the initial case where we need to access the PFN prior to making it private.

> The only reason to add another arch hook would be if we wanted to convert the RMP when _allocating_, e.g. to preconvert in response to fallocate() instead of waiting until #NPF. But I think I would rather add a generic ioctl() to allow userspace to effectively prefault guest memory, e.g. to set up the RMP before running a vCPU. Such an ioctl() would potentially be useful in other scenarios, e.g. on the dest during live migration to reduce jitter.

Agreed, deferring the RMPUPDATE until it's actually needed would give us more flexibility on optimizing for things like lazy-acceptance.

For less-common scenarios like preallocation it makes sense to make that an opt-in sort of thing for userspace to configure explicitly.

> > *fixup (kvm_gmem_solo): KVM: Fix end range calculation for MMU invalidations
>
> There was another bug in this path. The math for handling a non-zero offset into the file was wrong. The code now looks like:
>
> 	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
> 		struct kvm_gfn_range gfn_range = {
> 			.start = slot->base_gfn + start - slot->gmem.index,

Sorry if I'm missing something here, but isn't there a risk that:

  start - slot->gmem.index

would be less than zero? E.g. starting GFN was 0, but current slot is bound at some non-zero offset in the same gmem instance. I guess the warning below should've caught that, but it seems like a real scenario.

Since 'index' corresponds to the gmem offset of the current slot, is there any reason not to do something like this?:

  .start = slot->base_gfn + index - slot->gmem.index,

But then, if that's the case, wouldn't index == slot->gmem.index? Suggesting we can just simplify to this?:

  .start = slot->base_gfn,

-Mike

> 			.end = slot->base_gfn + min(end - slot->gmem.index, slot->npages),
> 			.slot = slot,
> 			.pte = __pte(0),
> 			.may_block = true,
> 		};
>
> 		if (WARN_ON_ONCE(start < slot->gmem.index ||
> 				 end > slot->gmem.index + slot->npages))
> 			continue;
>
> 		kvm_mmu_invalidate_range_add(kvm, gfn_range.start, gfn_range.end);
>
> 		flush |= kvm_unmap_gfn_range(kvm, &gfn_range);
> 	}
On Mon, May 22, 2023, Michael Roth wrote:
> On Mon, May 22, 2023 at 10:09:40AM -0700, Sean Christopherson wrote:
> > On Mon, May 22, 2023, Michael Roth wrote:
> > > On Fri, May 12, 2023 at 11:01:10AM -0700, Sean Christopherson wrote:
> > > > On Thu, May 11, 2023, Michael Roth wrote:
> > >
> > > I put together a tree with some fixups that are needed against the kvm_gmem_solo base tree, and a set of hooks to handle invalidations, preparing the initial private state as suggested above, and a platform-configurable mask that the x86 MMU code can use for determining whether a fault is for private vs. shared pages.
> > >
> > >   KVM: x86: Determine shared/private faults using a configurable mask
> > >    ^ for TDX we could trivially add an inverted analogue of the mask/logic
> > >   KVM: x86: Use full 64-bit error code for kvm_mmu_do_page_fault
> > >   KVM: x86: Add platform hooks for private memory invalidations
> >
> > Hrm, I'd prefer to avoid adding another hook for this case, arch code already has a "hook" in the form of kvm_unmap_gfn_range(). We'd probably just need a kvm_gfn_range.is_private flag to communicate to arch/vendor code that the memory being zapped is private.
>
> kvm_unmap_gfn_range() does however get called with kvm->mmu_lock held so it might be tricky to tie RMP updates into that path.

Gah, I caught the mmu_lock issue before the end of my email, but forgot to go back and rethink the first half.

> > That'd leave a gap for the unbind() case because kvm_unmap_gfn_range() is invoked if and only if there's an overlapping memslot. I'll chew on that a bit to see if there's a way to cleanly handle that case without another hook. I think it's worth mapping out exactly what we want unbind() to look like anyways, e.g. right now the code subtly relies on private memslots being immutable.
>
> I thought the direction you were sort of driving at was to completely decouple RMP updates for physical pages from the KVM MMU map/unmap paths since the life-cycles of those backing pages and associated RMP state are somewhat separate from the state of the GFNs and kvm->mem_attr_array. It seems to make sense when dealing with things like this unbind() case.
>
> There are also cases like userspaces that opt to not discard memory after conversions because they highly favor performance over memory usage. In those cases it would make sense to defer marking the pages as shared in the RMP until the FALLOC_FL_PUNCH_HOLE, rather than triggering it via the KVM MMU invalidation path after a conversion.

Hmm, right. I got overzealous in my desire to avoid new hooks.

> > > KVM: x86: Add platform hook for initializing private memory
> >
> > This should also be unnecessary. The call to kvm_gmem_get_pfn() is from arch code, KVM just needs to ensure the RMP is converted before acquiring mmu_lock, e.g. KVM has all the necessary info in kvm_tdp_mmu_page_fault().
>
> I think that approach would work fine. The way I was thinking of things is that KVM MMU would necessarily call kvm_gmem_get_pfn() to grab the page before mapping it into the guest, so moving it out into an explicit call should work just as well. That would also drop the need for the __kvm_gmem_get_pfn() stuff I needed to add for the initial case where we need to access the PFN prior to making it private.
>
> > The only reason to add another arch hook would be if we wanted to convert the RMP when _allocating_, e.g. to preconvert in response to fallocate() instead of waiting until #NPF. But I think I would rather add a generic ioctl() to allow userspace to effectively prefault guest memory, e.g. to set up the RMP before running a vCPU. Such an ioctl() would potentially be useful in other scenarios, e.g. on the dest during live migration to reduce jitter.
>
> Agreed, deferring the RMPUPDATE until it's actually needed would give us more flexibility on optimizing for things like lazy-acceptance.
>
> For less-common scenarios like preallocation it makes sense to make that an opt-in sort of thing for userspace to configure explicitly.
>
> > > *fixup (kvm_gmem_solo): KVM: Fix end range calculation for MMU invalidations
> >
> > There was another bug in this path. The math for handling a non-zero offset into the file was wrong. The code now looks like:
> >
> > 	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
> > 		struct kvm_gfn_range gfn_range = {
> > 			.start = slot->base_gfn + start - slot->gmem.index,
>
> Sorry if I'm missing something here, but isn't there a risk that:
>
>   start - slot->gmem.index
>
> would be less than zero? E.g. starting GFN was 0, but current slot is bound at some non-zero offset in the same gmem instance. I guess the warning below should've caught that, but it seems like a real scenario.

Heh, only if there's a testcase for it. Assuming start >= the slot offset does seem broken, e.g. if the range-to-invalidate overlaps multiple slots, later slots will have index==slot->gmem.index > start.

> Since 'index' corresponds to the gmem offset of the current slot, is there any reason not to do something like this?:
>
>   .start = slot->base_gfn + index - slot->gmem.index,
>
> But then, if that's the case, wouldn't index == slot->gmem.index? Suggesting we can just simplify to this?:
>
>   .start = slot->base_gfn,

No, e.g. if start is partway through a memslot, there's no need to invalidate the entire memslot. I'll stare at this tomorrow when my brain is hopefully a bit more functional, I suspect there is a min() and/or max() needed somewhere.
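To make the min()/max() idea concrete, one possible (untested) version clamps the invalidation to the intersection of the punched range and the slot's binding:

	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
		pgoff_t pgoff = slot->gmem.index;

		struct kvm_gfn_range gfn_range = {
			.start = slot->base_gfn + max(pgoff, start) - pgoff,
			.end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
			.slot = slot,
			.pte = __pte(0),
			.may_block = true,
		};

		kvm_mmu_invalidate_range_add(kvm, gfn_range.start, gfn_range.end);
		flush |= kvm_unmap_gfn_range(kvm, &gfn_range);
	}

This handles both a slot bound at a higher offset than 'start' and a slot whose binding ends before 'end', without needing the WARN/continue.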
On Tue, Jun 06, 2023, Ackerley Tng wrote:
>
> I've ported selftests from Chao and I [1] while working on hugetlb support for guest_mem [2].
>
> In the process, I found some bugs and have some suggestions for guest_mem. Please see separate commits at [3].
>
> Here are some highlights/questions:
>
> + "KVM: guest_mem: Explain the use of the uptodate flag for gmem"
>   + Generally, the uptodate flag means that the contents of this page match the backing store. Since gmem is memory-backed, does "uptodate" for gmem mean "zeroed"?

Don't read too much into the code, my POC was very much a "beat on it until it works" scenario.

> + "KVM: guest_mem: Don't re-mark accessed after getting a folio" and "KVM: guest_mem: Don't set dirty flag for folio"
>   + Do we need to folio_mark_accessed(), when it was created with FGP_ACCESSED?

Probably not. And as you note below, it's all pretty nonsensical anyways.

>   + What is the significance of these LRU flags when gmem doesn't support swapping/eviction?

Likely none. I used the filemap APIs in my POC because it was easy, not because it was necessarily the best approach, i.e. that the folios/pages show up in the LRUs is an unwanted side effect, not a feature. If guest_memfd only needs a small subset of the filemap support, going with a true from-scratch implementation on top of xarray might be cleaner overall, e.g. would avoid the need for a new flag to say "this folio can't be migrated even though it's on the LRUs".

> + "KVM: guest_mem: Align so that at least 1 page is allocated"
>   + Bug in current implementation: without this alignment, fallocate() of a size less than the gmem page size will result in no allocation at all

I'm not convinced this is a bug. I don't see any reason to allow allocating and punching holes in sub-page granularity.

>   + Both shmem and hugetlbfs perform this alignment
> + "KVM: guest_mem: Add alignment checks"
>   + Implemented the alignment checks for guest_mem because hugetlb on gmem would hit a BUG_ON without this check
> + "KVM: guest_mem: Prevent overflows in kvm_gmem_invalidate_begin()"
>   + Sean fixed a bug in the offset-to-gfn conversion in kvm_gmem_invalidate_begin() earlier, adding a WARN_ON_ONCE()

As Mike pointed out, there's likely still a bug here[*]. I was planning on diving into that last week, but that never happened. If you or anyone else can take a peek and/or write a testcase, that would be awesome.

: Heh, only if there's a testcase for it. Assuming start >= the slot offset does seem broken, e.g. if the range-to-invalidate overlaps multiple slots, later slots will have index==slot->gmem.index > start.
:
: > Since 'index' corresponds to the gmem offset of the current slot, is there any reason not to do something like this?:
: >
: >   .start = slot->base_gfn + index - slot->gmem.index,
: >
: > But then, if that's the case, wouldn't index == slot->gmem.index? Suggesting we can just simplify to this?:
: >
: >   .start = slot->base_gfn,
:
: No, e.g. if start is partway through a memslot, there's no need to invalidate the entire memslot. I'll stare at this tomorrow when my brain is hopefully a bit more functional, I suspect there is a min() and/or max() needed somewhere.

[*] https://lore.kernel.org/all/20230512002124.3sap3kzxpegwj3n2@amd.com

>   + Code will always hit WARN_ON_ONCE() when the entire file is closed and all offsets are invalidated, so WARN_ON_ONCE() should be removed
>   + Vishal noticed that the conversion might result in an overflow, so I fixed that
> + And of course, hugetlb support! Please let me know what you think of the approach proposed at [2].
>
> [1] https://lore.kernel.org/all/cover.1678926164.git.ackerleytng@google.com/T/
> [2] https://lore.kernel.org/lkml/cover.1686077275.git.ackerleytng@google.com/T/
> [3] https://github.com/googleprodkernel/linux-cc/tree/gmem-hugetlb-rfc-v1
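Regarding the "uptodate == zeroed" question above, the allocation path in the POC is roughly the following (simplified sketch; details may differ from the actual branch):

  static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
  {
          struct folio *folio;

          /* FGP_ACCESSED is implied here, hence no folio_mark_accessed(). */
          folio = filemap_grab_folio(inode->i_mapping, index);
          if (IS_ERR_OR_NULL(folio))
                  return NULL;

          if (!folio_test_uptodate(folio)) {
                  unsigned long nr_pages = folio_nr_pages(folio);
                  unsigned long i;

                  /* Zero the folio exactly once; "uptodate" just means "zeroed". */
                  for (i = 0; i < nr_pages; i++)
                          clear_highpage(folio_page(folio, i));

                  folio_mark_uptodate(folio);
          }

          /* Locking and error handling elided for brevity. */
          folio_unlock(folio);
          return folio;
  }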
On Thu, Jul 13, 2023, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
>
> > On Fri, May 05, 2023, Ackerley Tng wrote:
> >>
> >> Hi Sean,
> >>
> >> Thanks for implementing this POC!
> >>
> >> ... snip ...
> >>
> >
> > I don't love either idea because it means a file created in the context of a VM can outlive the VM itself, and then userspace ends up with a file descriptor that it can't do anything with except close(). I doubt that matters in practice though, e.g. when the VM dies, all memory can be freed so that the file ends up being little more than a shell. And if we go that route, there's no need to grab a reference to the file during bind, KVM can just grab a longterm reference when the file is initially created and then drop it when KVM dies (and nullifies gmem->kvm).
> >
> > ... snip ...
> >
> > My preference is to make it a VM-scoped ioctl(), if it ends up being a KVM ioctl() and not a common syscall. If the file isn't tightly coupled to a single VM, then punching a hole is further complicated by needing to deal with invalidating multiple regions that are bound to different @kvm instances. It's not super complex, but AFAICT having the ioctl() be system-scoped doesn't add value, e.g. I don't think having one VM own the memory will complicate things even if/when we get to the point where VMs can share "private" memory, and the gmem code would still need to deal with grabbing a module reference.
>
> I’d like to follow up on this discussion about a guest_mem file outliving the VM and whether to have a VM-scoped ioctl or a KVM ioctl.
>
> Here's a POC of delayed binding of a guest_mem file to a memslot, where the guest_mem file outlives the VM [1].
>
> I also hope to raise some points before we do the first integration of guest_mem patches!
>
> A use case for guest_mem inodes outliving the VM is when the host VMM needs to be upgraded.

Translation: to support intra-host migration, a.k.a. KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM

> The guest_mem inode is passed between two VMs on the same host machine and all memory associated with the inode needs to be retained.
>
> To support the above use case, binding of memslots is delayed until first use, so that the following inode passing flow can be used:
>
> 1. Source (old version of host VMM) process passes guest_mem inodes to destination (new version of host VMM) process via unix sockets.
> 2. Destination process initializes memslots identical to source process.
> 3. Destination process invokes ioctl to migrate guest_mem inode over to destination process by unbinding all memslots from the source VM and binding them to the destination VM. (The kvm pointer is updated in this process)
>
> Without delayed binding, step 2 will fail since initialization of memslots would check and find that the kvm pointer in the guest_mem inode points to the kvm in the source process.
>
> These two patches contain the meat of the changes required to support delayed binding:
>
> https://github.com/googleprodkernel/linux-cc/commit/93b31a006ef2e4dbe1ef0ec5d2534ca30f3bf60c
> https://github.com/googleprodkernel/linux-cc/commit/dd5ac5e53f14a1ef9915c9c1e4cc1006a40b49df
>
> Some things to highlight for the approach set out in these two patches:
>
> 1. Previously, closing the guest_mem file in userspace is taken to mean that all associated memory is to be removed and cleared.
> With these two patches, each memslot also holds a reference to the file (and therefore inode) and so even if the host VMM closes the fd, the VM will be able to continue to function.
>
> This is desirable to userspace since closing the file should not be interpreted as a command to clear memory.

100% agreed. However, after more thought since we first discussed this, I think that a deferred binding is the wrong way to solve this particular problem. More below.

> This is aligned with the way tmpfs files are used with KVM before guest_mem: when the file is closed in userspace, the memory contents are still mapped and can still be used by the VM. fallocate(PUNCH_HOLE) is how userspace should command memory to be removed, just like munmap() would be used to remove memory from use by KVM.
>
> 2. Creating a guest mem file no longer depends on a specific VM and hence the guest_mem creation ioctl can be a system ioctl instead of a VM specific ioctl. This will also address Chao's concern at [3].

That concern is a non-issue for QEMU as memory backends are created after accelerators are initialized, and AFAICT Google's VMM behaves similarly.

And _if_ there is a VMM that instantiates memory before KVM_CREATE_VM, IMO making the ioctl() /dev/kvm scoped would have no meaningful impact on adapting userspace to play nice with the required ordering. If userspace can get at /dev/kvm, then it can do KVM_CREATE_VM, because the only input to KVM_CREATE_VM is the type, i.e. the only dependencies for KVM_CREATE_VM should be known/resolved long before the VMM knows it wants to use gmem.

Using a non-KVM syscall would eliminate any such dependencies, but in my very strong opinion, that is not a good reason to go with a syscall.

> I also separated cleaning up files vs inodes in https://github.com/googleprodkernel/linux-cc/commit/0f5aa18910c515141e57e05c4cc791022047a242, which I believe is more aligned with how files and inodes are cleaned up in FSes.

I didn't take these, though I am in the process of incorporating parts of the underlying feedback (see below).

> This alignment makes it easier to extend gmem to hugetlb, for one.

I am not convinced that utilizing hugetlb is the best way to provide 1GiB support in gmem. I'm not necessarily against it, but there's a fair bit of uncertainty around the future of hugetlb, and there are fundamental aspects of hugetlb that may be non-goals for the gmem use case, e.g. mapping the memory with 1GiB pages in the userspace page tables likely isn't necessary, and might even be undesirable.

And practically speaking, the changes aren't _that_ invasive, i.e. punting any necessary refactoring should not substantially increase the size/complexity of hugetlb support (*if* we end up adding it).

That said, I do think we'll implement .evict_inode() very early on in order to support SNP and TDX, because (in addition to PUNCH_HOLE) that's when the backing memory will be freed and thus reclaimed, i.e. unassigned in the RMP (SNP) / zeroed with the shared key ID (TDX).

> While working on this, I was also wondering if we should perhaps be storing the inode pointer in slot->gmem instead of the file pointer? The memory is associated with an inode->mapping rather than the file. Are we binding to a userspace handle on the inode (store file pointer), or are we really referencing the inode (store inode pointer)?

Conceptually, I think KVM should bind to the file.
The inode is effectively the raw underlying physical storage, while the file is the VM's view of that storage.

Practically, I think that gives us a clean, intuitive way to handle intra-host migration. Rather than transfer ownership of the file, instantiate a new file for the target VM, using the gmem inode from the source VM, i.e. create a hard link. That'd probably require new uAPI, but I don't think that will be hugely problematic. KVM would need to ensure the new VM's guest_memfd can't be mapped until KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM (which would also need to verify the memslots/bindings are identical), but that should be easy enough to enforce.

That way, a VM, its memslots, and its SPTEs are tied to the file, while allowing the memory and the *contents* of memory to outlive the VM, i.e. be effectively transferred to the new target VM. And we'll maintain the invariant that each guest_memfd is bound 1:1 with a single VM.

As above, that should also help us draw the line between mapping memory into a VM (file), and freeing/reclaiming the memory (inode).

There will be extra complexity/overhead as we'll have to play nice with the possibility of multiple files per inode, e.g. to zap mappings across all files when punching a hole, but the extra complexity is quite small, e.g. we can use address_space.private_list to keep track of the guest_memfd instances associated with the inode.

Setting aside TDX and SNP for the moment, as it's not clear how they'll support memory that is "private" but shared between multiple VMs, I think per-VM files would work well for sharing gmem between two VMs. E.g. that would allow a given page to be bound to a different gfn for each VM, and would allow having different permissions for each file (e.g. to allow fallocate() only from the original owner).
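To make the private_list idea concrete, a strawman of hole punching that walks all files (i.e. all VMs' views) of the inode, assuming each kvm_gmem instance is linked into mapping->private_list at creation (illustrative only, not from any posted series):

  static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
  {
          struct list_head *gmem_list = &inode->i_mapping->private_list;
          pgoff_t start = offset >> PAGE_SHIFT;
          pgoff_t end = (offset + len) >> PAGE_SHIFT;
          struct kvm_gmem *gmem;

          /*
           * Zap mappings in every VM's view (file) of the inode, not just
           * the file that fallocate() was invoked on.
           */
          filemap_invalidate_lock(inode->i_mapping);

          list_for_each_entry(gmem, gmem_list, entry)
                  kvm_gmem_invalidate_begin(gmem->kvm, gmem, start, end);

          truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);

          list_for_each_entry(gmem, gmem_list, entry)
                  kvm_gmem_invalidate_end(gmem->kvm, gmem, start, end);

          filemap_invalidate_unlock(inode->i_mapping);
          return 0;
  }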
On Fri, Jul 14, 2023 at 12:29 PM Sean Christopherson <seanjc@google.com> wrote:
> ...
> And _if_ there is a VMM that instantiates memory before KVM_CREATE_VM, IMO making the ioctl() /dev/kvm scoped would have no meaningful impact on adapting userspace to play nice with the required ordering. If userspace can get at /dev/kvm, then it can do KVM_CREATE_VM, because the only input to KVM_CREATE_VM is the type, i.e. the only dependencies for KVM_CREATE_VM should be known/resolved long before the VMM knows it wants to use gmem.

I am not sure about the benefits of tying gmem creation to any given kvm instance. I think the most important requirement here is that a given gmem range is always tied to a single VM. This can be enforced when memslots are bound to the gmem files.

I believe the "required ordering" is that gmem files are created first and then supplied while creating the memslots whose gpa ranges can generate private memory accesses. Is there any other ordering we want to enforce here?

> ...
> Practically, I think that gives us a clean, intuitive way to handle intra-host migration. Rather than transfer ownership of the file, instantiate a new file for the target VM, using the gmem inode from the source VM, i.e. create a hard link. That'd probably require new uAPI, but I don't think that will be hugely problematic. KVM would need to ensure the new VM's guest_memfd can't be mapped until KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM (which would also need to verify the memslots/bindings are identical), but that should be easy enough to enforce.
>
> That way, a VM, its memslots, and its SPTEs are tied to the file, while allowing the memory and the *contents* of memory to outlive the VM, i.e. be effectively transferred to the new target VM. And we'll maintain the invariant that each guest_memfd is bound 1:1 with a single VM.
>
> As above, that should also help us draw the line between mapping memory into a VM (file), and freeing/reclaiming the memory (inode).
>
> There will be extra complexity/overhead as we'll have to play nice with the possibility of multiple files per inode, e.g. to zap mappings across all files when punching a hole, but the extra complexity is quite small, e.g. we can use address_space.private_list to keep track of the guest_memfd instances associated with the inode.

Are we talking about a different usecase of sharing gmem fd across VMs other than intra-host migration? If not, ideally only one of the files should be catering to the guest memory mappings at any given time. i.e. any inode should ideally be bound to (through the file) a single kvm instance, as we are planning to ensure that guest_memfd can't be mapped until KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM is invoked on the target side.
On Fri, Jul 14, 2023, Vishal Annapurve wrote:
> On Fri, Jul 14, 2023 at 12:29 PM Sean Christopherson <seanjc@google.com> wrote:
> > ...
> > And _if_ there is a VMM that instantiates memory before KVM_CREATE_VM, IMO making the ioctl() /dev/kvm scoped would have no meaningful impact on adapting userspace to play nice with the required ordering. If userspace can get at /dev/kvm, then it can do KVM_CREATE_VM, because the only input to KVM_CREATE_VM is the type, i.e. the only dependencies for KVM_CREATE_VM should be known/resolved long before the VMM knows it wants to use gmem.
>
> I am not sure about the benefits of tying gmem creation to any given kvm instance.

IMO, making gmem->kvm immutable is very nice to have, e.g. gmem->kvm will always be valid and the refcounting rules are fairly straightforward.

> I think the most important requirement here is that a given gmem range is always tied to a single VM.

I'm not convinced that that requirement will always hold true (see below).

> This can be enforced when memslots are bound to the gmem files.

Yeah, but TBH, waiting until the guest faults in memory to detect an invalid memslot is gross. And looking more closely, taking filemap_invalidate_lock(), i.e. taking a semaphore for write, in the page fault path is a complete non-starter. The "if (existing_slot == slot)" check is likely a non-starter, because KVM handles FLAGS_ONLY memslot updates, e.g. toggling dirty logging, by duplicating and replacing the memslot, not by updating the live memslot.

> I believe the "required ordering" is that gmem files are created first and then supplied while creating the memslots whose gpa ranges can generate private memory accesses. Is there any other ordering we want to enforce here?

I wasn't talking about enforcing arbitrary ordering, I was simply talking about what userspace literally needs to be able to do KVM_CREATE_GUEST_MEMFD.

> > Practically, I think that gives us a clean, intuitive way to handle intra-host migration. Rather than transfer ownership of the file, instantiate a new file for the target VM, using the gmem inode from the source VM, i.e. create a hard link. That'd probably require new uAPI, but I don't think that will be hugely problematic. KVM would need to ensure the new VM's guest_memfd can't be mapped until KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM (which would also need to verify the memslots/bindings are identical), but that should be easy enough to enforce.
> >
> > That way, a VM, its memslots, and its SPTEs are tied to the file, while allowing the memory and the *contents* of memory to outlive the VM, i.e. be effectively transferred to the new target VM. And we'll maintain the invariant that each guest_memfd is bound 1:1 with a single VM.
> >
> > As above, that should also help us draw the line between mapping memory into a VM (file), and freeing/reclaiming the memory (inode).
> >
> > There will be extra complexity/overhead as we'll have to play nice with the possibility of multiple files per inode, e.g. to zap mappings across all files when punching a hole, but the extra complexity is quite small, e.g. we can use address_space.private_list to keep track of the guest_memfd instances associated with the inode.
>
> Are we talking about a different usecase of sharing gmem fd across VMs other than intra-host migration?

Well, I am :-) I don't want to build all of this on an assumption that we'll never ever want to share a guest_memfd across multiple VMs. E.g. SEV (and SEV-ES?) already have the migration helper concept, and I've heard more than a few rumblings of TDX utilizing helper TDs. IMO, it's not far fetched at all to think that there will eventually be a need to let multiple VMs share a guest_memfd.

> If not, ideally only one of the files should be catering to the guest memory mappings at any given time. i.e. any inode should ideally be bound to (through the file) a single kvm instance,

Why? Honest question, what does it buy us?

For TDX and SNP intra-host migration, it should be easy enough to ensure the new VM can't create mappings before migration, and that the old VM can't create mappings or run after migration. I don't see that being any harder if the source and dest use different files.

FWIW, it might be easier to hold off on this discussion until I post the RFC (which is going to happen on Monday at this point), as then we'll have actual code to discuss.

> as we are planning to ensure that guest_memfd can't be mapped until KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM is invoked on the target side.