Message ID: cover.1709288671.git.isaku.yamahata@intel.com
Series: KVM: Prepopulate guest memory API
On 2024-03-01 09:28 AM, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> Implementation:
> - x86 KVM MMU
>   In x86 KVM MMU, I chose to use kvm_mmu_do_page_fault(). It's not confined to
>   KVM TDP MMU. We can restrict it to KVM TDP MMU and introduce an optimized
>   version.

Restricting to TDP MMU seems like a good idea. But I'm not quite sure
how to reliably do that from a vCPU context. Checking for TDP being
enabled is easy, but what if the vCPU is in guest-mode?

Perhaps we can just return an error out to userspace if the vCPU is in
guest-mode or TDP is disabled, and make it userspace's problem to do
memory mapping before loading any vCPU state.
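For illustration, here is a minimal sketch of the early-out described above. The function name kvm_arch_vcpu_pre_map_memory() and its placement are assumptions made for this sketch; is_guest_mode() and tdp_enabled are existing KVM x86 helpers/variables, but wiring them up this way is not code from the posted series.

  /*
   * Illustrative fragment only (assumes KVM-internal headers): reject
   * pre-mapping when the vCPU state makes the request ambiguous, and let
   * userspace order its ioctls so mapping happens before vCPU state is loaded.
   */
  static int kvm_arch_vcpu_pre_map_memory(struct kvm_vcpu *vcpu)
  {
          /* tdp_enabled: x86-wide "TDP (EPT/NPT) in use" flag. */
          if (!tdp_enabled)
                  return -EOPNOTSUPP;

          /* Mapping L1 GPAs while L2 is active would be ambiguous. */
          if (is_guest_mode(vcpu))
                  return -EOPNOTSUPP;

          return 0;
  }

With a guard like this, userspace that issues the ioctl before KVM_RUN and before any nested state is loaded would never see the error in practice.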
On Wed, Mar 06, 2024 at 04:53:41PM -0800, David Matlack <dmatlack@google.com> wrote:
> On 2024-03-01 09:28 AM, isaku.yamahata@intel.com wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> >
> > Implementation:
> > - x86 KVM MMU
> >   In x86 KVM MMU, I chose to use kvm_mmu_do_page_fault(). It's not confined to
> >   KVM TDP MMU. We can restrict it to KVM TDP MMU and introduce an optimized
> >   version.
>
> Restricting to TDP MMU seems like a good idea. But I'm not quite sure
> how to reliably do that from a vCPU context. Checking for TDP being
> enabled is easy, but what if the vCPU is in guest-mode?

As you pointed out in other mail, legacy KVM MMU support or guest-mode will be
troublesome. The use case I assumed is pre-population before the guest runs,
so guest-mode wouldn't matter. I didn't add an explicit check for it, though.

Is there any use case while vCPUs are running?

> Perhaps we can just return an error out to userspace if the vCPU is in
> guest-mode or TDP is disabled, and make it userspace's problem to do
> memory mapping before loading any vCPU state.

If the use case for a default VM or sw-protected VM is to avoid excessive KVM
page faults at guest boot, erroring out on guest-mode or disabled TDP wouldn't
matter.
On Fri, Mar 01, 2024 at 09:28:42AM -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> The objective of this RFC patch series is to develop a uAPI aimed at
> (pre)populating guest memory for various use cases and underlying VM
> technologies.
>
> - Pre-populate guest memory to mitigate excessive KVM page faults during guest
>   boot [1], a need not limited to any specific technology.
>
> - Pre-populating guest memory (including encryption and measurement) for
>   confidential guests [2]. SEV-SNP, TDX, and SW-PROTECTED VM. Potentially
>   other technologies and pKVM.
>
> The patches are organized as follows.
> - 1: documentation on the uAPI KVM_MAP_MEMORY.
> - 2: architecture-independent implementation part.
> - 3-4: refactoring of x86 KVM MMU as preparation.
> - 5: x86 helper function to map a guest page.
> - 6: x86 KVM arch implementation.
> - 7: Add x86-ops necessary for TDX and SEV-SNP.
> - 8: selftest for validation.
>
> Discussion points:
>
> uAPI design:
> - access flags
>   Access flags are needed for the guest memory population. We have options for
>   their exposure to the uAPI.
>   - option 1. Introduce access flags, possibly with the addition of a private
>     access flag.
>   - option 2. Omit access flags from the uAPI.
>     Allow the kernel to deduce the necessary flag based on the memory slot and
>     its memory attributes.
>
> - SEV-SNP and byte vs. page size
>   The SEV correspondence is SEV_LAUNCH_UPDATE_DATA, which dictates that memory
>   regions be 16-byte aligned, not page aligned. Should we define struct
>   kvm_memory_mapping in bytes rather than pages?

For SNP it's multiples of page size only, and I'm not sure it's worth trying
to work in legacy SEV support, since that pulls in other requirements, like
how the backing memory also needs to be pre-pinned via a separate KVM ioctl
that would need to be documented as an SEV-specific requirement for this
interface... it just doesn't seem worth it. And SEV would still benefit from
the more basic functionality of pre-mapping pages just like any other guest,
so it still benefits in that regard.

That said, it would be a shame if we needed to clutter up the API as soon as
some user came along that required non-page-sized values. That seems unlikely
for the pre-mapping use case, but for CoCo maybe it's worth checking whether
pKVM would have any requirements like that? Or just going with byte-size
parameters to play it safe?

>     struct kvm_sev_launch_update_data {
>             __u64 uaddr;
>             __u32 len;
>     };
>
> - TDX and measurement
>   The TDX correspondence is TDH.MEM.PAGE.ADD and TDH.MR.EXTEND. TDH.MR.EXTEND
>   extends the measurement with the page contents.
>   Option 1. Add an additional flag like KVM_MEMORY_MAPPING_FLAG_EXTEND to issue
>   TDH.MR.EXTEND.
>   Option 2. Don't handle extend. Let the TDX vendor-specific API
>   KVM_MEMORY_ENCRYPT_OP handle it with a subcommand like KVM_TDX_EXTEND_MEMORY.

For SNP this happens unconditionally via SNP_LAUNCH_UPDATE, and with some
additional measurements via SNP_LAUNCH_FINISH, and down the road when live
migration support is added that flow will be a bit different. So personally I
think it's better to leave it separate for now. Maybe down the road some
common measurement/finalize interface can deprecate some of the other
MEMORY_ENCRYPT ioctls.

Even with this more narrowly-defined purpose, it's really unfortunate we have
to bake any CoCo stuff into this interface at all. It would be great if all
it did was simply pre-map entries into nested page tables to boost boot
performance, and we had some separate CoCo-specific API to handle initial
loading/encryption of guest data. I understand with SecureEPT considerations
why we sort of need it here for TDX, but it already ends up being awkward for
the SNP_LAUNCH_UPDATE use-case, because there's not really any good reason
for that handling to live inside KVM MMU hooks like with TDX. So we'll
probably end up putting it in a pre/post hook where all the handling is
completely separate from the TDX flow, and in the process complicating the
userspace API with the additional parameters needed for that, even though
other guest types are likely to never need them.

(Alternatively we can handle SNP_LAUNCH_UPDATE via KVM MMU hooks like with
tdx_mem_page_add(), but then we'll need to plumb additional parameters down
into the KVM MMU code for stuff like the SNP page type. But maybe it's worth
it to have some level of commonality for x86 at least?)

But I'd be hesitant to bake more requirements into this pre-mapping
interface; it feels like we're already overloading it as is.

> - TDX and struct kvm_memory_mapping::source
>   While the current patch series doesn't utilize the
>   kvm_memory_mapping::source member, TDX needs it to specify the source of
>   memory contents.

SNP will need it too FWIW.

-Mike
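On the byte-vs-page question, a byte-sized uAPI can still serve page-granular backends by validating alignment at the boundary. A minimal sketch, with names invented purely for illustration and a 4KiB page size assumed:

  #include <errno.h>
  #include <stdint.h>

  #define GUEST_PAGE_SIZE 4096ULL   /* assumption: 4KiB guest pages */

  /*
   * Sketch (not from the posted series): convert a byte-based (base, size)
   * request into page units, rejecting anything a page-granular backend
   * (SNP, plain TDP pre-mapping) can't honor.
   */
  static int bytes_to_page_range(uint64_t base, uint64_t size,
                                 uint64_t *first_page, uint64_t *nr_pages)
  {
          if (!size || (base | size) % GUEST_PAGE_SIZE)
                  return -EINVAL;

          *first_page = base / GUEST_PAGE_SIZE;
          *nr_pages = size / GUEST_PAGE_SIZE;
          return 0;
  }

That is, choosing byte-sized fields in the uAPI would mostly cost a check like this in page-only backends, while leaving room for a future sub-page user.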
On Sun, Mar 10, 2024, Michael Roth wrote:
> On Fri, Mar 01, 2024 at 09:28:42AM -0800, isaku.yamahata@intel.com wrote:
> >     struct kvm_sev_launch_update_data {
> >             __u64 uaddr;
> >             __u32 len;
> >     };
> >
> > - TDX and measurement
> >   The TDX correspondence is TDH.MEM.PAGE.ADD and TDH.MR.EXTEND. TDH.MR.EXTEND
> >   extends the measurement with the page contents.
> >   Option 1. Add an additional flag like KVM_MEMORY_MAPPING_FLAG_EXTEND to issue
> >   TDH.MR.EXTEND.
> >   Option 2. Don't handle extend. Let the TDX vendor-specific API
> >   KVM_MEMORY_ENCRYPT_OP handle it with a subcommand like KVM_TDX_EXTEND_MEMORY.
>
> For SNP this happens unconditionally via SNP_LAUNCH_UPDATE, and with some
> additional measurements via SNP_LAUNCH_FINISH, and down the road when live
> migration support is added that flow will be a bit different. So
> personally I think it's better to leave it separate for now.

+1. The only reason to do EXTEND at the same time as PAGE.ADD would be to
optimize setups that want the measurement to be extended with the contents of
a page immediately after the measurement is extended with the mapping metadata
for said page. And AFAIK, the only reason to prefer that approach is for
backwards compatibility, which is not a concern for KVM. I suppose maaaybe
some memory locality performance benefits, but that seems like a stretch.

<time passes>

And I think this whole conversation is moot, because I don't think there's a
need to do PAGE.ADD during KVM_MAP_MEMORY[*]. If KVM_MAP_MEMORY does only the
SEPT.ADD side of things, then both @source (PAGE.ADD) and the EXTEND flag go
away.

> But I'd be hesitant to bake more requirements into this pre-mapping
> interface, it feels like we're already overloading it as is.

Agreed. After being able to think more about this ioctl(), I think
KVM_MAP_MEMORY should be as "pure" of a mapping operation as we can make it.
It'd be a little weird that using KVM_MAP_MEMORY is required for TDX VMs, but
not other VMs. But that's really just a reflection of S-EPT, so it's arguably
not even a bad thing.

[*] https://lore.kernel.org/all/Ze-TJh0BBOWm9spT@google.com
On Mon, Mar 11, 2024 at 04:44:27PM -0700, Sean Christopherson <seanjc@google.com> wrote:
> On Sun, Mar 10, 2024, Michael Roth wrote:
> > On Fri, Mar 01, 2024 at 09:28:42AM -0800, isaku.yamahata@intel.com wrote:
> > >     struct kvm_sev_launch_update_data {
> > >             __u64 uaddr;
> > >             __u32 len;
> > >     };
> > >
> > > - TDX and measurement
> > >   The TDX correspondence is TDH.MEM.PAGE.ADD and TDH.MR.EXTEND. TDH.MR.EXTEND
> > >   extends the measurement with the page contents.
> > >   Option 1. Add an additional flag like KVM_MEMORY_MAPPING_FLAG_EXTEND to issue
> > >   TDH.MR.EXTEND.
> > >   Option 2. Don't handle extend. Let the TDX vendor-specific API
> > >   KVM_MEMORY_ENCRYPT_OP handle it with a subcommand like KVM_TDX_EXTEND_MEMORY.
> >
> > For SNP this happens unconditionally via SNP_LAUNCH_UPDATE, and with some
> > additional measurements via SNP_LAUNCH_FINISH, and down the road when live
> > migration support is added that flow will be a bit different. So
> > personally I think it's better to leave it separate for now.
>
> +1. The only reason to do EXTEND at the same time as PAGE.ADD would be to
> optimize setups that want the measurement to be extended with the contents of a
> page immediately after the measurement is extended with the mapping metadata for
> said page. And AFAIK, the only reason to prefer that approach is for backwards
> compatibility, which is not a concern for KVM. I suppose maaaybe some memory
> locality performance benefits, but that seems like a stretch.
>
> <time passes>
>
> And I think this whole conversation is moot, because I don't think there's a need
> to do PAGE.ADD during KVM_MAP_MEMORY[*]. If KVM_MAP_MEMORY does only the SEPT.ADD
> side of things, then both @source (PAGE.ADD) and the EXTEND flag go away.
>
> > But I'd be hesitant to bake more requirements into this pre-mapping
> > interface, it feels like we're already overloading it as is.
>
> Agreed. After being able to think more about this ioctl(), I think KVM_MAP_MEMORY
> should be as "pure" of a mapping operation as we can make it. It'd be a little
> weird that using KVM_MAP_MEMORY is required for TDX VMs, but not other VMs. But
> that's really just a reflection of S-EPT, so it's arguably not even a bad thing.
>
> [*] https://lore.kernel.org/all/Ze-TJh0BBOWm9spT@google.com

Let me try removing source from struct kvm_memory_mapping. With the unit in
bytes instead of pages, it will be

  struct kvm_memory_mapping {
          __u64 base_address;
          __u64 size;
          __u64 flags;
  };

SNP won't have any changes. Always error for KVM_MAP_MEMORY for SNP?
(I'll leave it to Roth.) TDX will have TDX_INIT_MEM_REGION with a new
implementation.
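To make the proposed byte-based layout concrete, here is a hedged sketch of how userspace might drive it before the first KVM_RUN. The KVM_MAP_MEMORY ioctl number, the _IOWR encoding, and the partial-progress/retry behavior are assumptions based on this thread, not final ABI; content initialization for TDX would still go through the separate vendor ioctl (TDX_INIT_MEM_REGION) as discussed above.

  #include <stdint.h>
  #include <sys/ioctl.h>

  /* Proposed in this thread; not (yet) part of <linux/kvm.h>. */
  struct kvm_memory_mapping {
          uint64_t base_address;  /* guest physical address, in bytes */
          uint64_t size;          /* length, in bytes */
          uint64_t flags;
  };
  #define KVM_MAP_MEMORY \
          _IOWR(0xAE /* KVMIO */, 0xd5 /* placeholder number */, struct kvm_memory_mapping)

  /* Pre-populate [gpa, gpa + len) on a vCPU fd before the first KVM_RUN. */
  static int prepopulate(int vcpu_fd, uint64_t gpa, uint64_t len)
  {
          struct kvm_memory_mapping map = {
                  .base_address = gpa,
                  .size = len,
                  .flags = 0,
          };

          /*
           * Assumption: on partial progress the kernel advances base_address
           * and shrinks size, so userspace simply retries until size is zero.
           */
          while (map.size) {
                  if (ioctl(vcpu_fd, KVM_MAP_MEMORY, &map) < 0)
                          return -1;
          }
          return 0;
  }

The retry loop is the main design consequence of a byte-based struct: the kernel can stop at any page boundary (signal pending, need to reschedule) and report how far it got through the same fields.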
On Wed, Mar 06, 2024 at 06:09:54PM -0800, Isaku Yamahata <isaku.yamahata@linux.intel.com> wrote:
> On Wed, Mar 06, 2024 at 04:53:41PM -0800,
> David Matlack <dmatlack@google.com> wrote:
>
> > On 2024-03-01 09:28 AM, isaku.yamahata@intel.com wrote:
> > > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > >
> > > Implementation:
> > > - x86 KVM MMU
> > >   In x86 KVM MMU, I chose to use kvm_mmu_do_page_fault(). It's not confined to
> > >   KVM TDP MMU. We can restrict it to KVM TDP MMU and introduce an optimized
> > >   version.
> >
> > Restricting to TDP MMU seems like a good idea. But I'm not quite sure
> > how to reliably do that from a vCPU context. Checking for TDP being
> > enabled is easy, but what if the vCPU is in guest-mode?
>
> As you pointed out in other mail, legacy KVM MMU support or guest-mode will be
> troublesome. The use case I assumed is pre-population before the guest runs,
> so guest-mode wouldn't matter. I didn't add an explicit check for it, though.
>
> Is there any use case while vCPUs are running?
>
> > Perhaps we can just return an error out to userspace if the vCPU is in
> > guest-mode or TDP is disabled, and make it userspace's problem to do
> > memory mapping before loading any vCPU state.
>
> If the use case for a default VM or sw-protected VM is to avoid excessive KVM
> page faults at guest boot, erroring out on guest-mode or disabled TDP wouldn't
> matter.

Any input? If there is no further input, I assume the primary use case is
pre-population before the guest runs.
On Tue, Mar 19, 2024, Isaku Yamahata wrote:
> On Wed, Mar 06, 2024 at 06:09:54PM -0800,
> Isaku Yamahata <isaku.yamahata@linux.intel.com> wrote:
>
> > On Wed, Mar 06, 2024 at 04:53:41PM -0800,
> > David Matlack <dmatlack@google.com> wrote:
> >
> > > On 2024-03-01 09:28 AM, isaku.yamahata@intel.com wrote:
> > > > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > > >
> > > > Implementation:
> > > > - x86 KVM MMU
> > > >   In x86 KVM MMU, I chose to use kvm_mmu_do_page_fault(). It's not confined to
> > > >   KVM TDP MMU. We can restrict it to KVM TDP MMU and introduce an optimized
> > > >   version.
> > >
> > > Restricting to TDP MMU seems like a good idea. But I'm not quite sure
> > > how to reliably do that from a vCPU context. Checking for TDP being
> > > enabled is easy, but what if the vCPU is in guest-mode?
> >
> > As you pointed out in other mail, legacy KVM MMU support or guest-mode will be
> > troublesome.

Why is shadow paging troublesome? I don't see any obvious issues with
effectively prefetching into a shadow MMU with read fault semantics. It might
be pointless and wasteful, as the guest PTEs need to be in place, but that's
userspace's problem.

Testing is the biggest gap I see, as using the ioctl() for shadow paging will
essentially require a live guest, but that doesn't seem like it'd be too hard
to validate. And unless we lock down the ioctl() to only be allowed on vCPUs
that have never done KVM_RUN, we need that test coverage anyways.

And I don't think it makes sense to try and lock down the ioctl(), because for
the enforcement to have any meaning, KVM would need to reject the ioctl() if
*any* vCPU has run, and adding that code would likely add more complexity than
it solves.

> > The use case I assumed is pre-population before the guest runs, so guest-mode
> > wouldn't matter. I didn't add an explicit check for it, though.

KVM shouldn't have an explicit is_guest_mode() check, the support should be a
property of the underlying MMU, and KVM can use the TDP MMU for L2 (if L1 is
using legacy shadow paging, not TDP).

> > Is there any use case while vCPUs are running?
> >
> > > Perhaps we can just return an error out to userspace if the vCPU is in
> > > guest-mode or TDP is disabled, and make it userspace's problem to do
> > > memory mapping before loading any vCPU state.
> >
> > If the use case for a default VM or sw-protected VM is to avoid excessive KVM
> > page faults at guest boot, erroring out on guest-mode or disabled TDP wouldn't
> > matter.
>
> Any input? If there is no further input, I assume the primary use case is
> pre-population before the guest runs.

Pre-populating is the primary use case, but that could happen if L2 is active,
e.g. after live migration.

I'm not necessarily opposed to initially adding support only for the TDP MMU,
but if the delta to also support the shadow MMU is relatively small, my
preference would be to add the support right away. E.g. to give us confidence
that the uAPI can work for multiple MMUs, and so that we don't have to write
documentation for x86 to explain exactly when it's legal to use the ioctl().
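For context on why the "reject if *any* vCPU has run" lockdown adds complexity, this is roughly what such a check would need to look like. kvm_for_each_vcpu() and last_vmentry_cpu are existing KVM constructs, but the helper itself is a hypothetical sketch, and it would still race with a concurrent KVM_RUN unless extra synchronization were added, which is the complexity being argued against.

  /*
   * Hypothetical sketch, not from the series: "has any vCPU ever entered the
   * guest?". Note this is inherently racy versus a concurrent KVM_RUN without
   * additional locking.
   */
  static bool kvm_any_vcpu_has_run(struct kvm *kvm)
  {
          struct kvm_vcpu *vcpu;
          unsigned long i;

          kvm_for_each_vcpu(i, vcpu, kvm) {
                  /* On x86, last_vmentry_cpu stays -1 until the first VM-Enter. */
                  if (vcpu->arch.last_vmentry_cpu != -1)
                          return true;
          }
          return false;
  }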
On Wed, Apr 03, 2024 at 11:30:21AM -0700, Sean Christopherson <seanjc@google.com> wrote:
> On Tue, Mar 19, 2024, Isaku Yamahata wrote:
> > On Wed, Mar 06, 2024 at 06:09:54PM -0800,
> > Isaku Yamahata <isaku.yamahata@linux.intel.com> wrote:
> >
> > > On Wed, Mar 06, 2024 at 04:53:41PM -0800,
> > > David Matlack <dmatlack@google.com> wrote:
> > >
> > > > On 2024-03-01 09:28 AM, isaku.yamahata@intel.com wrote:
> > > > > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > > > >
> > > > > Implementation:
> > > > > - x86 KVM MMU
> > > > >   In x86 KVM MMU, I chose to use kvm_mmu_do_page_fault(). It's not confined to
> > > > >   KVM TDP MMU. We can restrict it to KVM TDP MMU and introduce an optimized
> > > > >   version.
> > > >
> > > > Restricting to TDP MMU seems like a good idea. But I'm not quite sure
> > > > how to reliably do that from a vCPU context. Checking for TDP being
> > > > enabled is easy, but what if the vCPU is in guest-mode?
> > >
> > > As you pointed out in other mail, legacy KVM MMU support or guest-mode will be
> > > troublesome.
>
> Why is shadow paging troublesome? I don't see any obvious issues with
> effectively prefetching into a shadow MMU with read fault semantics. It might
> be pointless and wasteful, as the guest PTEs need to be in place, but that's
> userspace's problem.

The populating address for shadow paging is a GVA, not a GPA. I'm not sure if
that's what the user space wants. If it's a user-space problem, I'm fine.

> Testing is the biggest gap I see, as using the ioctl() for shadow paging will
> essentially require a live guest, but that doesn't seem like it'd be too hard
> to validate. And unless we lock down the ioctl() to only be allowed on vCPUs
> that have never done KVM_RUN, we need that test coverage anyways.

So far I have only tried the TDP MMU case. I can try the other MMU types.

> And I don't think it makes sense to try and lock down the ioctl(), because for
> the enforcement to have any meaning, KVM would need to reject the ioctl() if
> *any* vCPU has run, and adding that code would likely add more complexity than
> it solves.
>
> > > The use case I assumed is pre-population before the guest runs, so guest-mode
> > > wouldn't matter. I didn't add an explicit check for it, though.
>
> KVM shouldn't have an explicit is_guest_mode() check, the support should be a
> property of the underlying MMU, and KVM can use the TDP MMU for L2 (if L1 is
> using legacy shadow paging, not TDP).

I see. So the type of the populating address can vary depending on the vCPU
mode. It's user space's problem which address (GVA, L1 GPA, L2 GPA) is used.

> > > Is there any use case while vCPUs are running?
> > >
> > > > Perhaps we can just return an error out to userspace if the vCPU is in
> > > > guest-mode or TDP is disabled, and make it userspace's problem to do
> > > > memory mapping before loading any vCPU state.
> > >
> > > If the use case for a default VM or sw-protected VM is to avoid excessive KVM
> > > page faults at guest boot, erroring out on guest-mode or disabled TDP wouldn't
> > > matter.
> >
> > Any input? If there is no further input, I assume the primary use case is
> > pre-population before the guest runs.
>
> Pre-populating is the primary use case, but that could happen if L2 is active,
> e.g. after live migration.
>
> I'm not necessarily opposed to initially adding support only for the TDP MMU,
> but if the delta to also support the shadow MMU is relatively small, my
> preference would be to add the support right away. E.g. to give us confidence
> that the uAPI can work for multiple MMUs, and so that we don't have to write
> documentation for x86 to explain exactly when it's legal to use the ioctl().

If we call kvm_mmu.page_fault() without caring what address will be populated,
I don't see a big difference.
On Wed, Apr 03, 2024, Isaku Yamahata wrote:
> On Wed, Apr 03, 2024 at 11:30:21AM -0700,
> Sean Christopherson <seanjc@google.com> wrote:
>
> > On Tue, Mar 19, 2024, Isaku Yamahata wrote:
> > > On Wed, Mar 06, 2024 at 06:09:54PM -0800,
> > > Isaku Yamahata <isaku.yamahata@linux.intel.com> wrote:
> > >
> > > > On Wed, Mar 06, 2024 at 04:53:41PM -0800,
> > > > David Matlack <dmatlack@google.com> wrote:
> > > >
> > > > > On 2024-03-01 09:28 AM, isaku.yamahata@intel.com wrote:
> > > > > > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > > > > >
> > > > > > Implementation:
> > > > > > - x86 KVM MMU
> > > > > >   In x86 KVM MMU, I chose to use kvm_mmu_do_page_fault(). It's not confined to
> > > > > >   KVM TDP MMU. We can restrict it to KVM TDP MMU and introduce an optimized
> > > > > >   version.
> > > > >
> > > > > Restricting to TDP MMU seems like a good idea. But I'm not quite sure
> > > > > how to reliably do that from a vCPU context. Checking for TDP being
> > > > > enabled is easy, but what if the vCPU is in guest-mode?
> > > >
> > > > As you pointed out in other mail, legacy KVM MMU support or guest-mode will be
> > > > troublesome.
> >
> > Why is shadow paging troublesome? I don't see any obvious issues with
> > effectively prefetching into a shadow MMU with read fault semantics. It might
> > be pointless and wasteful, as the guest PTEs need to be in place, but that's
> > userspace's problem.
>
> The populating address for shadow paging is a GVA, not a GPA. I'm not sure if
> that's what the user space wants. If it's a user-space problem, I'm fine.

/facepalm

> > Pre-populating is the primary use case, but that could happen if L2 is active,
> > e.g. after live migration.
> >
> > I'm not necessarily opposed to initially adding support only for the TDP MMU,
> > but if the delta to also support the shadow MMU is relatively small, my
> > preference would be to add the support right away. E.g. to give us confidence
> > that the uAPI can work for multiple MMUs, and so that we don't have to write
> > documentation for x86 to explain exactly when it's legal to use the ioctl().
>
> If we call kvm_mmu.page_fault() without caring what address will be populated,
> I don't see a big difference.

Ignore me, I completely spaced that shadow MMUs don't operate on an L1 GPA.

I 100% agree that restricting this to TDP, at least for the initial merge, is
the way to go. A uAPI where the type of address varies based on the vCPU mode
and MMU type would be super ugly, and probably hard to use.

At that point, I don't have a strong preference as to whether or not direct
legacy/shadow MMUs are supported. That said, I think it can (probably should?)
be done in a way where it more or less Just Works, e.g. by having a function
hook in "struct kvm_mmu".
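A purely illustrative reading of the "function hook in struct kvm_mmu" idea: the hook name, signature, and field placement below are invented for this sketch, not taken from the series or from upstream KVM. MMUs that can't pre-map by GPA would simply leave the hook NULL.

  /* Imagined addition to struct kvm_mmu (hypothetical, illustration only):
   *   int (*map_gpa)(struct kvm_vcpu *vcpu, gpa_t gpa, u64 flags, u8 *max_level);
   */

  static int vcpu_map_memory_example(struct kvm_vcpu *vcpu, gpa_t gpa,
                                     u64 flags, u8 *max_level)
  {
          struct kvm_mmu *mmu = vcpu->arch.mmu;

          /* Shadow paging (GVA-based) wouldn't set the hook, so it just errors out. */
          if (!mmu->map_gpa)
                  return -EOPNOTSUPP;

          return mmu->map_gpa(vcpu, gpa, flags, max_level);
  }

The attraction of this shape is that the ioctl path never has to reason about MMU type or guest-mode directly; support becomes a property of whichever MMU the vCPU is currently using.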
From: Isaku Yamahata <isaku.yamahata@intel.com>

The objective of this RFC patch series is to develop a uAPI aimed at
(pre)populating guest memory for various use cases and underlying VM
technologies.

- Pre-populate guest memory to mitigate excessive KVM page faults during guest
  boot [1], a need not limited to any specific technology.

- Pre-populating guest memory (including encryption and measurement) for
  confidential guests [2]. SEV-SNP, TDX, and SW-PROTECTED VM. Potentially
  other technologies and pKVM.

The patches are organized as follows.
- 1: documentation on the uAPI KVM_MAP_MEMORY.
- 2: architecture-independent implementation part.
- 3-4: refactoring of x86 KVM MMU as preparation.
- 5: x86 helper function to map a guest page.
- 6: x86 KVM arch implementation.
- 7: Add x86-ops necessary for TDX and SEV-SNP.
- 8: selftest for validation.

Discussion points:

uAPI design:
- access flags
  Access flags are needed for the guest memory population. We have options for
  their exposure to the uAPI.
  - option 1. Introduce access flags, possibly with the addition of a private
    access flag.
  - option 2. Omit access flags from the uAPI.
    Allow the kernel to deduce the necessary flag based on the memory slot and
    its memory attributes.

- SEV-SNP and byte vs. page size
  The SEV correspondence is SEV_LAUNCH_UPDATE_DATA, which dictates that memory
  regions be 16-byte aligned, not page aligned. Should we define struct
  kvm_memory_mapping in bytes rather than pages?

    struct kvm_sev_launch_update_data {
            __u64 uaddr;
            __u32 len;
    };

- TDX and measurement
  The TDX correspondence is TDH.MEM.PAGE.ADD and TDH.MR.EXTEND. TDH.MR.EXTEND
  extends the measurement with the page contents.
  Option 1. Add an additional flag like KVM_MEMORY_MAPPING_FLAG_EXTEND to issue
  TDH.MR.EXTEND.
  Option 2. Don't handle extend. Let the TDX vendor-specific API
  KVM_MEMORY_ENCRYPT_OP handle it with a subcommand like KVM_TDX_EXTEND_MEMORY.

- TDX and struct kvm_memory_mapping::source
  While the current patch series doesn't utilize the
  kvm_memory_mapping::source member, TDX needs it to specify the source of
  memory contents.

Implementation:
- x86 KVM MMU
  In x86 KVM MMU, I chose to use kvm_mmu_do_page_fault(). It's not confined to
  the KVM TDP MMU. We can restrict it to the KVM TDP MMU and introduce an
  optimized version.
[1] https://lore.kernel.org/all/65262e67-7885-971a-896d-ad9c0a760907@polito.it/
[2] https://lore.kernel.org/all/6a4c029af70d41b63bcee3d6a1f0c2377f6eb4bd.1690322424.git.isaku.yamahata@intel.com

Thanks,

Isaku Yamahata (8):
  KVM: Document KVM_MAP_MEMORY ioctl
  KVM: Add KVM_MAP_MEMORY vcpu ioctl to pre-populate guest memory
  KVM: x86/mmu: Introduce initialier macro for struct kvm_page_fault
  KVM: x86/mmu: Factor out kvm_mmu_do_page_fault()
  KVM: x86/mmu: Introduce kvm_mmu_map_page() for prepopulating guest
    memory
  KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()
  KVM: x86: Add hooks in kvm_arch_vcpu_map_memory()
  KVM: selftests: x86: Add test for KVM_MAP_MEMORY

 Documentation/virt/kvm/api.rst              |  36 +++++
 arch/x86/include/asm/kvm-x86-ops.h          |   2 +
 arch/x86/include/asm/kvm_host.h             |   6 +
 arch/x86/kvm/mmu.h                          |   3 +
 arch/x86/kvm/mmu/mmu.c                      |  30 ++++
 arch/x86/kvm/mmu/mmu_internal.h             |  70 +++++----
 arch/x86/kvm/x86.c                          |  83 +++++++++++
 include/linux/kvm_host.h                    |   4 +
 include/uapi/linux/kvm.h                    |  15 ++
 tools/include/uapi/linux/kvm.h              |  14 ++
 tools/testing/selftests/kvm/Makefile        |   1 +
 .../selftests/kvm/x86_64/map_memory_test.c  | 136 ++++++++++++++++++
 virt/kvm/kvm_main.c                         |  74 ++++++++++
 13 files changed, 448 insertions(+), 26 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/map_memory_test.c

base-commit: 6a108bdc49138bcaa4f995ed87681ab9c65122ad
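To tie the series back to use case [1], here is a hedged sketch of the intended ordering from userspace's perspective: create the VM and memslot, create the vCPU, pre-populate with KVM_MAP_MEMORY, and only then load vCPU state and call KVM_RUN. The struct layout and ioctl number reuse the assumptions from the sketch earlier in the thread; everything KVM_MAP_MEMORY-specific is still a proposal, not merged ABI, and error handling is omitted for brevity.

  #include <fcntl.h>
  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>
  #include <linux/kvm.h>

  /* Proposed in this thread; layout and number are assumptions, not merged uAPI. */
  struct kvm_memory_mapping {
          uint64_t base_address;
          uint64_t size;
          uint64_t flags;
  };
  #define KVM_MAP_MEMORY _IOWR(KVMIO, 0xd5 /* placeholder */, struct kvm_memory_mapping)

  int main(void)
  {
          uint64_t gpa = 0, len = 64UL << 20;   /* 64 MiB of guest RAM */
          int kvm = open("/dev/kvm", O_RDWR);
          int vm = ioctl(kvm, KVM_CREATE_VM, 0);

          /* 1. Back the guest range with an ordinary memslot. */
          void *host = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          struct kvm_userspace_memory_region slot = {
                  .slot = 0,
                  .guest_phys_addr = gpa,
                  .memory_size = len,
                  .userspace_addr = (uint64_t)host,
          };
          ioctl(vm, KVM_SET_USER_MEMORY_REGION, &slot);

          /* 2. Create the vCPU, but don't run it yet. */
          int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);

          /* 3. Pre-populate the mappings to avoid page-fault storms at boot. */
          struct kvm_memory_mapping map = {
                  .base_address = gpa,
                  .size = len,
          };
          ioctl(vcpu, KVM_MAP_MEMORY, &map);    /* retry loop omitted for brevity */

          /* 4. Load vCPU state (registers, etc.) and finally issue KVM_RUN. */
          return 0;
  }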