[RFC,0/8] KVM: Prepopulate guest memory API

Message ID: cover.1709288671.git.isaku.yamahata@intel.com

Message

Isaku Yamahata March 1, 2024, 5:28 p.m. UTC
From: Isaku Yamahata <isaku.yamahata@intel.com>

The objective of this RFC patch series is to develop a uAPI aimed at
(pre)populating guest memory for various use cases and underlying VM
technologies.

- Pre-populate guest memory to mitigate excessive KVM page faults during guest
  boot [1], a need not limited to any specific technology.

- Pre-populating guest memory (including encryption and measurement) for
  confidential guests [2]: SEV-SNP, TDX, and SW-PROTECTED VMs, and potentially
  other technologies such as pKVM.

The patches are organized as follows.
- 1: documentation on uAPI KVM_MAP_MEMORY.
- 2: architecture-independent implementation part.
- 3-4: refactoring of x86 KVM MMU as preparation.
- 5: x86 Helper function to map guest page.
- 6: x86 KVM arch implementation.
- 7: Add x86-ops necessary for TDX and SEV-SNP.
- 8: selftest for validation.

Discussion point:

uAPI design:
- access flags
  Access flags are needed to populate guest memory.  There are options for
  exposing them in the uAPI (a rough sketch of the struct and candidate flag
  bits follows after this list):
  - option 1. Introduce access flags, possibly with the addition of a private
              access flag.
  - option 2. Omit access flags from UAPI.
    Allow the kernel to deduce the necessary flag based on the memory slot and
    its memory attributes.

- SEV-SNP and byte vs. page size
  The SEV counterpart is SEV_LAUNCH_UPDATE_DATA, which requires memory regions
  to be 16-byte aligned rather than page aligned.  Should we define struct
  kvm_memory_mapping in bytes rather than pages?

  struct kvm_sev_launch_update_data {
        __u64 uaddr;
        __u32 len;
  };

- TDX and measurement
  The TDX counterparts are TDH.MEM.PAGE.ADD and TDH.MR.EXTEND.  TDH.MR.EXTEND
  extends the measurement with the page contents.
  Option 1. Add an additional flag like KVM_MEMORY_MAPPING_FLAG_EXTEND to issue
            TDH.MR.EXTEND.
  Option 2. Don't handle extend.  Let the TDX vendor-specific API
            KVM_MEMORY_ENCRYPT_OP handle it with a subcommand like
            KVM_TDX_EXTEND_MEMORY.

- TDX and struct kvm_memory_mapping::source
  While the current patch series doesn't use the kvm_memory_mapping::source
  member, TDX needs it to specify the source of the memory contents.
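
For concreteness, here is a rough sketch of the uAPI shape under discussion.  The
exact layout, the flag names/bits, and whether the unit is pages or bytes are all
still open, so treat the following as illustrative rather than final:

  struct kvm_memory_mapping {
        __u64 base_gfn;
        __u64 nr_pages;
        __u64 flags;
        __u64 source;
  };

  /* Candidate flag bits for option 1 of the access-flags discussion above. */
  #define KVM_MEMORY_MAPPING_FLAG_WRITE   _BITULL(0)
  #define KVM_MEMORY_MAPPING_FLAG_EXEC    _BITULL(1)
  #define KVM_MEMORY_MAPPING_FLAG_USER    _BITULL(2)
  #define KVM_MEMORY_MAPPING_FLAG_PRIVATE _BITULL(3)

  /* Possible extra flag for option 1 of the TDX measurement discussion. */
  #define KVM_MEMORY_MAPPING_FLAG_EXTEND  _BITULL(4)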

Implementation:
- x86 KVM MMU
  In x86 KVM MMU, I chose to use kvm_mmu_do_page_fault().  It's not confined to
  KVM TDP MMU.  We can restrict it to KVM TDP MMU and introduce an optimized
  version.
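
As a simplified sketch of that arch-side loop (the helper's exact signature and
the error-code handling below are assumptions, not the final patch contents):

  int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
                               struct kvm_memory_mapping *mapping)
  {
        u64 error_code = 0;
        int r;

        if (mapping->flags & KVM_MEMORY_MAPPING_FLAG_WRITE)
                error_code |= PFERR_WRITE_MASK;

        while (mapping->nr_pages) {
                /*
                 * kvm_mmu_map_page() (patch 5) wraps kvm_mmu_do_page_fault()
                 * with prefetch semantics.
                 */
                r = kvm_mmu_map_page(vcpu, gfn_to_gpa(mapping->base_gfn),
                                     error_code, PG_LEVEL_4K);
                if (r)
                        return r;

                /* Report progress so userspace can retry the remainder. */
                mapping->base_gfn++;
                mapping->nr_pages--;

                cond_resched();
        }
        return 0;
  }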

[1] https://lore.kernel.org/all/65262e67-7885-971a-896d-ad9c0a760907@polito.it/
[2] https://lore.kernel.org/all/6a4c029af70d41b63bcee3d6a1f0c2377f6eb4bd.1690322424.git.isaku.yamahata@intel.com

Thanks,

Isaku Yamahata (8):
  KVM: Document KVM_MAP_MEMORY ioctl
  KVM: Add KVM_MAP_MEMORY vcpu ioctl to pre-populate guest memory
  KVM: x86/mmu: Introduce initializer macro for struct kvm_page_fault
  KVM: x86/mmu: Factor out kvm_mmu_do_page_fault()
  KVM: x86/mmu: Introduce kvm_mmu_map_page() for prepopulating guest
    memory
  KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()
  KVM: x86: Add hooks in kvm_arch_vcpu_map_memory()
  KVM: selftests: x86: Add test for KVM_MAP_MEMORY

 Documentation/virt/kvm/api.rst                |  36 +++++
 arch/x86/include/asm/kvm-x86-ops.h            |   2 +
 arch/x86/include/asm/kvm_host.h               |   6 +
 arch/x86/kvm/mmu.h                            |   3 +
 arch/x86/kvm/mmu/mmu.c                        |  30 ++++
 arch/x86/kvm/mmu/mmu_internal.h               |  70 +++++----
 arch/x86/kvm/x86.c                            |  83 +++++++++++
 include/linux/kvm_host.h                      |   4 +
 include/uapi/linux/kvm.h                      |  15 ++
 tools/include/uapi/linux/kvm.h                |  14 ++
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/x86_64/map_memory_test.c    | 136 ++++++++++++++++++
 virt/kvm/kvm_main.c                           |  74 ++++++++++
 13 files changed, 448 insertions(+), 26 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/map_memory_test.c


base-commit: 6a108bdc49138bcaa4f995ed87681ab9c65122ad

Comments

David Matlack March 7, 2024, 12:53 a.m. UTC | #1
On 2024-03-01 09:28 AM, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Implementation:
> - x86 KVM MMU
>   In x86 KVM MMU, I chose to use kvm_mmu_do_page_fault().  It's not confined to
>   KVM TDP MMU.  We can restrict it to KVM TDP MMU and introduce an optimized
>   version.

Restricting to TDP MMU seems like a good idea. But I'm not quite sure
how to reliably do that from a vCPU context. Checking for TDP being
enabled is easy, but what if the vCPU is in guest-mode?

Perhaps we can just return an error out to userspace if the vCPU is in
guest-mode or TDP is disabled, and make it userspace's problem to do
memory mapping before loading any vCPU state.
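
Something along these lines (purely illustrative):

        if (!tdp_enabled || is_guest_mode(vcpu))
                return -EOPNOTSUPP;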
Isaku Yamahata March 7, 2024, 2:09 a.m. UTC | #2
On Wed, Mar 06, 2024 at 04:53:41PM -0800,
David Matlack <dmatlack@google.com> wrote:

> On 2024-03-01 09:28 AM, isaku.yamahata@intel.com wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> > Implementation:
> > - x86 KVM MMU
> >   In x86 KVM MMU, I chose to use kvm_mmu_do_page_fault().  It's not confined to
> >   KVM TDP MMU.  We can restrict it to KVM TDP MMU and introduce an optimized
> >   version.
> 
> Restricting to TDP MMU seems like a good idea. But I'm not quite sure
> how to reliably do that from a vCPU context. Checking for TDP being
> enabled is easy, but what if the vCPU is in guest-mode?

As you pointed out in other mail, legacy KVM MMU support or guest-mode will be
troublesome.  The use case I assumed is pre-population before the guest runs,
so guest-mode wouldn't matter.  I didn't add an explicit check for it, though.

Any use case while vcpus running?


> Perhaps we can just return an error out to userspace if the vCPU is in
> guest-mode or TDP is disabled, and make it userspace's problem to do
> memory mapping before loading any vCPU state.

If the use case for a default VM or sw-protected VM is to avoid excessive KVM
page faults at guest boot, an error on guest-mode or disabled TDP wouldn't matter.
Michael Roth March 11, 2024, 3:20 a.m. UTC | #3
On Fri, Mar 01, 2024 at 09:28:42AM -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> The objective of this RFC patch series is to develop a uAPI aimed at
> (pre)populating guest memory for various use cases and underlying VM
> technologies.
> 
> - Pre-populate guest memory to mitigate excessive KVM page faults during guest
>   boot [1], a need not limited to any specific technology.
> 
> - Pre-populating guest memory (including encryption and measurement) for
>   confidential guests [2]: SEV-SNP, TDX, and SW-PROTECTED VMs, and potentially
>   other technologies such as pKVM.
> 
> The patches are organized as follows.
> - 1: documentation on uAPI KVM_MAP_MEMORY.
> - 2: architecture-independent implementation part.
> - 3-4: refactoring of x86 KVM MMU as preparation.
> - 5: x86 Helper function to map guest page.
> - 6: x86 KVM arch implementation.
> - 7: Add x86-ops necessary for TDX and SEV-SNP.
> - 8: selftest for validation.
> 
> Discussion point:
> 
> uAPI design:
> - access flags
>   Access flags are needed to populate guest memory.  There are options for
>   exposing them in the uAPI.
>   - option 1. Introduce access flags, possibly with the addition of a private
>               access flag.
>   - option 2. Omit access flags from UAPI.
>     Allow the kernel to deduce the necessary flag based on the memory slot and
>     its memory attributes.
> 
> - SEV-SNP and byte vs. page size
>   The SEV counterpart is SEV_LAUNCH_UPDATE_DATA, which requires memory regions
>   to be 16-byte aligned rather than page aligned.  Should we define struct
>   kvm_memory_mapping in bytes rather than pages?

For SNP it's multiples of page size only, and I'm not sure it's worth
trying to work in legacy SEV support, since that pulls in other
requirements like how the backing memory also needs to be pre-pinned via
a separate KVM ioctl that would need to be documented as an SEV-specific
requirement for this interface... it just doesn't seem worth it. And SEV
would still benefit from the more basic functionality of pre-mapping pages
just like any other guest so it still benefits in that regard.

That said, it would be a shame if we needed to clutter up the API as
soon as some user came along that required non-page-sized values. That
seems unlikely for the pre-mapping use case, but for CoCo maybe it's
worth checking if pKVM would have any requirements like that? Or just
going with byte-size parameters to play it safe?

> 
>   struct kvm_sev_launch_update_data {
>         __u64 uaddr;
>         __u32 len;
>   };
> 
> - TDX and measurement
>   The TDX counterparts are TDH.MEM.PAGE.ADD and TDH.MR.EXTEND.  TDH.MR.EXTEND
>   extends the measurement with the page contents.
>   Option 1. Add an additional flag like KVM_MEMORY_MAPPING_FLAG_EXTEND to issue
>             TDH.MR.EXTEND.
>   Option 2. Don't handle extend.  Let the TDX vendor-specific API
>             KVM_MEMORY_ENCRYPT_OP handle it with a subcommand like
>             KVM_TDX_EXTEND_MEMORY.

For SNP this happens unconditionally via SNP_LAUNCH_UPDATE, and with some
additional measurements via SNP_LAUNCH_FINISH, and down the road when live
migration support is added that flow will be a bit different. So
personally I think it's better to leave separate for now. Maybe down the
road some common measurement/finalize interface can deprecate some of
the other MEMORY_ENCRYPT ioctls.

Even with this more narrowly-defined purpose it's really unfortunate we
have to bake any CoCo stuff into this interface at all. It would be great
if all it did was simply pre-map entries into nested page tables to boost
boot performance, and we had some separate CoCo-specific API to handle
initial loading/encryption of guest data. I understand with SecureEPT
considerations why we sort of need it here for TDX, but it already ends
up being awkward for the SNP_LAUNCH_UPDATE use-case because there's not
really any good reason for that handling to live inside KVM MMU hooks
like with TDX, so we'll probably end up putting it in a pre/post hook
where all the handling is completely separate from TDX flow and in the
process complicating the userspace API with the additional parameters
needed for that even though other guest types are likely to never need
them.

(Alternatively we can handle SNP_LAUNCH_UPDATE via KVM MMU hooks like with
tdx_mem_page_add(), but then we'll need to plumb additional parameters
down into the KVM MMU code for stuff like the SNP page type. But maybe
it's worth it to have some level of commonality for x86 at least?)

But I'd be hesitant to bake more requirements into this pre-mapping
interface, it feels like we're already overloading it as is.

> 
> - TDX and struct kvm_memory_mapping::source
>   While the current patch series doesn't use the kvm_memory_mapping::source
>   member, TDX needs it to specify the source of the memory contents.

SNP will need it too FWIW.

-Mike

Sean Christopherson March 11, 2024, 11:44 p.m. UTC | #4
On Sun, Mar 10, 2024, Michael Roth wrote:
> On Fri, Mar 01, 2024 at 09:28:42AM -0800, isaku.yamahata@intel.com wrote:
> >   struct kvm_sev_launch_update_data {
> >         __u64 uaddr;
> >         __u32 len;
> >   };
> > 
> > - TDX and measurement
> >   The TDX counterparts are TDH.MEM.PAGE.ADD and TDH.MR.EXTEND.  TDH.MR.EXTEND
> >   extends the measurement with the page contents.
> >   Option 1. Add an additional flag like KVM_MEMORY_MAPPING_FLAG_EXTEND to issue
> >             TDH.MR.EXTEND.
> >   Option 2. Don't handle extend.  Let the TDX vendor-specific API
> >             KVM_MEMORY_ENCRYPT_OP handle it with a subcommand like
> >             KVM_TDX_EXTEND_MEMORY.
> 
> For SNP this happens unconditionally via SNP_LAUNCH_UPDATE, and with some
> additional measurements via SNP_LAUNCH_FINISH, and down the road when live
> migration support is added that flow will be a bit different. So
> personally I think it's better to leave separate for now.

+1.  The only reason to do EXTEND at the same time as PAGE.ADD would be to
optimize setups that want the measurement to be extended with the contents of a
page immediately after the measurement is extended with the mapping metadata for
said page.  And AFAIK, the only reason to prefer that approach is for backwards
compatibility, which is not a concern for KVM.  I suppose maaaybe some memory
locality performance benefits, but that seems like a stretch.

<time passes>

And I think this whole conversation is moot, because I don't think there's a need
to do PAGE.ADD during KVM_MAP_MEMORY[*].  If KVM_MAP_MEMORY does only the SEPT.ADD
side of things, then both @source (PAGE.ADD) and the EXTEND flag go away.

> But I'd be hesitant to bake more requirements into this pre-mapping
> interface, it feels like we're already overloading it as is.

Agreed.  After being able to think more about this ioctl(), I think KVM_MAP_MEMORY
should be as "pure" of a mapping operation as we can make it.  It'd be a little
weird that using KVM_MAP_MEMORY is required for TDX VMs, but not other VMs.  But
that's really just a reflection of S-EPT, so it's arguably not even a bad thing.

[*] https://lore.kernel.org/all/Ze-TJh0BBOWm9spT@google.com
Isaku Yamahata March 12, 2024, 1:32 a.m. UTC | #5
On Mon, Mar 11, 2024 at 04:44:27PM -0700,
Sean Christopherson <seanjc@google.com> wrote:

> On Sun, Mar 10, 2024, Michael Roth wrote:
> > On Fri, Mar 01, 2024 at 09:28:42AM -0800, isaku.yamahata@intel.com wrote:
> > >   struct kvm_sev_launch_update_data {
> > >         __u64 uaddr;
> > >         __u32 len;
> > >   };
> > > 
> > > - TDX and measurement
> > >   The TDX counterparts are TDH.MEM.PAGE.ADD and TDH.MR.EXTEND.  TDH.MR.EXTEND
> > >   extends the measurement with the page contents.
> > >   Option 1. Add an additional flag like KVM_MEMORY_MAPPING_FLAG_EXTEND to issue
> > >             TDH.MR.EXTEND.
> > >   Option 2. Don't handle extend.  Let the TDX vendor-specific API
> > >             KVM_MEMORY_ENCRYPT_OP handle it with a subcommand like
> > >             KVM_TDX_EXTEND_MEMORY.
> > 
> > For SNP this happens unconditionally via SNP_LAUNCH_UPDATE, and with some
> > additional measurements via SNP_LAUNCH_FINISH, and down the road when live
> > migration support is added that flow will be a bit different. So
> > personally I think it's better to leave separate for now.
> 
> +1.  The only reason to do EXTEND at the same time as PAGE.ADD would be to
> optimize setups that want the measurement to be extended with the contents of a
> page immediately after the measurement is extended with the mapping metadata for
> said page.  And AFAIK, the only reason to prefer that approach is for backwards
> compatibility, which is not a concern for KVM.  I suppose maaaybe some memory
> locality performance benefits, but that seems like a stretch.
> 
> <time passes>
> 
> And I think this whole conversation is moot, because I don't think there's a need
> to do PAGE.ADD during KVM_MAP_MEMORY[*].  If KVM_MAP_MEMORY does only the SEPT.ADD
> side of things, then both @source (PAGE.ADD) and the EXTEND flag go away.
> 
> > But I'd be hesitant to bake more requirements into this pre-mapping
> > interface, it feels like we're already overloading it as is.
> 
> Agreed.  After being able to think more about this ioctl(), I think KVM_MAP_MEMORY
> should be as "pure" of a mapping operation as we can make it.  It'd be a little
> weird that using KVM_MAP_MEMORY is required for TDX VMs, but not other VMs.  But
> that's really just a reflection of S-EPT, so it's arguably not even a bad thing.
> 
> [*] https://lore.kernel.org/all/Ze-TJh0BBOWm9spT@google.com

Let me try to remove source from struct kvm_memory_mapping.  With the unit in
bytes instead of pages, it would be:
  struct kvm_memory_mapping {
        __u64 base_address;
        __u64 size;
        __u64 flags;
  };

SNP won't need any changes.  Should KVM_MAP_MEMORY always return an error for
SNP?  (I'll leave that to Roth.)
TDX will have TDX_INIT_MEM_REGION with a new implementation.
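
For reference, the intended userspace usage with that layout would be roughly the
following, assuming the ioctl updates base_address/size to report progress on
partial completion (error handling trimmed; vcpu_fd/gpa/region_size come from the
VMM's setup):

  struct kvm_memory_mapping mapping = {
        .base_address = gpa,
        .size = region_size,
        .flags = 0,
  };

  while (mapping.size) {
        if (ioctl(vcpu_fd, KVM_MAP_MEMORY, &mapping) < 0) {
                if (errno == EAGAIN || errno == EINTR)
                        continue;       /* kernel advanced base_address/size */
                err(1, "KVM_MAP_MEMORY");
        }
  }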
Isaku Yamahata March 19, 2024, 4:33 p.m. UTC | #6
On Wed, Mar 06, 2024 at 06:09:54PM -0800,
Isaku Yamahata <isaku.yamahata@linux.intel.com> wrote:

> On Wed, Mar 06, 2024 at 04:53:41PM -0800,
> David Matlack <dmatlack@google.com> wrote:
> 
> > On 2024-03-01 09:28 AM, isaku.yamahata@intel.com wrote:
> > > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > > 
> > > Implementation:
> > > - x86 KVM MMU
> > >   In x86 KVM MMU, I chose to use kvm_mmu_do_page_fault().  It's not confined to
> > >   KVM TDP MMU.  We can restrict it to KVM TDP MMU and introduce an optimized
> > >   version.
> > 
> > Restricting to TDP MMU seems like a good idea. But I'm not quite sure
> > how to reliably do that from a vCPU context. Checking for TDP being
> > enabled is easy, but what if the vCPU is in guest-mode?
> 
> As you pointed out in other mail, legacy KVM MMU support or guest-mode will be
> troublesome.  The use case I assumed is pre-population before the guest runs,
> so guest-mode wouldn't matter.  I didn't add an explicit check for it, though.
> 
> Any use case while vcpus running?
> 
> 
> > Perhaps we can just return an error out to userspace if the vCPU is in
> > guest-mode or TDP is disabled, and make it userspace's problem to do
> > memory mapping before loading any vCPU state.
> 
> If the use case for a default VM or sw-protected VM is to avoid excessive KVM
> page faults at guest boot, an error on guest-mode or disabled TDP wouldn't matter.

Any input?  If no further input, I assume the primary use case is pre-population
before the guest runs.
Sean Christopherson April 3, 2024, 6:30 p.m. UTC | #7
On Tue, Mar 19, 2024, Isaku Yamahata wrote:
> On Wed, Mar 06, 2024 at 06:09:54PM -0800,
> Isaku Yamahata <isaku.yamahata@linux.intel.com> wrote:
> 
> > On Wed, Mar 06, 2024 at 04:53:41PM -0800,
> > David Matlack <dmatlack@google.com> wrote:
> > 
> > > On 2024-03-01 09:28 AM, isaku.yamahata@intel.com wrote:
> > > > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > > > 
> > > > Implementation:
> > > > - x86 KVM MMU
> > > >   In x86 KVM MMU, I chose to use kvm_mmu_do_page_fault().  It's not confined to
> > > >   KVM TDP MMU.  We can restrict it to KVM TDP MMU and introduce an optimized
> > > >   version.
> > > 
> > > Restricting to TDP MMU seems like a good idea. But I'm not quite sure
> > > how to reliably do that from a vCPU context. Checking for TDP being
> > > enabled is easy, but what if the vCPU is in guest-mode?
> > 
> > As you pointed out in other mail, legacy KVM MMU support or guest-mode will be
> > troublesome.

Why is shadow paging troublesome?  I don't see any obvious issues with effectively
prefetching into a shadow MMU with read fault semantics.  It might be pointless
and wasteful, as the guest PTEs need to be in place, but that's userspace's problem.

Testing is the biggest gap I see, as using the ioctl() for shadow paging will
essentially require a live guest, but that doesn't seem like it'd be too hard to
validate.  And unless we lock down the ioctl() to only be allowed on vCPUs that
have never done KVM_RUN, we need that test coverage anyways.

And I don't think it makes sense to try and lock down the ioctl(), because for
the enforcement to have any meaning, KVM would need to reject the ioctl() if *any*
vCPU has run, and adding that code would likely add more complexity than it solves.

> > The use case I assumed is pre-population before the guest runs, so guest-mode
> > wouldn't matter.  I didn't add an explicit check for it, though.

KVM shouldn't have an explicit is_guest_mode() check, the support should be a
property of the underlying MMU, and KVM can use the TDP MMU for L2 (if L1 is
using legacy shadow paging, not TDP).

> > Any use case while vcpus running?
> > 
> > 
> > > Perhaps we can just return an error out to userspace if the vCPU is in
> > > guest-mode or TDP is disabled, and make it userspace's problem to do
> > > memory mapping before loading any vCPU state.
> > 
> > If the use case for a default VM or sw-protected VM is to avoid excessive KVM
> > page faults at guest boot, an error on guest-mode or disabled TDP wouldn't matter.
> 
> Any input?  If no further input, I assume the primary use case is pre-population
> > before the guest runs.

Pre-populating is the primary use case, but that could happen if L2 is active,
e.g. after live migration.

I'm not necessarily opposed to initially adding support only for the TDP MMU, but
if the delta to also support the shadow MMU is relatively small, my preference
would be to add the support right away.  E.g. to give us confidence that the uAPI
can work for multiple MMUs, and so that we don't have to write documentation for
x86 to explain exactly when it's legal to use the ioctl().
Isaku Yamahata April 3, 2024, 10 p.m. UTC | #8
On Wed, Apr 03, 2024 at 11:30:21AM -0700,
Sean Christopherson <seanjc@google.com> wrote:

> On Tue, Mar 19, 2024, Isaku Yamahata wrote:
> > On Wed, Mar 06, 2024 at 06:09:54PM -0800,
> > Isaku Yamahata <isaku.yamahata@linux.intel.com> wrote:
> > 
> > > On Wed, Mar 06, 2024 at 04:53:41PM -0800,
> > > David Matlack <dmatlack@google.com> wrote:
> > > 
> > > > On 2024-03-01 09:28 AM, isaku.yamahata@intel.com wrote:
> > > > > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > > > > 
> > > > > Implementation:
> > > > > - x86 KVM MMU
> > > > >   In x86 KVM MMU, I chose to use kvm_mmu_do_page_fault().  It's not confined to
> > > > >   KVM TDP MMU.  We can restrict it to KVM TDP MMU and introduce an optimized
> > > > >   version.
> > > > 
> > > > Restricting to TDP MMU seems like a good idea. But I'm not quite sure
> > > > how to reliably do that from a vCPU context. Checking for TDP being
> > > > enabled is easy, but what if the vCPU is in guest-mode?
> > > 
> > > As you pointed out in other mail, legacy KVM MMU support or guest-mode will be
> > > troublesome.
> 
> Why is shadow paging troublesome?  I don't see any obvious issues with effectively
> prefetching into a shadow MMU with read fault semantics.  It might be pointless
> and wasteful, as the guest PTEs need to be in place, but that's userspace's problem.

The populating address for shadow paging is GVA, not GPA.  I'm not sure if
that's what userspace wants.  If it's a user-space problem, I'm fine.


> Testing is the biggest gap I see, as using the ioctl() for shadow paging will
> essentially require a live guest, but that doesn't seem like it'd be too hard to
> validate.  And unless we lock down the ioctl() to only be allowed on vCPUs that
> have never done KVM_RUN, we need that test coverage anyways.

So far I have only tried the TDP MMU case.  I can try the other MMU types.


> And I don't think it makes sense to try and lock down the ioctl(), because for
> the enforcement to have any meaning, KVM would need to reject the ioctl() if *any*
> vCPU has run, and adding that code would likely add more complexity than it solves.
> 
> > > The use case I assumed is pre-population before the guest runs, so guest-mode
> > > wouldn't matter.  I didn't add an explicit check for it, though.
> 
> KVM shouldn't have an explicit is_guest_mode() check, the support should be a
> property of the underlying MMU, and KVM can use the TDP MMU for L2 (if L1 is
> using legacy shadow paging, not TDP).

I see.  So the type of the populating address can vary depending on the vCPU
mode.  It's a user-space problem which address (GVA, L1 GPA, L2 GPA) is used.


> > > Any use case while vcpus running?
> > > 
> > > 
> > > > Perhaps we can just return an error out to userspace if the vCPU is in
> > > > guest-mode or TDP is disabled, and make it userspace's problem to do
> > > > memory mapping before loading any vCPU state.
> > > 
> > > If the use case for a default VM or sw-protected VM is to avoid excessive KVM
> > > page faults at guest boot, an error on guest-mode or disabled TDP wouldn't matter.
> > 
> > Any input?  If no further input, I assume the primary use case is pre-population
> > > before the guest runs.
> 
> Pre-populating is the primary use case, but that could happen if L2 is active,
> e.g. after live migration.
> 
> I'm not necessarily opposed to initially adding support only for the TDP MMU, but
> if the delta to also support the shadow MMU is relatively small, my preference
> would be to add the support right away.  E.g. to give us confidence that the uAPI
> can work for multiple MMUs, and so that we don't have to write documentation for
> x86 to explain exactly when it's legal to use the ioctl().

If we call kvm_mmu.page_fault() without caring which address will be
populated, I don't see a big difference.
Sean Christopherson April 3, 2024, 10:42 p.m. UTC | #9
On Wed, Apr 03, 2024, Isaku Yamahata wrote:
> On Wed, Apr 03, 2024 at 11:30:21AM -0700,
> Sean Christopherson <seanjc@google.com> wrote:
> 
> > On Tue, Mar 19, 2024, Isaku Yamahata wrote:
> > > On Wed, Mar 06, 2024 at 06:09:54PM -0800,
> > > Isaku Yamahata <isaku.yamahata@linux.intel.com> wrote:
> > > 
> > > > On Wed, Mar 06, 2024 at 04:53:41PM -0800,
> > > > David Matlack <dmatlack@google.com> wrote:
> > > > 
> > > > > On 2024-03-01 09:28 AM, isaku.yamahata@intel.com wrote:
> > > > > > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > > > > > 
> > > > > > Implementation:
> > > > > > - x86 KVM MMU
> > > > > >   In x86 KVM MMU, I chose to use kvm_mmu_do_page_fault().  It's not confined to
> > > > > >   KVM TDP MMU.  We can restrict it to KVM TDP MMU and introduce an optimized
> > > > > >   version.
> > > > > 
> > > > > Restricting to TDP MMU seems like a good idea. But I'm not quite sure
> > > > > how to reliably do that from a vCPU context. Checking for TDP being
> > > > > enabled is easy, but what if the vCPU is in guest-mode?
> > > > 
> > > > As you pointed out in other mail, legacy KVM MMU support or guest-mode will be
> > > > troublesome.
> > 
> > Why is shadow paging troublesome?  I don't see any obvious issues with effectively
> > prefetching into a shadow MMU with read fault semantics.  It might be pointless
> > and wasteful, as the guest PTEs need to be in place, but that's userspace's problem.
> 
> The populating address for shadow paging is GVA, not GPA.  I'm not sure if
> that's what userspace wants.  If it's a user-space problem, I'm fine.

/facepalm

> > Pre-populating is the primary use case, but that could happen if L2 is active,
> > e.g. after live migration.
> > 
> > I'm not necessarily opposed to initially adding support only for the TDP MMU, but
> > if the delta to also support the shadow MMU is relatively small, my preference
> > would be to add the support right away.  E.g. to give us confidence that the uAPI
> > can work for multiple MMUs, and so that we don't have to write documentation for
> > x86 to explain exactly when it's legal to use the ioctl().
> 
> If we call kvm_mmu.page_fault() without caring which address will be
> populated, I don't see a big difference.

Ignore me, I completely spaced that shadow MMUs don't operate on an L1 GPA.  I
100% agree that restricting this to TDP, at least for the initial merge, is the
way to go.  A uAPI where the type of address varies based on the vCPU mode and
MMU type would be super ugly, and probably hard to use.

At that point, I don't have a strong preference as to whether or not direct
legacy/shadow MMUs are supported.  That said, I think it can (probably should?)
be done in a way where it more or less Just Works, e.g. by having a function hook
in "struct kvm_mmu".