diff mbox series

[RFC,1/8] KVM: Document KVM_MAP_MEMORY ioctl

Message ID c50dc98effcba3ff68a033661b2941b777c4fb5c.1709288671.git.isaku.yamahata@intel.com (mailing list archive)
State New, archived
Headers show
Series KVM: Prepopulate guest memory API | expand

Commit Message

Isaku Yamahata March 1, 2024, 5:28 p.m. UTC
From: Isaku Yamahata <isaku.yamahata@intel.com>

Adds documentation of KVM_MAP_MEMORY ioctl.

It pre-populates guest memory. And potentially do initialized memory
contents with encryption and measurement depending on underlying
technology.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 Documentation/virt/kvm/api.rst | 36 ++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

Comments

David Matlack March 7, 2024, 12:43 a.m. UTC | #1
On 2024-03-01 09:28 AM, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> +
> +  struct kvm_memory_mapping {
> +	__u64 base_gfn;
> +	__u64 nr_pages;
> +	__u64 flags;
> +	__u64 source;
> +  };
> +
> +  /* For kvm_memory_mapping:: flags */
> +  #define KVM_MEMORY_MAPPING_FLAG_WRITE         _BITULL(0)
> +  #define KVM_MEMORY_MAPPING_FLAG_EXEC          _BITULL(1)
> +  #define KVM_MEMORY_MAPPING_FLAG_USER          _BITULL(2)
> +  #define KVM_MEMORY_MAPPING_FLAG_PRIVATE       _BITULL(3)
> +
> +KVM_MAP_MEMORY populates guest memory in the underlying mapping. If source is
> +not zero and it's supported (depending on underlying technology), the guest
> +memory content is populated with the source.

What does "populated with the source" mean?

> The flags field supports three
> +flags: KVM_MEMORY_MAPPING_FLAG_WRITE, KVM_MEMORY_MAPPING_FLAG_EXEC, and
> +KVM_MEMORY_MAPPING_FLAG_USER.

There are 4 flags.
Isaku Yamahata March 7, 2024, 1:29 a.m. UTC | #2
On Wed, Mar 06, 2024 at 04:43:51PM -0800,
David Matlack <dmatlack@google.com> wrote:

> On 2024-03-01 09:28 AM, isaku.yamahata@intel.com wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > +
> > +  struct kvm_memory_mapping {
> > +	__u64 base_gfn;
> > +	__u64 nr_pages;
> > +	__u64 flags;
> > +	__u64 source;
> > +  };
> > +
> > +  /* For kvm_memory_mapping:: flags */
> > +  #define KVM_MEMORY_MAPPING_FLAG_WRITE         _BITULL(0)
> > +  #define KVM_MEMORY_MAPPING_FLAG_EXEC          _BITULL(1)
> > +  #define KVM_MEMORY_MAPPING_FLAG_USER          _BITULL(2)
> > +  #define KVM_MEMORY_MAPPING_FLAG_PRIVATE       _BITULL(3)
> > +
> > +KVM_MAP_MEMORY populates guest memory in the underlying mapping. If source is
> > +not zero and it's supported (depending on underlying technology), the guest
> > +memory content is populated with the source.
> 
> What does "populated with the source" mean?

source is user pointer and the memory contents of source is copied into
base_gfn. (and it will encrypted.)


> > The flags field supports three
> > +flags: KVM_MEMORY_MAPPING_FLAG_WRITE, KVM_MEMORY_MAPPING_FLAG_EXEC, and
> > +KVM_MEMORY_MAPPING_FLAG_USER.
> 
> There are 4 flags.

Oops. Let me update it.


KVM_MAP_MEMORY populates guest memory at the specified range (`base_gfn`,
`nr_pages`) in the underlying mapping. `source` is an optional user pointer. If
`source` is not NULL and the underlying technology supports it, the memory
contents of `source` are copied into the guest memory. The backend may encrypt
it.

The `flags` field supports four flags: KVM_MEMORY_MAPPING_FLAG_WRITE,
KVM_MEMORY_MAPPING_FLAG_EXEC, KVM_MEMORY_MAPPING_FLAG_USER, and
KVM_MEMORY_MAPPING_FLAGS_PRIVATE. The first three correspond to the fault code
for the KVM page fault to populate guest memory.  write fault, fetch fault, and
user fault.  KVM_MEMORY_MAPPING_FLAGS_PRIVATE is applicable only for guest
memory with guest_memfd.  It is to populate guest memory with the memory
attribute of KVM_MEMORY_ATTRIBUTE_PRIVATE set.

When the ioctl returns, the input values are updated.  If `nr_pages` is large,
it may return -EAGAIN and update the values (`base_gfn` and `nr_pages`. `source`
if not zero) to point to the remaining range.
Huang, Kai March 7, 2024, 12:30 p.m. UTC | #3
On Fri, 2024-03-01 at 09:28 -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Adds documentation of KVM_MAP_MEMORY ioctl.
> 
> It pre-populates guest memory. And potentially do initialized memory
> contents with encryption and measurement depending on underlying
> technology.
> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  Documentation/virt/kvm/api.rst | 36 ++++++++++++++++++++++++++++++++++
>  1 file changed, 36 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 0b5a33ee71ee..33d2b63f7dbf 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6352,6 +6352,42 @@ a single guest_memfd file, but the bound ranges must not overlap).
>  
>  See KVM_SET_USER_MEMORY_REGION2 for additional details.
>  
> +4.143 KVM_MAP_MEMORY
> +------------------------
> +
> +:Capability: KVM_CAP_MAP_MEMORY
> +:Architectures: none
> +:Type: vcpu ioctl

I think "vcpu ioctl" means theoretically it can be called on multiple vcpus.

What happens in that case?

> +:Parameters: struct kvm_memory_mapping(in/out)
> +:Returns: 0 on success, <0 on error
> +
> +KVM_MAP_MEMORY populates guest memory without running vcpu.
> +
> +::
> +
> +  struct kvm_memory_mapping {
> +	__u64 base_gfn;
> +	__u64 nr_pages;
> +	__u64 flags;
> +	__u64 source;
> +  };
> +
> +  /* For kvm_memory_mapping:: flags */
> +  #define KVM_MEMORY_MAPPING_FLAG_WRITE         _BITULL(0)
> +  #define KVM_MEMORY_MAPPING_FLAG_EXEC          _BITULL(1)
> +  #define KVM_MEMORY_MAPPING_FLAG_USER          _BITULL(2)

I am not sure what's the good of having "FLAG_USER"?

This ioctl is called from userspace, thus I think we can just treat this always
as user-fault?

> +  #define KVM_MEMORY_MAPPING_FLAG_PRIVATE       _BITULL(3)
> +
> +KVM_MAP_MEMORY populates guest memory in the underlying mapping. If source is
> +not zero and it's supported (depending on underlying technology), the guest
> +memory content is populated with the source.  The flags field supports three
> +flags: KVM_MEMORY_MAPPING_FLAG_WRITE, KVM_MEMORY_MAPPING_FLAG_EXEC, and
> +KVM_MEMORY_MAPPING_FLAG_USER.  Which corresponds to fault code for kvm page
> +fault to populate guest memory. write fault, fetch fault and user fault.
> +When it returned, the input is updated.  If nr_pages is large, it may
> +return -EAGAIN and update the values (base_gfn and nr_pages. source if not zero)
> +to point the remaining range.
> +
>  5. The kvm_run structure
>  ========================
>
Isaku Yamahata March 7, 2024, 8:33 p.m. UTC | #4
On Thu, Mar 07, 2024 at 12:30:04PM +0000,
"Huang, Kai" <kai.huang@intel.com> wrote:

> On Fri, 2024-03-01 at 09:28 -0800, isaku.yamahata@intel.com wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> > Adds documentation of KVM_MAP_MEMORY ioctl.
> > 
> > It pre-populates guest memory. And potentially do initialized memory
> > contents with encryption and measurement depending on underlying
> > technology.
> > 
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > ---
> >  Documentation/virt/kvm/api.rst | 36 ++++++++++++++++++++++++++++++++++
> >  1 file changed, 36 insertions(+)
> > 
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 0b5a33ee71ee..33d2b63f7dbf 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -6352,6 +6352,42 @@ a single guest_memfd file, but the bound ranges must not overlap).
> >  
> >  See KVM_SET_USER_MEMORY_REGION2 for additional details.
> >  
> > +4.143 KVM_MAP_MEMORY
> > +------------------------
> > +
> > +:Capability: KVM_CAP_MAP_MEMORY
> > +:Architectures: none
> > +:Type: vcpu ioctl
> 
> I think "vcpu ioctl" means theoretically it can be called on multiple vcpus.
> 
> What happens in that case?

Each vcpu can handle the ioctl simaltaneously.  If we assume tdp_mmu, each vcpu
calls the kvm fault handler simultaneously with read spinlock.
If gfn ranges overlap, vcpu will get 0 (success) or EAGAIN.


> > +:Parameters: struct kvm_memory_mapping(in/out)
> > +:Returns: 0 on success, <0 on error
> > +
> > +KVM_MAP_MEMORY populates guest memory without running vcpu.
> > +
> > +::
> > +
> > +  struct kvm_memory_mapping {
> > +	__u64 base_gfn;
> > +	__u64 nr_pages;
> > +	__u64 flags;
> > +	__u64 source;
> > +  };
> > +
> > +  /* For kvm_memory_mapping:: flags */
> > +  #define KVM_MEMORY_MAPPING_FLAG_WRITE         _BITULL(0)
> > +  #define KVM_MEMORY_MAPPING_FLAG_EXEC          _BITULL(1)
> > +  #define KVM_MEMORY_MAPPING_FLAG_USER          _BITULL(2)
> 
> I am not sure what's the good of having "FLAG_USER"?
> 
> This ioctl is called from userspace, thus I think we can just treat this always
> as user-fault?

The point is how to emulate kvm page fault as if vcpu caused the kvm page
fault.  Not we call the ioctl as user context.
Huang, Kai March 8, 2024, 12:20 a.m. UTC | #5
>>>   
>>> +4.143 KVM_MAP_MEMORY
>>> +------------------------
>>> +
>>> +:Capability: KVM_CAP_MAP_MEMORY
>>> +:Architectures: none
>>> +:Type: vcpu ioctl
>>
>> I think "vcpu ioctl" means theoretically it can be called on multiple vcpus.
>>
>> What happens in that case?
> 
> Each vcpu can handle the ioctl simaltaneously.  

Not sure whether it is implied, but should we document it can be called 
simultaneously?

Also, I believe this is only supposed to be called before VM starts to 
run?  I think we should document that too.

This is userspace ABI, we need to be explicit on how it is supposed to 
be called from userspace.

Btw, I believe there should be some justification in the changelog why 
this should be a vcpu ioctl().

[...]

>>> +:Parameters: struct kvm_memory_mapping(in/out)
>>> +:Returns: 0 on success, <0 on error
>>> +
>>> +KVM_MAP_MEMORY populates guest memory without running vcpu.
>>> +
>>> +::
>>> +
>>> +  struct kvm_memory_mapping {
>>> +	__u64 base_gfn;
>>> +	__u64 nr_pages;
>>> +	__u64 flags;
>>> +	__u64 source;
>>> +  };
>>> +
>>> +  /* For kvm_memory_mapping:: flags */
>>> +  #define KVM_MEMORY_MAPPING_FLAG_WRITE         _BITULL(0)
>>> +  #define KVM_MEMORY_MAPPING_FLAG_EXEC          _BITULL(1)
>>> +  #define KVM_MEMORY_MAPPING_FLAG_USER          _BITULL(2)
>>
>> I am not sure what's the good of having "FLAG_USER"?
>>
>> This ioctl is called from userspace, thus I think we can just treat this always
>> as user-fault?
> 
> The point is how to emulate kvm page fault as if vcpu caused the kvm page
> fault.  Not we call the ioctl as user context.

Sorry I don't quite follow.  What's wrong if KVM just append the #PF 
USER error bit before it calls into the fault handler?

My question is, since this is ABI, you have to tell how userspace is 
supposed to use this.  Maybe I am missing something, but I don't see how 
USER should be used here.
David Matlack March 8, 2024, 12:56 a.m. UTC | #6
On 2024-03-08 01:20 PM, Huang, Kai wrote:
> > > > +:Parameters: struct kvm_memory_mapping(in/out)
> > > > +:Returns: 0 on success, <0 on error
> > > > +
> > > > +KVM_MAP_MEMORY populates guest memory without running vcpu.
> > > > +
> > > > +::
> > > > +
> > > > +  struct kvm_memory_mapping {
> > > > +	__u64 base_gfn;
> > > > +	__u64 nr_pages;
> > > > +	__u64 flags;
> > > > +	__u64 source;
> > > > +  };
> > > > +
> > > > +  /* For kvm_memory_mapping:: flags */
> > > > +  #define KVM_MEMORY_MAPPING_FLAG_WRITE         _BITULL(0)
> > > > +  #define KVM_MEMORY_MAPPING_FLAG_EXEC          _BITULL(1)
> > > > +  #define KVM_MEMORY_MAPPING_FLAG_USER          _BITULL(2)
> > > 
> > > I am not sure what's the good of having "FLAG_USER"?
> > > 
> > > This ioctl is called from userspace, thus I think we can just treat this always
> > > as user-fault?
> > 
> > The point is how to emulate kvm page fault as if vcpu caused the kvm page
> > fault.  Not we call the ioctl as user context.
> 
> Sorry I don't quite follow.  What's wrong if KVM just append the #PF USER
> error bit before it calls into the fault handler?
> 
> My question is, since this is ABI, you have to tell how userspace is
> supposed to use this.  Maybe I am missing something, but I don't see how
> USER should be used here.

If we restrict this API to the TDP MMU then KVM_MEMORY_MAPPING_FLAG_USER
is meaningless, PFERR_USER_MASK is only relevant for shadow paging.

KVM_MEMORY_MAPPING_FLAG_WRITE seems useful to allow memslots to be
populated with writes (which avoids just faulting in the zero-page for
anon or tmpfs backed memslots), while also allowing populating read-only
memslots.

I don't really see a use-case for KVM_MEMORY_MAPPING_FLAG_EXEC.
Sean Christopherson March 8, 2024, 1:28 a.m. UTC | #7
On Thu, Mar 07, 2024, David Matlack wrote:
> On 2024-03-08 01:20 PM, Huang, Kai wrote:
> > > > > +:Parameters: struct kvm_memory_mapping(in/out)
> > > > > +:Returns: 0 on success, <0 on error
> > > > > +
> > > > > +KVM_MAP_MEMORY populates guest memory without running vcpu.
> > > > > +
> > > > > +::
> > > > > +
> > > > > +  struct kvm_memory_mapping {
> > > > > +	__u64 base_gfn;
> > > > > +	__u64 nr_pages;
> > > > > +	__u64 flags;
> > > > > +	__u64 source;
> > > > > +  };
> > > > > +
> > > > > +  /* For kvm_memory_mapping:: flags */
> > > > > +  #define KVM_MEMORY_MAPPING_FLAG_WRITE         _BITULL(0)
> > > > > +  #define KVM_MEMORY_MAPPING_FLAG_EXEC          _BITULL(1)
> > > > > +  #define KVM_MEMORY_MAPPING_FLAG_USER          _BITULL(2)
> > > > 
> > > > I am not sure what's the good of having "FLAG_USER"?
> > > > 
> > > > This ioctl is called from userspace, thus I think we can just treat this always
> > > > as user-fault?
> > > 
> > > The point is how to emulate kvm page fault as if vcpu caused the kvm page
> > > fault.  Not we call the ioctl as user context.
> > 
> > Sorry I don't quite follow.  What's wrong if KVM just append the #PF USER
> > error bit before it calls into the fault handler?
> > 
> > My question is, since this is ABI, you have to tell how userspace is
> > supposed to use this.  Maybe I am missing something, but I don't see how
> > USER should be used here.
> 
> If we restrict this API to the TDP MMU then KVM_MEMORY_MAPPING_FLAG_USER
> is meaningless, PFERR_USER_MASK is only relevant for shadow paging.

+1

> KVM_MEMORY_MAPPING_FLAG_WRITE seems useful to allow memslots to be
> populated with writes (which avoids just faulting in the zero-page for
> anon or tmpfs backed memslots), while also allowing populating read-only
> memslots.
> 
> I don't really see a use-case for KVM_MEMORY_MAPPING_FLAG_EXEC.

It would midly be interesting for something like the NX hugepage mitigation.

For the initial implementation, I don't think the ioctl() should specify
protections, period.

VMA-based mappings, i.e. !guest_memfd, already have a way to specify protections.
And for guest_memfd, finer grained control in general, and long term compatibility
with other features that are in-flight or proposed, I would rather userspace specify
RWX protections via KVM_SET_MEMORY_ATTRIBUTES.  Oh, and dirty logging would be a
pain too.

KVM doesn't currently support execute-only (XO) or !executable (RW), so I think
we can simply define KVM_MAP_MEMORY to behave like a read fault.  E.g. map RX,
and add W if all underlying protections allow it.

That way we can defer dealing with things like XO and RW *if* KVM ever does gain
support for specifying those combinations via KVM_SET_MEMORY_ATTRIBUTES, which
will likely be per-arch/vendor and non-trivial, e.g. AMD's NPT doesn't even allow
for XO memory.

And we shouldn't need to do anything for KVM_MAP_MEMORY in particular if
KVM_SET_MEMORY_ATTRIBUTES gains support for RWX protections the existing RWX and
RX combinations, e.g. if there's a use-case for write-protecting guest_memfd
regions.

We can always expand the uAPI, but taking away functionality is much harder, if
not impossible.
Isaku Yamahata March 8, 2024, 2:19 a.m. UTC | #8
On Thu, Mar 07, 2024 at 05:28:20PM -0800,
Sean Christopherson <seanjc@google.com> wrote:

> On Thu, Mar 07, 2024, David Matlack wrote:
> > On 2024-03-08 01:20 PM, Huang, Kai wrote:
> > > > > > +:Parameters: struct kvm_memory_mapping(in/out)
> > > > > > +:Returns: 0 on success, <0 on error
> > > > > > +
> > > > > > +KVM_MAP_MEMORY populates guest memory without running vcpu.
> > > > > > +
> > > > > > +::
> > > > > > +
> > > > > > +  struct kvm_memory_mapping {
> > > > > > +	__u64 base_gfn;
> > > > > > +	__u64 nr_pages;
> > > > > > +	__u64 flags;
> > > > > > +	__u64 source;
> > > > > > +  };
> > > > > > +
> > > > > > +  /* For kvm_memory_mapping:: flags */
> > > > > > +  #define KVM_MEMORY_MAPPING_FLAG_WRITE         _BITULL(0)
> > > > > > +  #define KVM_MEMORY_MAPPING_FLAG_EXEC          _BITULL(1)
> > > > > > +  #define KVM_MEMORY_MAPPING_FLAG_USER          _BITULL(2)
> > > > > 
> > > > > I am not sure what's the good of having "FLAG_USER"?
> > > > > 
> > > > > This ioctl is called from userspace, thus I think we can just treat this always
> > > > > as user-fault?
> > > > 
> > > > The point is how to emulate kvm page fault as if vcpu caused the kvm page
> > > > fault.  Not we call the ioctl as user context.
> > > 
> > > Sorry I don't quite follow.  What's wrong if KVM just append the #PF USER
> > > error bit before it calls into the fault handler?
> > > 
> > > My question is, since this is ABI, you have to tell how userspace is
> > > supposed to use this.  Maybe I am missing something, but I don't see how
> > > USER should be used here.
> > 
> > If we restrict this API to the TDP MMU then KVM_MEMORY_MAPPING_FLAG_USER
> > is meaningless, PFERR_USER_MASK is only relevant for shadow paging.
> 
> +1
> 
> > KVM_MEMORY_MAPPING_FLAG_WRITE seems useful to allow memslots to be
> > populated with writes (which avoids just faulting in the zero-page for
> > anon or tmpfs backed memslots), while also allowing populating read-only
> > memslots.
> > 
> > I don't really see a use-case for KVM_MEMORY_MAPPING_FLAG_EXEC.
> 
> It would midly be interesting for something like the NX hugepage mitigation.
> 
> For the initial implementation, I don't think the ioctl() should specify
> protections, period.
> 
> VMA-based mappings, i.e. !guest_memfd, already have a way to specify protections.
> And for guest_memfd, finer grained control in general, and long term compatibility
> with other features that are in-flight or proposed, I would rather userspace specify
> RWX protections via KVM_SET_MEMORY_ATTRIBUTES.  Oh, and dirty logging would be a
> pain too.
> 
> KVM doesn't currently support execute-only (XO) or !executable (RW), so I think
> we can simply define KVM_MAP_MEMORY to behave like a read fault.  E.g. map RX,
> and add W if all underlying protections allow it.
> 
> That way we can defer dealing with things like XO and RW *if* KVM ever does gain
> support for specifying those combinations via KVM_SET_MEMORY_ATTRIBUTES, which
> will likely be per-arch/vendor and non-trivial, e.g. AMD's NPT doesn't even allow
> for XO memory.
> 
> And we shouldn't need to do anything for KVM_MAP_MEMORY in particular if
> KVM_SET_MEMORY_ATTRIBUTES gains support for RWX protections the existing RWX and
> RX combinations, e.g. if there's a use-case for write-protecting guest_memfd
> regions.
> 
> We can always expand the uAPI, but taking away functionality is much harder, if
> not impossible.

Ok, let me drop all the flags.  Here is the updated one.

4.143 KVM_MAP_MEMORY
------------------------

:Capability: KVM_CAP_MAP_MEMORY
:Architectures: none
:Type: vcpu ioctl
:Parameters: struct kvm_memory_mapping(in/out)
:Returns: 0 on success, < 0 on error

Errors:

  ======   =============================================================
  EINVAL   vcpu state is not in TDP MMU mode or is in guest mode.
           Currently, this ioctl is restricted to TDP MMU.
  EAGAIN   The region is only processed partially.  The caller should
           issue the ioctl with the updated parameters.
  EINTR    An unmasked signal is pending.  The region may be processed
           partially.  If `nr_pages` > 0, the caller should issue the
           ioctl with the updated parameters.
  ======   =============================================================

KVM_MAP_MEMORY populates guest memory before the VM starts to run.  Multiple
vcpus can call this ioctl simultaneously.  It may result in the error of EAGAIN
due to race conditions.

::

  struct kvm_memory_mapping {
	__u64 base_gfn;
	__u64 nr_pages;
	__u64 flags;
	__u64 source;
  };

KVM_MAP_MEMORY populates guest memory at the specified range (`base_gfn`,
`nr_pages`) in the underlying mapping. `source` is an optional user pointer.  If
`source` is not NULL and the underlying technology supports it, the memory
contents of `source` are copied into the guest memory.  The backend may encrypt
it.  `flags` must be zero.  It's reserved for future use.

When the ioctl returns, the input values are updated.  If `nr_pages` is large,
it may return EAGAIN or EINTR for pending signal and update the values
(`base_gfn` and `nr_pages`.  `source` if not zero) to point to the remaining
range.
Michael Roth March 10, 2024, 11:12 p.m. UTC | #9
On Thu, Mar 07, 2024 at 06:19:41PM -0800, Isaku Yamahata wrote:
> On Thu, Mar 07, 2024 at 05:28:20PM -0800,
> Sean Christopherson <seanjc@google.com> wrote:
> 
> > On Thu, Mar 07, 2024, David Matlack wrote:
> > > On 2024-03-08 01:20 PM, Huang, Kai wrote:
> > > > > > > +:Parameters: struct kvm_memory_mapping(in/out)
> > > > > > > +:Returns: 0 on success, <0 on error
> > > > > > > +
> > > > > > > +KVM_MAP_MEMORY populates guest memory without running vcpu.
> > > > > > > +
> > > > > > > +::
> > > > > > > +
> > > > > > > +  struct kvm_memory_mapping {
> > > > > > > +	__u64 base_gfn;
> > > > > > > +	__u64 nr_pages;
> > > > > > > +	__u64 flags;
> > > > > > > +	__u64 source;
> > > > > > > +  };
> > > > > > > +
> > > > > > > +  /* For kvm_memory_mapping:: flags */
> > > > > > > +  #define KVM_MEMORY_MAPPING_FLAG_WRITE         _BITULL(0)
> > > > > > > +  #define KVM_MEMORY_MAPPING_FLAG_EXEC          _BITULL(1)
> > > > > > > +  #define KVM_MEMORY_MAPPING_FLAG_USER          _BITULL(2)
> > > > > > 
> > > > > > I am not sure what's the good of having "FLAG_USER"?
> > > > > > 
> > > > > > This ioctl is called from userspace, thus I think we can just treat this always
> > > > > > as user-fault?
> > > > > 
> > > > > The point is how to emulate kvm page fault as if vcpu caused the kvm page
> > > > > fault.  Not we call the ioctl as user context.
> > > > 
> > > > Sorry I don't quite follow.  What's wrong if KVM just append the #PF USER
> > > > error bit before it calls into the fault handler?
> > > > 
> > > > My question is, since this is ABI, you have to tell how userspace is
> > > > supposed to use this.  Maybe I am missing something, but I don't see how
> > > > USER should be used here.
> > > 
> > > If we restrict this API to the TDP MMU then KVM_MEMORY_MAPPING_FLAG_USER
> > > is meaningless, PFERR_USER_MASK is only relevant for shadow paging.
> > 
> > +1
> > 
> > > KVM_MEMORY_MAPPING_FLAG_WRITE seems useful to allow memslots to be
> > > populated with writes (which avoids just faulting in the zero-page for
> > > anon or tmpfs backed memslots), while also allowing populating read-only
> > > memslots.
> > > 
> > > I don't really see a use-case for KVM_MEMORY_MAPPING_FLAG_EXEC.
> > 
> > It would midly be interesting for something like the NX hugepage mitigation.
> > 
> > For the initial implementation, I don't think the ioctl() should specify
> > protections, period.
> > 
> > VMA-based mappings, i.e. !guest_memfd, already have a way to specify protections.
> > And for guest_memfd, finer grained control in general, and long term compatibility
> > with other features that are in-flight or proposed, I would rather userspace specify
> > RWX protections via KVM_SET_MEMORY_ATTRIBUTES.  Oh, and dirty logging would be a
> > pain too.
> > 
> > KVM doesn't currently support execute-only (XO) or !executable (RW), so I think
> > we can simply define KVM_MAP_MEMORY to behave like a read fault.  E.g. map RX,
> > and add W if all underlying protections allow it.
> > 
> > That way we can defer dealing with things like XO and RW *if* KVM ever does gain
> > support for specifying those combinations via KVM_SET_MEMORY_ATTRIBUTES, which
> > will likely be per-arch/vendor and non-trivial, e.g. AMD's NPT doesn't even allow
> > for XO memory.
> > 
> > And we shouldn't need to do anything for KVM_MAP_MEMORY in particular if
> > KVM_SET_MEMORY_ATTRIBUTES gains support for RWX protections the existing RWX and
> > RX combinations, e.g. if there's a use-case for write-protecting guest_memfd
> > regions.
> > 
> > We can always expand the uAPI, but taking away functionality is much harder, if
> > not impossible.
> 
> Ok, let me drop all the flags.  Here is the updated one.
> 
> 4.143 KVM_MAP_MEMORY
> ------------------------
> 
> :Capability: KVM_CAP_MAP_MEMORY
> :Architectures: none
> :Type: vcpu ioctl
> :Parameters: struct kvm_memory_mapping(in/out)
> :Returns: 0 on success, < 0 on error
> 
> Errors:
> 
>   ======   =============================================================
>   EINVAL   vcpu state is not in TDP MMU mode or is in guest mode.
>            Currently, this ioctl is restricted to TDP MMU.
>   EAGAIN   The region is only processed partially.  The caller should
>            issue the ioctl with the updated parameters.
>   EINTR    An unmasked signal is pending.  The region may be processed
>            partially.  If `nr_pages` > 0, the caller should issue the
>            ioctl with the updated parameters.
>   ======   =============================================================
> 
> KVM_MAP_MEMORY populates guest memory before the VM starts to run.  Multiple
> vcpus can call this ioctl simultaneously.  It may result in the error of EAGAIN
> due to race conditions.
> 
> ::
> 
>   struct kvm_memory_mapping {
> 	__u64 base_gfn;
> 	__u64 nr_pages;
> 	__u64 flags;
> 	__u64 source;
>   };
> 
> KVM_MAP_MEMORY populates guest memory at the specified range (`base_gfn`,
> `nr_pages`) in the underlying mapping. `source` is an optional user pointer.  If
> `source` is not NULL and the underlying technology supports it, the memory
> contents of `source` are copied into the guest memory.  The backend may encrypt
> it.  `flags` must be zero.  It's reserved for future use.
> 
> When the ioctl returns, the input values are updated.  If `nr_pages` is large,
> it may return EAGAIN or EINTR for pending signal and update the values
> (`base_gfn` and `nr_pages`.  `source` if not zero) to point to the remaining
> range.

If this intended to replace SNP_LAUNCH_UPDATE, then to be useable for SNP
guests userspace also needs to pass along the type of page being added,
which are currently defined as:

  #define KVM_SEV_SNP_PAGE_TYPE_NORMAL            0x1
  #define KVM_SEV_SNP_PAGE_TYPE_ZERO              0x3
  #define KVM_SEV_SNP_PAGE_TYPE_UNMEASURED        0x4
  #define KVM_SEV_SNP_PAGE_TYPE_SECRETS           0x5
  #define KVM_SEV_SNP_PAGE_TYPE_CPUID             0x6

So I guess the main question is, do bite the bullet now and introduce
infrastructure for vendor-specific parameters, or should we attempt to define
these as cross-vendor/cross-architecture types and hide the vendor-specific
stuff from userspace?

There are a couple other bits of vendor-specific information that would be
needed to be a total drop-in replacement for SNP_LAUNCH_UPDATE, but I think
these we could can do without:

  sev_fd: handle for /dev/sev which is used to issue SEV firmware calls
          as-needed for various KVM ioctls
          - can likely bind this to SNP context during SNP_LAUNCH_UPDATE and
            avoid needing to pass it in for subsequent calls
  error code: return parameter which passes SEV firmware error codes to
              userspace for informational purposes
              - can probably live without this

-Mike

> 
> -- 
> Isaku Yamahata <isaku.yamahata@linux.intel.com>
Huang, Kai March 11, 2024, 1:05 a.m. UTC | #10
> 
>    struct kvm_memory_mapping {
> 	__u64 base_gfn;
> 	__u64 nr_pages;
> 	__u64 flags;
> 	__u64 source;
>    };

 From your next patch the @source must be 4K aligned.  If it is intended 
please document this too.

[...]

The backend may encrypt
> it.  

To me the "backend" is a little bit confusing.  It doesn't mean the 
@source, correct?  Maybe just the "underlying physical pages may be 
encrypted" etc.  But it may be only my issue.
Huang, Kai March 11, 2024, 1:08 a.m. UTC | #11
Maybe just the "underlying physical pages may be
> encrypted" etc. 

Sorry, perhaps "encrypted" -> "crypto-protected"?
Isaku Yamahata March 12, 2024, 1:34 a.m. UTC | #12
On Mon, Mar 11, 2024 at 02:08:01PM +1300,
"Huang, Kai" <kai.huang@intel.com> wrote:

> Maybe just the "underlying physical pages may be
> > encrypted" etc.
> 
> Sorry, perhaps "encrypted" -> "crypto-protected"?

Thanks for clarifying.  Based on the discussion[1], I'm going to remove
'source' member. So I'll eliminate those sentence.

[1] https://lore.kernel.org/all/Ze-XW-EbT9vXaagC@google.com/
diff mbox series

Patch

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 0b5a33ee71ee..33d2b63f7dbf 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6352,6 +6352,42 @@  a single guest_memfd file, but the bound ranges must not overlap).
 
 See KVM_SET_USER_MEMORY_REGION2 for additional details.
 
+4.143 KVM_MAP_MEMORY
+------------------------
+
+:Capability: KVM_CAP_MAP_MEMORY
+:Architectures: none
+:Type: vcpu ioctl
+:Parameters: struct kvm_memory_mapping(in/out)
+:Returns: 0 on success, <0 on error
+
+KVM_MAP_MEMORY populates guest memory without running vcpu.
+
+::
+
+  struct kvm_memory_mapping {
+	__u64 base_gfn;
+	__u64 nr_pages;
+	__u64 flags;
+	__u64 source;
+  };
+
+  /* For kvm_memory_mapping:: flags */
+  #define KVM_MEMORY_MAPPING_FLAG_WRITE         _BITULL(0)
+  #define KVM_MEMORY_MAPPING_FLAG_EXEC          _BITULL(1)
+  #define KVM_MEMORY_MAPPING_FLAG_USER          _BITULL(2)
+  #define KVM_MEMORY_MAPPING_FLAG_PRIVATE       _BITULL(3)
+
+KVM_MAP_MEMORY populates guest memory in the underlying mapping. If source is
+not zero and it's supported (depending on underlying technology), the guest
+memory content is populated with the source.  The flags field supports three
+flags: KVM_MEMORY_MAPPING_FLAG_WRITE, KVM_MEMORY_MAPPING_FLAG_EXEC, and
+KVM_MEMORY_MAPPING_FLAG_USER.  Which corresponds to fault code for kvm page
+fault to populate guest memory. write fault, fetch fault and user fault.
+When it returned, the input is updated.  If nr_pages is large, it may
+return -EAGAIN and update the values (base_gfn and nr_pages. source if not zero)
+to point the remaining range.
+
 5. The kvm_run structure
 ========================