Message ID | c50dc98effcba3ff68a033661b2941b777c4fb5c.1709288671.git.isaku.yamahata@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | KVM: Prepopulate guest memory API | expand |
On 2024-03-01 09:28 AM, isaku.yamahata@intel.com wrote: > From: Isaku Yamahata <isaku.yamahata@intel.com> > + > + struct kvm_memory_mapping { > + __u64 base_gfn; > + __u64 nr_pages; > + __u64 flags; > + __u64 source; > + }; > + > + /* For kvm_memory_mapping:: flags */ > + #define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0) > + #define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1) > + #define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2) > + #define KVM_MEMORY_MAPPING_FLAG_PRIVATE _BITULL(3) > + > +KVM_MAP_MEMORY populates guest memory in the underlying mapping. If source is > +not zero and it's supported (depending on underlying technology), the guest > +memory content is populated with the source. What does "populated with the source" mean? > The flags field supports three > +flags: KVM_MEMORY_MAPPING_FLAG_WRITE, KVM_MEMORY_MAPPING_FLAG_EXEC, and > +KVM_MEMORY_MAPPING_FLAG_USER. There are 4 flags.
On Wed, Mar 06, 2024 at 04:43:51PM -0800, David Matlack <dmatlack@google.com> wrote: > On 2024-03-01 09:28 AM, isaku.yamahata@intel.com wrote: > > From: Isaku Yamahata <isaku.yamahata@intel.com> > > + > > + struct kvm_memory_mapping { > > + __u64 base_gfn; > > + __u64 nr_pages; > > + __u64 flags; > > + __u64 source; > > + }; > > + > > + /* For kvm_memory_mapping:: flags */ > > + #define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0) > > + #define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1) > > + #define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2) > > + #define KVM_MEMORY_MAPPING_FLAG_PRIVATE _BITULL(3) > > + > > +KVM_MAP_MEMORY populates guest memory in the underlying mapping. If source is > > +not zero and it's supported (depending on underlying technology), the guest > > +memory content is populated with the source. > > What does "populated with the source" mean? source is user pointer and the memory contents of source is copied into base_gfn. (and it will encrypted.) > > The flags field supports three > > +flags: KVM_MEMORY_MAPPING_FLAG_WRITE, KVM_MEMORY_MAPPING_FLAG_EXEC, and > > +KVM_MEMORY_MAPPING_FLAG_USER. > > There are 4 flags. Oops. Let me update it. KVM_MAP_MEMORY populates guest memory at the specified range (`base_gfn`, `nr_pages`) in the underlying mapping. `source` is an optional user pointer. If `source` is not NULL and the underlying technology supports it, the memory contents of `source` are copied into the guest memory. The backend may encrypt it. The `flags` field supports four flags: KVM_MEMORY_MAPPING_FLAG_WRITE, KVM_MEMORY_MAPPING_FLAG_EXEC, KVM_MEMORY_MAPPING_FLAG_USER, and KVM_MEMORY_MAPPING_FLAGS_PRIVATE. The first three correspond to the fault code for the KVM page fault to populate guest memory. write fault, fetch fault, and user fault. KVM_MEMORY_MAPPING_FLAGS_PRIVATE is applicable only for guest memory with guest_memfd. It is to populate guest memory with the memory attribute of KVM_MEMORY_ATTRIBUTE_PRIVATE set. When the ioctl returns, the input values are updated. If `nr_pages` is large, it may return -EAGAIN and update the values (`base_gfn` and `nr_pages`. `source` if not zero) to point to the remaining range.
On Fri, 2024-03-01 at 09:28 -0800, isaku.yamahata@intel.com wrote: > From: Isaku Yamahata <isaku.yamahata@intel.com> > > Adds documentation of KVM_MAP_MEMORY ioctl. > > It pre-populates guest memory. And potentially do initialized memory > contents with encryption and measurement depending on underlying > technology. > > Suggested-by: Sean Christopherson <seanjc@google.com> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> > --- > Documentation/virt/kvm/api.rst | 36 ++++++++++++++++++++++++++++++++++ > 1 file changed, 36 insertions(+) > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst > index 0b5a33ee71ee..33d2b63f7dbf 100644 > --- a/Documentation/virt/kvm/api.rst > +++ b/Documentation/virt/kvm/api.rst > @@ -6352,6 +6352,42 @@ a single guest_memfd file, but the bound ranges must not overlap). > > See KVM_SET_USER_MEMORY_REGION2 for additional details. > > +4.143 KVM_MAP_MEMORY > +------------------------ > + > +:Capability: KVM_CAP_MAP_MEMORY > +:Architectures: none > +:Type: vcpu ioctl I think "vcpu ioctl" means theoretically it can be called on multiple vcpus. What happens in that case? > +:Parameters: struct kvm_memory_mapping(in/out) > +:Returns: 0 on success, <0 on error > + > +KVM_MAP_MEMORY populates guest memory without running vcpu. > + > +:: > + > + struct kvm_memory_mapping { > + __u64 base_gfn; > + __u64 nr_pages; > + __u64 flags; > + __u64 source; > + }; > + > + /* For kvm_memory_mapping:: flags */ > + #define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0) > + #define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1) > + #define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2) I am not sure what's the good of having "FLAG_USER"? This ioctl is called from userspace, thus I think we can just treat this always as user-fault? > + #define KVM_MEMORY_MAPPING_FLAG_PRIVATE _BITULL(3) > + > +KVM_MAP_MEMORY populates guest memory in the underlying mapping. If source is > +not zero and it's supported (depending on underlying technology), the guest > +memory content is populated with the source. The flags field supports three > +flags: KVM_MEMORY_MAPPING_FLAG_WRITE, KVM_MEMORY_MAPPING_FLAG_EXEC, and > +KVM_MEMORY_MAPPING_FLAG_USER. Which corresponds to fault code for kvm page > +fault to populate guest memory. write fault, fetch fault and user fault. > +When it returned, the input is updated. If nr_pages is large, it may > +return -EAGAIN and update the values (base_gfn and nr_pages. source if not zero) > +to point the remaining range. > + > 5. The kvm_run structure > ======================== >
On Thu, Mar 07, 2024 at 12:30:04PM +0000, "Huang, Kai" <kai.huang@intel.com> wrote: > On Fri, 2024-03-01 at 09:28 -0800, isaku.yamahata@intel.com wrote: > > From: Isaku Yamahata <isaku.yamahata@intel.com> > > > > Adds documentation of KVM_MAP_MEMORY ioctl. > > > > It pre-populates guest memory. And potentially do initialized memory > > contents with encryption and measurement depending on underlying > > technology. > > > > Suggested-by: Sean Christopherson <seanjc@google.com> > > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> > > --- > > Documentation/virt/kvm/api.rst | 36 ++++++++++++++++++++++++++++++++++ > > 1 file changed, 36 insertions(+) > > > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst > > index 0b5a33ee71ee..33d2b63f7dbf 100644 > > --- a/Documentation/virt/kvm/api.rst > > +++ b/Documentation/virt/kvm/api.rst > > @@ -6352,6 +6352,42 @@ a single guest_memfd file, but the bound ranges must not overlap). > > > > See KVM_SET_USER_MEMORY_REGION2 for additional details. > > > > +4.143 KVM_MAP_MEMORY > > +------------------------ > > + > > +:Capability: KVM_CAP_MAP_MEMORY > > +:Architectures: none > > +:Type: vcpu ioctl > > I think "vcpu ioctl" means theoretically it can be called on multiple vcpus. > > What happens in that case? Each vcpu can handle the ioctl simaltaneously. If we assume tdp_mmu, each vcpu calls the kvm fault handler simultaneously with read spinlock. If gfn ranges overlap, vcpu will get 0 (success) or EAGAIN. > > +:Parameters: struct kvm_memory_mapping(in/out) > > +:Returns: 0 on success, <0 on error > > + > > +KVM_MAP_MEMORY populates guest memory without running vcpu. > > + > > +:: > > + > > + struct kvm_memory_mapping { > > + __u64 base_gfn; > > + __u64 nr_pages; > > + __u64 flags; > > + __u64 source; > > + }; > > + > > + /* For kvm_memory_mapping:: flags */ > > + #define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0) > > + #define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1) > > + #define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2) > > I am not sure what's the good of having "FLAG_USER"? > > This ioctl is called from userspace, thus I think we can just treat this always > as user-fault? The point is how to emulate kvm page fault as if vcpu caused the kvm page fault. Not we call the ioctl as user context.
>>> >>> +4.143 KVM_MAP_MEMORY >>> +------------------------ >>> + >>> +:Capability: KVM_CAP_MAP_MEMORY >>> +:Architectures: none >>> +:Type: vcpu ioctl >> >> I think "vcpu ioctl" means theoretically it can be called on multiple vcpus. >> >> What happens in that case? > > Each vcpu can handle the ioctl simaltaneously. Not sure whether it is implied, but should we document it can be called simultaneously? Also, I believe this is only supposed to be called before VM starts to run? I think we should document that too. This is userspace ABI, we need to be explicit on how it is supposed to be called from userspace. Btw, I believe there should be some justification in the changelog why this should be a vcpu ioctl(). [...] >>> +:Parameters: struct kvm_memory_mapping(in/out) >>> +:Returns: 0 on success, <0 on error >>> + >>> +KVM_MAP_MEMORY populates guest memory without running vcpu. >>> + >>> +:: >>> + >>> + struct kvm_memory_mapping { >>> + __u64 base_gfn; >>> + __u64 nr_pages; >>> + __u64 flags; >>> + __u64 source; >>> + }; >>> + >>> + /* For kvm_memory_mapping:: flags */ >>> + #define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0) >>> + #define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1) >>> + #define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2) >> >> I am not sure what's the good of having "FLAG_USER"? >> >> This ioctl is called from userspace, thus I think we can just treat this always >> as user-fault? > > The point is how to emulate kvm page fault as if vcpu caused the kvm page > fault. Not we call the ioctl as user context. Sorry I don't quite follow. What's wrong if KVM just append the #PF USER error bit before it calls into the fault handler? My question is, since this is ABI, you have to tell how userspace is supposed to use this. Maybe I am missing something, but I don't see how USER should be used here.
On 2024-03-08 01:20 PM, Huang, Kai wrote: > > > > +:Parameters: struct kvm_memory_mapping(in/out) > > > > +:Returns: 0 on success, <0 on error > > > > + > > > > +KVM_MAP_MEMORY populates guest memory without running vcpu. > > > > + > > > > +:: > > > > + > > > > + struct kvm_memory_mapping { > > > > + __u64 base_gfn; > > > > + __u64 nr_pages; > > > > + __u64 flags; > > > > + __u64 source; > > > > + }; > > > > + > > > > + /* For kvm_memory_mapping:: flags */ > > > > + #define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0) > > > > + #define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1) > > > > + #define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2) > > > > > > I am not sure what's the good of having "FLAG_USER"? > > > > > > This ioctl is called from userspace, thus I think we can just treat this always > > > as user-fault? > > > > The point is how to emulate kvm page fault as if vcpu caused the kvm page > > fault. Not we call the ioctl as user context. > > Sorry I don't quite follow. What's wrong if KVM just append the #PF USER > error bit before it calls into the fault handler? > > My question is, since this is ABI, you have to tell how userspace is > supposed to use this. Maybe I am missing something, but I don't see how > USER should be used here. If we restrict this API to the TDP MMU then KVM_MEMORY_MAPPING_FLAG_USER is meaningless, PFERR_USER_MASK is only relevant for shadow paging. KVM_MEMORY_MAPPING_FLAG_WRITE seems useful to allow memslots to be populated with writes (which avoids just faulting in the zero-page for anon or tmpfs backed memslots), while also allowing populating read-only memslots. I don't really see a use-case for KVM_MEMORY_MAPPING_FLAG_EXEC.
On Thu, Mar 07, 2024, David Matlack wrote: > On 2024-03-08 01:20 PM, Huang, Kai wrote: > > > > > +:Parameters: struct kvm_memory_mapping(in/out) > > > > > +:Returns: 0 on success, <0 on error > > > > > + > > > > > +KVM_MAP_MEMORY populates guest memory without running vcpu. > > > > > + > > > > > +:: > > > > > + > > > > > + struct kvm_memory_mapping { > > > > > + __u64 base_gfn; > > > > > + __u64 nr_pages; > > > > > + __u64 flags; > > > > > + __u64 source; > > > > > + }; > > > > > + > > > > > + /* For kvm_memory_mapping:: flags */ > > > > > + #define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0) > > > > > + #define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1) > > > > > + #define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2) > > > > > > > > I am not sure what's the good of having "FLAG_USER"? > > > > > > > > This ioctl is called from userspace, thus I think we can just treat this always > > > > as user-fault? > > > > > > The point is how to emulate kvm page fault as if vcpu caused the kvm page > > > fault. Not we call the ioctl as user context. > > > > Sorry I don't quite follow. What's wrong if KVM just append the #PF USER > > error bit before it calls into the fault handler? > > > > My question is, since this is ABI, you have to tell how userspace is > > supposed to use this. Maybe I am missing something, but I don't see how > > USER should be used here. > > If we restrict this API to the TDP MMU then KVM_MEMORY_MAPPING_FLAG_USER > is meaningless, PFERR_USER_MASK is only relevant for shadow paging. +1 > KVM_MEMORY_MAPPING_FLAG_WRITE seems useful to allow memslots to be > populated with writes (which avoids just faulting in the zero-page for > anon or tmpfs backed memslots), while also allowing populating read-only > memslots. > > I don't really see a use-case for KVM_MEMORY_MAPPING_FLAG_EXEC. It would midly be interesting for something like the NX hugepage mitigation. For the initial implementation, I don't think the ioctl() should specify protections, period. VMA-based mappings, i.e. !guest_memfd, already have a way to specify protections. And for guest_memfd, finer grained control in general, and long term compatibility with other features that are in-flight or proposed, I would rather userspace specify RWX protections via KVM_SET_MEMORY_ATTRIBUTES. Oh, and dirty logging would be a pain too. KVM doesn't currently support execute-only (XO) or !executable (RW), so I think we can simply define KVM_MAP_MEMORY to behave like a read fault. E.g. map RX, and add W if all underlying protections allow it. That way we can defer dealing with things like XO and RW *if* KVM ever does gain support for specifying those combinations via KVM_SET_MEMORY_ATTRIBUTES, which will likely be per-arch/vendor and non-trivial, e.g. AMD's NPT doesn't even allow for XO memory. And we shouldn't need to do anything for KVM_MAP_MEMORY in particular if KVM_SET_MEMORY_ATTRIBUTES gains support for RWX protections the existing RWX and RX combinations, e.g. if there's a use-case for write-protecting guest_memfd regions. We can always expand the uAPI, but taking away functionality is much harder, if not impossible.
On Thu, Mar 07, 2024 at 05:28:20PM -0800, Sean Christopherson <seanjc@google.com> wrote: > On Thu, Mar 07, 2024, David Matlack wrote: > > On 2024-03-08 01:20 PM, Huang, Kai wrote: > > > > > > +:Parameters: struct kvm_memory_mapping(in/out) > > > > > > +:Returns: 0 on success, <0 on error > > > > > > + > > > > > > +KVM_MAP_MEMORY populates guest memory without running vcpu. > > > > > > + > > > > > > +:: > > > > > > + > > > > > > + struct kvm_memory_mapping { > > > > > > + __u64 base_gfn; > > > > > > + __u64 nr_pages; > > > > > > + __u64 flags; > > > > > > + __u64 source; > > > > > > + }; > > > > > > + > > > > > > + /* For kvm_memory_mapping:: flags */ > > > > > > + #define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0) > > > > > > + #define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1) > > > > > > + #define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2) > > > > > > > > > > I am not sure what's the good of having "FLAG_USER"? > > > > > > > > > > This ioctl is called from userspace, thus I think we can just treat this always > > > > > as user-fault? > > > > > > > > The point is how to emulate kvm page fault as if vcpu caused the kvm page > > > > fault. Not we call the ioctl as user context. > > > > > > Sorry I don't quite follow. What's wrong if KVM just append the #PF USER > > > error bit before it calls into the fault handler? > > > > > > My question is, since this is ABI, you have to tell how userspace is > > > supposed to use this. Maybe I am missing something, but I don't see how > > > USER should be used here. > > > > If we restrict this API to the TDP MMU then KVM_MEMORY_MAPPING_FLAG_USER > > is meaningless, PFERR_USER_MASK is only relevant for shadow paging. > > +1 > > > KVM_MEMORY_MAPPING_FLAG_WRITE seems useful to allow memslots to be > > populated with writes (which avoids just faulting in the zero-page for > > anon or tmpfs backed memslots), while also allowing populating read-only > > memslots. > > > > I don't really see a use-case for KVM_MEMORY_MAPPING_FLAG_EXEC. > > It would midly be interesting for something like the NX hugepage mitigation. > > For the initial implementation, I don't think the ioctl() should specify > protections, period. > > VMA-based mappings, i.e. !guest_memfd, already have a way to specify protections. > And for guest_memfd, finer grained control in general, and long term compatibility > with other features that are in-flight or proposed, I would rather userspace specify > RWX protections via KVM_SET_MEMORY_ATTRIBUTES. Oh, and dirty logging would be a > pain too. > > KVM doesn't currently support execute-only (XO) or !executable (RW), so I think > we can simply define KVM_MAP_MEMORY to behave like a read fault. E.g. map RX, > and add W if all underlying protections allow it. > > That way we can defer dealing with things like XO and RW *if* KVM ever does gain > support for specifying those combinations via KVM_SET_MEMORY_ATTRIBUTES, which > will likely be per-arch/vendor and non-trivial, e.g. AMD's NPT doesn't even allow > for XO memory. > > And we shouldn't need to do anything for KVM_MAP_MEMORY in particular if > KVM_SET_MEMORY_ATTRIBUTES gains support for RWX protections the existing RWX and > RX combinations, e.g. if there's a use-case for write-protecting guest_memfd > regions. > > We can always expand the uAPI, but taking away functionality is much harder, if > not impossible. Ok, let me drop all the flags. Here is the updated one. 4.143 KVM_MAP_MEMORY ------------------------ :Capability: KVM_CAP_MAP_MEMORY :Architectures: none :Type: vcpu ioctl :Parameters: struct kvm_memory_mapping(in/out) :Returns: 0 on success, < 0 on error Errors: ====== ============================================================= EINVAL vcpu state is not in TDP MMU mode or is in guest mode. Currently, this ioctl is restricted to TDP MMU. EAGAIN The region is only processed partially. The caller should issue the ioctl with the updated parameters. EINTR An unmasked signal is pending. The region may be processed partially. If `nr_pages` > 0, the caller should issue the ioctl with the updated parameters. ====== ============================================================= KVM_MAP_MEMORY populates guest memory before the VM starts to run. Multiple vcpus can call this ioctl simultaneously. It may result in the error of EAGAIN due to race conditions. :: struct kvm_memory_mapping { __u64 base_gfn; __u64 nr_pages; __u64 flags; __u64 source; }; KVM_MAP_MEMORY populates guest memory at the specified range (`base_gfn`, `nr_pages`) in the underlying mapping. `source` is an optional user pointer. If `source` is not NULL and the underlying technology supports it, the memory contents of `source` are copied into the guest memory. The backend may encrypt it. `flags` must be zero. It's reserved for future use. When the ioctl returns, the input values are updated. If `nr_pages` is large, it may return EAGAIN or EINTR for pending signal and update the values (`base_gfn` and `nr_pages`. `source` if not zero) to point to the remaining range.
On Thu, Mar 07, 2024 at 06:19:41PM -0800, Isaku Yamahata wrote: > On Thu, Mar 07, 2024 at 05:28:20PM -0800, > Sean Christopherson <seanjc@google.com> wrote: > > > On Thu, Mar 07, 2024, David Matlack wrote: > > > On 2024-03-08 01:20 PM, Huang, Kai wrote: > > > > > > > +:Parameters: struct kvm_memory_mapping(in/out) > > > > > > > +:Returns: 0 on success, <0 on error > > > > > > > + > > > > > > > +KVM_MAP_MEMORY populates guest memory without running vcpu. > > > > > > > + > > > > > > > +:: > > > > > > > + > > > > > > > + struct kvm_memory_mapping { > > > > > > > + __u64 base_gfn; > > > > > > > + __u64 nr_pages; > > > > > > > + __u64 flags; > > > > > > > + __u64 source; > > > > > > > + }; > > > > > > > + > > > > > > > + /* For kvm_memory_mapping:: flags */ > > > > > > > + #define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0) > > > > > > > + #define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1) > > > > > > > + #define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2) > > > > > > > > > > > > I am not sure what's the good of having "FLAG_USER"? > > > > > > > > > > > > This ioctl is called from userspace, thus I think we can just treat this always > > > > > > as user-fault? > > > > > > > > > > The point is how to emulate kvm page fault as if vcpu caused the kvm page > > > > > fault. Not we call the ioctl as user context. > > > > > > > > Sorry I don't quite follow. What's wrong if KVM just append the #PF USER > > > > error bit before it calls into the fault handler? > > > > > > > > My question is, since this is ABI, you have to tell how userspace is > > > > supposed to use this. Maybe I am missing something, but I don't see how > > > > USER should be used here. > > > > > > If we restrict this API to the TDP MMU then KVM_MEMORY_MAPPING_FLAG_USER > > > is meaningless, PFERR_USER_MASK is only relevant for shadow paging. > > > > +1 > > > > > KVM_MEMORY_MAPPING_FLAG_WRITE seems useful to allow memslots to be > > > populated with writes (which avoids just faulting in the zero-page for > > > anon or tmpfs backed memslots), while also allowing populating read-only > > > memslots. > > > > > > I don't really see a use-case for KVM_MEMORY_MAPPING_FLAG_EXEC. > > > > It would midly be interesting for something like the NX hugepage mitigation. > > > > For the initial implementation, I don't think the ioctl() should specify > > protections, period. > > > > VMA-based mappings, i.e. !guest_memfd, already have a way to specify protections. > > And for guest_memfd, finer grained control in general, and long term compatibility > > with other features that are in-flight or proposed, I would rather userspace specify > > RWX protections via KVM_SET_MEMORY_ATTRIBUTES. Oh, and dirty logging would be a > > pain too. > > > > KVM doesn't currently support execute-only (XO) or !executable (RW), so I think > > we can simply define KVM_MAP_MEMORY to behave like a read fault. E.g. map RX, > > and add W if all underlying protections allow it. > > > > That way we can defer dealing with things like XO and RW *if* KVM ever does gain > > support for specifying those combinations via KVM_SET_MEMORY_ATTRIBUTES, which > > will likely be per-arch/vendor and non-trivial, e.g. AMD's NPT doesn't even allow > > for XO memory. > > > > And we shouldn't need to do anything for KVM_MAP_MEMORY in particular if > > KVM_SET_MEMORY_ATTRIBUTES gains support for RWX protections the existing RWX and > > RX combinations, e.g. if there's a use-case for write-protecting guest_memfd > > regions. > > > > We can always expand the uAPI, but taking away functionality is much harder, if > > not impossible. > > Ok, let me drop all the flags. Here is the updated one. > > 4.143 KVM_MAP_MEMORY > ------------------------ > > :Capability: KVM_CAP_MAP_MEMORY > :Architectures: none > :Type: vcpu ioctl > :Parameters: struct kvm_memory_mapping(in/out) > :Returns: 0 on success, < 0 on error > > Errors: > > ====== ============================================================= > EINVAL vcpu state is not in TDP MMU mode or is in guest mode. > Currently, this ioctl is restricted to TDP MMU. > EAGAIN The region is only processed partially. The caller should > issue the ioctl with the updated parameters. > EINTR An unmasked signal is pending. The region may be processed > partially. If `nr_pages` > 0, the caller should issue the > ioctl with the updated parameters. > ====== ============================================================= > > KVM_MAP_MEMORY populates guest memory before the VM starts to run. Multiple > vcpus can call this ioctl simultaneously. It may result in the error of EAGAIN > due to race conditions. > > :: > > struct kvm_memory_mapping { > __u64 base_gfn; > __u64 nr_pages; > __u64 flags; > __u64 source; > }; > > KVM_MAP_MEMORY populates guest memory at the specified range (`base_gfn`, > `nr_pages`) in the underlying mapping. `source` is an optional user pointer. If > `source` is not NULL and the underlying technology supports it, the memory > contents of `source` are copied into the guest memory. The backend may encrypt > it. `flags` must be zero. It's reserved for future use. > > When the ioctl returns, the input values are updated. If `nr_pages` is large, > it may return EAGAIN or EINTR for pending signal and update the values > (`base_gfn` and `nr_pages`. `source` if not zero) to point to the remaining > range. If this intended to replace SNP_LAUNCH_UPDATE, then to be useable for SNP guests userspace also needs to pass along the type of page being added, which are currently defined as: #define KVM_SEV_SNP_PAGE_TYPE_NORMAL 0x1 #define KVM_SEV_SNP_PAGE_TYPE_ZERO 0x3 #define KVM_SEV_SNP_PAGE_TYPE_UNMEASURED 0x4 #define KVM_SEV_SNP_PAGE_TYPE_SECRETS 0x5 #define KVM_SEV_SNP_PAGE_TYPE_CPUID 0x6 So I guess the main question is, do bite the bullet now and introduce infrastructure for vendor-specific parameters, or should we attempt to define these as cross-vendor/cross-architecture types and hide the vendor-specific stuff from userspace? There are a couple other bits of vendor-specific information that would be needed to be a total drop-in replacement for SNP_LAUNCH_UPDATE, but I think these we could can do without: sev_fd: handle for /dev/sev which is used to issue SEV firmware calls as-needed for various KVM ioctls - can likely bind this to SNP context during SNP_LAUNCH_UPDATE and avoid needing to pass it in for subsequent calls error code: return parameter which passes SEV firmware error codes to userspace for informational purposes - can probably live without this -Mike > > -- > Isaku Yamahata <isaku.yamahata@linux.intel.com>
> > struct kvm_memory_mapping { > __u64 base_gfn; > __u64 nr_pages; > __u64 flags; > __u64 source; > }; From your next patch the @source must be 4K aligned. If it is intended please document this too. [...] The backend may encrypt > it. To me the "backend" is a little bit confusing. It doesn't mean the @source, correct? Maybe just the "underlying physical pages may be encrypted" etc. But it may be only my issue.
Maybe just the "underlying physical pages may be
> encrypted" etc.
Sorry, perhaps "encrypted" -> "crypto-protected"?
On Mon, Mar 11, 2024 at 02:08:01PM +1300, "Huang, Kai" <kai.huang@intel.com> wrote: > Maybe just the "underlying physical pages may be > > encrypted" etc. > > Sorry, perhaps "encrypted" -> "crypto-protected"? Thanks for clarifying. Based on the discussion[1], I'm going to remove 'source' member. So I'll eliminate those sentence. [1] https://lore.kernel.org/all/Ze-XW-EbT9vXaagC@google.com/
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 0b5a33ee71ee..33d2b63f7dbf 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -6352,6 +6352,42 @@ a single guest_memfd file, but the bound ranges must not overlap). See KVM_SET_USER_MEMORY_REGION2 for additional details. +4.143 KVM_MAP_MEMORY +------------------------ + +:Capability: KVM_CAP_MAP_MEMORY +:Architectures: none +:Type: vcpu ioctl +:Parameters: struct kvm_memory_mapping(in/out) +:Returns: 0 on success, <0 on error + +KVM_MAP_MEMORY populates guest memory without running vcpu. + +:: + + struct kvm_memory_mapping { + __u64 base_gfn; + __u64 nr_pages; + __u64 flags; + __u64 source; + }; + + /* For kvm_memory_mapping:: flags */ + #define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0) + #define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1) + #define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2) + #define KVM_MEMORY_MAPPING_FLAG_PRIVATE _BITULL(3) + +KVM_MAP_MEMORY populates guest memory in the underlying mapping. If source is +not zero and it's supported (depending on underlying technology), the guest +memory content is populated with the source. The flags field supports three +flags: KVM_MEMORY_MAPPING_FLAG_WRITE, KVM_MEMORY_MAPPING_FLAG_EXEC, and +KVM_MEMORY_MAPPING_FLAG_USER. Which corresponds to fault code for kvm page +fault to populate guest memory. write fault, fetch fault and user fault. +When it returned, the input is updated. If nr_pages is large, it may +return -EAGAIN and update the values (base_gfn and nr_pages. source if not zero) +to point the remaining range. + 5. The kvm_run structure ========================