Message ID | 20220519153713.819591-5-chao.p.peng@linux.intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | KVM: mm: fd-based approach for supporting KVM guest private memory | expand |
On 5/19/22 08:37, Chao Peng wrote: > Extend the memslot definition to provide guest private memory through a > file descriptor(fd) instead of userspace_addr(hva). Such guest private > memory(fd) may never be mapped into userspace so no userspace_addr(hva) > can be used. Instead add another two new fields > (private_fd/private_offset), plus the existing memory_size to represent > the private memory range. Such memslot can still have the existing > userspace_addr(hva). When use, a single memslot can maintain both > private memory through private fd(private_fd/private_offset) and shared > memory through hva(userspace_addr). A GPA is considered private by KVM > if the memslot has private fd and that corresponding page in the private > fd is populated, otherwise, it's shared. > So this is a strange API and, IMO, a layering violation. I want to make sure that we're all actually on board with making this a permanent part of the Linux API. Specifically, we end up with a multiplexing situation as you have described. For a given GPA, there are *two* possible host backings: an fd-backed one (from the fd, which is private for now might might end up potentially shared depending on future extensions) and a VMA-backed one. The selection of which one backs the address is made internally by whatever backs the fd. This, IMO, a clear layering violation. Normally, an fd has an associated address space, and pages in that address space can have contents, can be holes that appear to contain all zeros, or could have holes that are inaccessible. If you try to access a hole, you get whatever is in the hole. But now, with this patchset, the fd is more of an overlay and you get *something else* if you try to access through the hole. This results in operations on the fd bubbling up to the KVM mapping in what is, IMO, a strange way. If the user punches a hole, KVM has to modify its mappings such that the GPA goes to whatever VMA may be there. (And update the RMP, the hypervisor's tables, or whatever else might actually control privateness.) Conversely, if the user does fallocate to fill a hole, the guest mapping *to an unrelated page* has to be zapped so that the fd's page shows up. And the RMP needs updating, etc. I am lukewarm on this for a few reasons. 1. This is weird. AFAIK nothing else works like this. Obviously this is subjecting, but "weird" and "layering violation" sometimes translate to "problematic locking". 2. fd-backed private memory can't have normal holes. If I make a memfd, punch a hole in it, and mmap(MAP_SHARED) it, I end up with a page that reads as zero. If I write to it, the page gets allocated. But with this new mechanism, if I punch a hole and put it in a memslot, reads and writes go somewhere else. So what if I actually wanted lazily allocated private zeros? 2b. For a hypothetical future extension in which an fd can also have shared pages (for conversion, for example, or simply because the fd backing might actually be more efficient than indirecting through VMAs and therefore get used for shared memory or entirely-non-confidential VMs), lazy fd-backed zeros sound genuinely useful. 3. TDX hardware capability is not fully exposed. TDX can have a private page and a shared page at GPAs that differ only by the private bit. Sure, no one plans to use this today, but baking this into the user ABI throws away half the potential address space. 3b. Any software solution that works like TDX (which IMO seems like an eminently reasonable design to me) has the same issue. The alternative would be to have some kind of separate table or bitmap (part of the memslot?) that tells KVM whether a GPA should map to the fd. What do you all think?
On Fri, May 20, 2022, Andy Lutomirski wrote: > The alternative would be to have some kind of separate table or bitmap (part > of the memslot?) that tells KVM whether a GPA should map to the fd. > > What do you all think? My original proposal was to have expolicit shared vs. private memslots, and punch holes in KVM's memslots on conversion, but due to the way KVM (and userspace) handle memslot updates, conversions would be painfully slow. That's how we ended up with the current propsoal. But a dedicated KVM ioctl() to add/remove shared ranges would be easy to implement and wouldn't necessarily even need to interact with the memslots. It could be a consumer of memslots, e.g. if we wanted to disallow registering regions without an associated memslot, but I think we'd want to avoid even that because things will get messy during memslot updates, e.g. if dirty logging is toggled or a shared memory region is temporarily removed then we wouldn't want to destroy the tracking. I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray should be far more efficient. One benefit to explicitly tracking this in KVM is that it might be useful for software-only protected VMs, e.g. KVM could mark a region in the XArray as "pending" based on guest hypercalls to share/unshare memory, and then complete the transaction when userspace invokes the ioctl() to complete the share/unshare.
On Fri, May 20, 2022, at 11:31 AM, Sean Christopherson wrote: > But a dedicated KVM ioctl() to add/remove shared ranges would be easy > to implement > and wouldn't necessarily even need to interact with the memslots. It > could be a > consumer of memslots, e.g. if we wanted to disallow registering regions > without an > associated memslot, but I think we'd want to avoid even that because > things will > get messy during memslot updates, e.g. if dirty logging is toggled or a > shared > memory region is temporarily removed then we wouldn't want to destroy > the tracking. > > I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray > should be far more efficient. > > One benefit to explicitly tracking this in KVM is that it might be > useful for > software-only protected VMs, e.g. KVM could mark a region in the XArray > as "pending" > based on guest hypercalls to share/unshare memory, and then complete > the transaction > when userspace invokes the ioctl() to complete the share/unshare. That makes sense. If KVM goes this route, perhaps there the allowed states for a GPA should include private, shared, and also private-and-shared. Then anyone who wanted to use the same masked GPA for shared and private on TDX could do so if they wanted to.
On Fri, May 20, 2022 at 06:31:02PM +0000, Sean Christopherson wrote: > On Fri, May 20, 2022, Andy Lutomirski wrote: > > The alternative would be to have some kind of separate table or bitmap (part > > of the memslot?) that tells KVM whether a GPA should map to the fd. > > > > What do you all think? > > My original proposal was to have expolicit shared vs. private memslots, and punch > holes in KVM's memslots on conversion, but due to the way KVM (and userspace) > handle memslot updates, conversions would be painfully slow. That's how we ended > up with the current propsoal. > > But a dedicated KVM ioctl() to add/remove shared ranges would be easy to implement > and wouldn't necessarily even need to interact with the memslots. It could be a > consumer of memslots, e.g. if we wanted to disallow registering regions without an > associated memslot, but I think we'd want to avoid even that because things will > get messy during memslot updates, e.g. if dirty logging is toggled or a shared > memory region is temporarily removed then we wouldn't want to destroy the tracking. Even we don't tight that to memslots, that info can only be effective for private memslot, right? Setting this ioctl to memory ranges defined in a traditional non-private memslots just makes no sense, I guess we can comment that in the API document. > > I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray > should be far more efficient. What about the mis-behaved guest? I don't want to design for the worst case, but people may raise concern on the attack from such guest. > > One benefit to explicitly tracking this in KVM is that it might be useful for > software-only protected VMs, e.g. KVM could mark a region in the XArray as "pending" > based on guest hypercalls to share/unshare memory, and then complete the transaction > when userspace invokes the ioctl() to complete the share/unshare. OK, then this can be another field of states/flags/attributes. Let me dig up certain level of details: First, introduce below KVM ioctl KVM_SET_MEMORY_ATTR struct kvm_memory_attr { __u64 addr; /* page aligned */ __u64 size; /* page aligned */ #define KVM_MEMORY_ATTR_SHARED (1 << 0) #define KVM_MEMORY_ATTR_PRIVATE (1 << 1) __u64 flags; } Second, check the KVM maintained guest memory attributes in page fault handler (instead of checking memory existence in private fd) Third, the memfile_notifier_ops (populate/invalidate) will be removed from current code, the old mapping zapping can be directly handled in this new KVM ioctl(). Thought? Since this info is stored in KVM, which I think is reasonable. But for other potential memfile_notifier users like VFIO, some KVM-to-VFIO APIs might be needed depends on the implementaion. It is also possible to maintain this info purely in userspace. The only trick bit is implicit conversion support that has to be checked in KVM page fault handler and is in the fast path. Thanks, Chao
On Mon, May 23, 2022, Chao Peng wrote: > On Fri, May 20, 2022 at 06:31:02PM +0000, Sean Christopherson wrote: > > On Fri, May 20, 2022, Andy Lutomirski wrote: > > > The alternative would be to have some kind of separate table or bitmap (part > > > of the memslot?) that tells KVM whether a GPA should map to the fd. > > > > > > What do you all think? > > > > My original proposal was to have expolicit shared vs. private memslots, and punch > > holes in KVM's memslots on conversion, but due to the way KVM (and userspace) > > handle memslot updates, conversions would be painfully slow. That's how we ended > > up with the current propsoal. > > > > But a dedicated KVM ioctl() to add/remove shared ranges would be easy to implement > > and wouldn't necessarily even need to interact with the memslots. It could be a > > consumer of memslots, e.g. if we wanted to disallow registering regions without an > > associated memslot, but I think we'd want to avoid even that because things will > > get messy during memslot updates, e.g. if dirty logging is toggled or a shared > > memory region is temporarily removed then we wouldn't want to destroy the tracking. > > Even we don't tight that to memslots, that info can only be effective > for private memslot, right? Setting this ioctl to memory ranges defined > in a traditional non-private memslots just makes no sense, I guess we can > comment that in the API document. Hrm, applying it universally would be funky, e.g. emulated MMIO would need to be declared "shared". But, applying it selectively would arguably be worse, e.g. letting userspace map memory into the guest as shared for a region that's registered as private... On option to that mess would be to make memory shared by default, and so userspace must declare regions that are private. Then there's no weirdness with emulated MMIO or "legacy" memslots. On page fault, KVM does a lookup to see if the GPA is shared or private. If the GPA is private, but there is no memslot or the memslot doesn't have a private fd, KVM exits to userspace. If there's a memslot with a private fd, the shared/private flag is used to resolve the And to handle the ioctl(), KVM can use kvm_zap_gfn_range(), which will bump the notifier sequence, i.e. force the page fault to retry if the GPA may have been (un)registered between checking the type and acquiring mmu_lock. > > I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray > > should be far more efficient. > > What about the mis-behaved guest? I don't want to design for the worst > case, but people may raise concern on the attack from such guest. That's why cgroups exist. E.g. a malicious/broken L1 can similarly abuse nested EPT/NPT to generate a large number of shadow page tables. > > One benefit to explicitly tracking this in KVM is that it might be useful for > > software-only protected VMs, e.g. KVM could mark a region in the XArray as "pending" > > based on guest hypercalls to share/unshare memory, and then complete the transaction > > when userspace invokes the ioctl() to complete the share/unshare. > > OK, then this can be another field of states/flags/attributes. Let me > dig up certain level of details: > > First, introduce below KVM ioctl > > KVM_SET_MEMORY_ATTR Actually, if the semantics are that userspace declares memory as private, then we can reuse KVM_MEMORY_ENCRYPT_REG_REGION and KVM_MEMORY_ENCRYPT_UNREG_REGION. It'd be a little gross because we'd need to slightly redefine the semantics for TDX, SNP, and software-protected VM types, e.g. the ioctls() currently require a pre-exisitng memslot. But I think it'd work... I'll think more on this...
On Mon, May 23, 2022 at 03:22:32PM +0000, Sean Christopherson wrote: > On Mon, May 23, 2022, Chao Peng wrote: > > On Fri, May 20, 2022 at 06:31:02PM +0000, Sean Christopherson wrote: > > > On Fri, May 20, 2022, Andy Lutomirski wrote: > > > > The alternative would be to have some kind of separate table or bitmap (part > > > > of the memslot?) that tells KVM whether a GPA should map to the fd. > > > > > > > > What do you all think? > > > > > > My original proposal was to have expolicit shared vs. private memslots, and punch > > > holes in KVM's memslots on conversion, but due to the way KVM (and userspace) > > > handle memslot updates, conversions would be painfully slow. That's how we ended > > > up with the current propsoal. > > > > > > But a dedicated KVM ioctl() to add/remove shared ranges would be easy to implement > > > and wouldn't necessarily even need to interact with the memslots. It could be a > > > consumer of memslots, e.g. if we wanted to disallow registering regions without an > > > associated memslot, but I think we'd want to avoid even that because things will > > > get messy during memslot updates, e.g. if dirty logging is toggled or a shared > > > memory region is temporarily removed then we wouldn't want to destroy the tracking. > > > > Even we don't tight that to memslots, that info can only be effective > > for private memslot, right? Setting this ioctl to memory ranges defined > > in a traditional non-private memslots just makes no sense, I guess we can > > comment that in the API document. > > Hrm, applying it universally would be funky, e.g. emulated MMIO would need to be > declared "shared". But, applying it selectively would arguably be worse, e.g. > letting userspace map memory into the guest as shared for a region that's registered > as private... > > On option to that mess would be to make memory shared by default, and so userspace > must declare regions that are private. Then there's no weirdness with emulated MMIO > or "legacy" memslots. > > On page fault, KVM does a lookup to see if the GPA is shared or private. If the > GPA is private, but there is no memslot or the memslot doesn't have a private fd, > KVM exits to userspace. If there's a memslot with a private fd, the shared/private > flag is used to resolve the > > And to handle the ioctl(), KVM can use kvm_zap_gfn_range(), which will bump the > notifier sequence, i.e. force the page fault to retry if the GPA may have been > (un)registered between checking the type and acquiring mmu_lock. Yeah, that makes sense. > > > > I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray > > > should be far more efficient. > > > > What about the mis-behaved guest? I don't want to design for the worst > > case, but people may raise concern on the attack from such guest. > > That's why cgroups exist. E.g. a malicious/broken L1 can similarly abuse nested > EPT/NPT to generate a large number of shadow page tables. I havn't seen we had that in KVM. Is there any plan/discussion to add that? > > > > One benefit to explicitly tracking this in KVM is that it might be useful for > > > software-only protected VMs, e.g. KVM could mark a region in the XArray as "pending" > > > based on guest hypercalls to share/unshare memory, and then complete the transaction > > > when userspace invokes the ioctl() to complete the share/unshare. > > > > OK, then this can be another field of states/flags/attributes. Let me > > dig up certain level of details: > > > > First, introduce below KVM ioctl > > > > KVM_SET_MEMORY_ATTR > > Actually, if the semantics are that userspace declares memory as private, then we > can reuse KVM_MEMORY_ENCRYPT_REG_REGION and KVM_MEMORY_ENCRYPT_UNREG_REGION. It'd > be a little gross because we'd need to slightly redefine the semantics for TDX, SNP, > and software-protected VM types, e.g. the ioctls() currently require a pre-exisitng > memslot. But I think it'd work... These existing ioctls looks good for TDX and probably SNP as well. For softrware-protected VM types, it may not be enough. Maybe for the first step we can reuse this for all hardware based solutions and invent new interface when software-protected solution gets really supported. There is semantics difference for fd-based private memory. Current above two ioctls() use userspace addreess(hva) while for fd-based it should be fd+offset, and probably it's better to use gpa in this case. Then we will need change existing semantics and break backward-compatibility. Chao > > I'll think more on this...
On Mon, May 30, 2022, Chao Peng wrote: > On Mon, May 23, 2022 at 03:22:32PM +0000, Sean Christopherson wrote: > > Actually, if the semantics are that userspace declares memory as private, then we > > can reuse KVM_MEMORY_ENCRYPT_REG_REGION and KVM_MEMORY_ENCRYPT_UNREG_REGION. It'd > > be a little gross because we'd need to slightly redefine the semantics for TDX, SNP, > > and software-protected VM types, e.g. the ioctls() currently require a pre-exisitng > > memslot. But I think it'd work... > > These existing ioctls looks good for TDX and probably SNP as well. For > softrware-protected VM types, it may not be enough. Maybe for the first > step we can reuse this for all hardware based solutions and invent new > interface when software-protected solution gets really supported. > > There is semantics difference for fd-based private memory. Current above > two ioctls() use userspace addreess(hva) while for fd-based it should be > fd+offset, and probably it's better to use gpa in this case. Then we > will need change existing semantics and break backward-compatibility. My thought was to keep the existing semantics for VMs with type==0, i.e. SEV and SEV-ES VMs. It's a bit gross, but the pinning behavior is a dead end for SNP and TDX, so it effectively needs to be deprecated anyways. I'm definitely not opposed to a new ioctl if Paolo or others think this is too awful, but burning an ioctl for this seems wasteful. Then generic KVM can do something like: case KVM_MEMORY_ENCRYPT_REG_REGION: case KVM_MEMORY_ENCRYPT_UNREG_REGION: struct kvm_enc_region region; if (!kvm_arch_vm_supports_private_memslots(kvm)) goto arch_vm_ioctl; r = -EFAULT; if (copy_from_user(®ion, argp, sizeof(region))) goto out; r = kvm_set_encrypted_region(ioctl, ®ion); break; default: arch_vm_ioctl: r = kvm_arch_vm_ioctl(filp, ioctl, arg); where common KVM provides __weak void kvm_arch_vm_supports_private_memslots(struct kvm *kvm) { return false; } and x86 overrides that to bool kvm_arch_vm_supports_private_memslots(struct kvm *kvm) { /* I can't remember what we decided on calling type '0' VMs. */ return !!kvm->vm_type; } and if someone ever wants to enable private memslot for SEV/SEV-ES guests we can always add a capability or even a new VM type. pKVM on arm can then obviously implement kvm_arch_vm_supports_private_memslots() to grab whatever identifies a pKVM VM.
On Fri, Jun 10, 2022 at 04:14:21PM +0000, Sean Christopherson wrote: > On Mon, May 30, 2022, Chao Peng wrote: > > On Mon, May 23, 2022 at 03:22:32PM +0000, Sean Christopherson wrote: > > > Actually, if the semantics are that userspace declares memory as private, then we > > > can reuse KVM_MEMORY_ENCRYPT_REG_REGION and KVM_MEMORY_ENCRYPT_UNREG_REGION. It'd > > > be a little gross because we'd need to slightly redefine the semantics for TDX, SNP, > > > and software-protected VM types, e.g. the ioctls() currently require a pre-exisitng > > > memslot. But I think it'd work... > > > > These existing ioctls looks good for TDX and probably SNP as well. For > > softrware-protected VM types, it may not be enough. Maybe for the first > > step we can reuse this for all hardware based solutions and invent new > > interface when software-protected solution gets really supported. > > > > There is semantics difference for fd-based private memory. Current above > > two ioctls() use userspace addreess(hva) while for fd-based it should be > > fd+offset, and probably it's better to use gpa in this case. Then we > > will need change existing semantics and break backward-compatibility. > > My thought was to keep the existing semantics for VMs with type==0, i.e. SEV and > SEV-ES VMs. It's a bit gross, but the pinning behavior is a dead end for SNP and > TDX, so it effectively needs to be deprecated anyways. Yes agreed. > I'm definitely not opposed > to a new ioctl if Paolo or others think this is too awful, but burning an ioctl > for this seems wasteful. Yes, I also feel confortable if it's acceptable to reuse kvm_enc_region to pass _gpa_ range for this new type. > > Then generic KVM can do something like: > > case KVM_MEMORY_ENCRYPT_REG_REGION: > case KVM_MEMORY_ENCRYPT_UNREG_REGION: > struct kvm_enc_region region; > > if (!kvm_arch_vm_supports_private_memslots(kvm)) > goto arch_vm_ioctl; > > r = -EFAULT; > if (copy_from_user(®ion, argp, sizeof(region))) > goto out; > > r = kvm_set_encrypted_region(ioctl, ®ion); > break; > default: > arch_vm_ioctl: > r = kvm_arch_vm_ioctl(filp, ioctl, arg); > > > where common KVM provides > > __weak void kvm_arch_vm_supports_private_memslots(struct kvm *kvm) > { > return false; > } I already had kvm_arch_private_mem_supported() introduced in patch-07 so that can be reused. > > and x86 overrides that to > > bool kvm_arch_vm_supports_private_memslots(struct kvm *kvm) > { > /* I can't remember what we decided on calling type '0' VMs. */ > return !!kvm->vm_type; > } > > and if someone ever wants to enable private memslot for SEV/SEV-ES guests we can > always add a capability or even a new VM type. > > pKVM on arm can then obviously implement kvm_arch_vm_supports_private_memslots() > to grab whatever identifies a pKVM VM.
On Thu, May 19, 2022, Chao Peng wrote: > @@ -653,12 +662,12 @@ struct kvm_irq_routing_table { > }; > #endif > > -#ifndef KVM_PRIVATE_MEM_SLOTS > -#define KVM_PRIVATE_MEM_SLOTS 0 > +#ifndef KVM_INTERNAL_MEM_SLOTS > +#define KVM_INTERNAL_MEM_SLOTS 0 > #endif This rename belongs in a separate patch. > #define KVM_MEM_SLOTS_NUM SHRT_MAX > -#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_PRIVATE_MEM_SLOTS) > +#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS) > > #ifndef __KVM_VCPU_MULTIPLE_ADDRESS_SPACE > static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu) > @@ -1087,9 +1096,9 @@ enum kvm_mr_change { > }; > > int kvm_set_memory_region(struct kvm *kvm, > - const struct kvm_userspace_memory_region *mem); > + const struct kvm_user_mem_region *mem); > int __kvm_set_memory_region(struct kvm *kvm, > - const struct kvm_userspace_memory_region *mem); > + const struct kvm_user_mem_region *mem); > void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot); > void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen); > int kvm_arch_prepare_memory_region(struct kvm *kvm, > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h > index e10d131edd80..28cacd3656d4 100644 > --- a/include/uapi/linux/kvm.h > +++ b/include/uapi/linux/kvm.h > @@ -103,6 +103,29 @@ struct kvm_userspace_memory_region { > __u64 userspace_addr; /* start of the userspace allocated memory */ > }; > > +struct kvm_userspace_memory_region_ext { > + struct kvm_userspace_memory_region region; > + __u64 private_offset; > + __u32 private_fd; > + __u32 pad1; > + __u64 pad2[14]; > +}; > + > +#ifdef __KERNEL__ > +/* Internal helper, the layout must match above user visible structures */ It's worth explicity calling out which structureso this aliases. And rather than add a comment about the layout needing to match that, enforce it in code. I personally wouldn't bother with an expolicit comment about the layout, IMO that's a fairly obvious implication of aliasing. /* * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext * that "unpacks" kvm_userspace_memory_region so that KVM can directly access * all fields from the top-level "extended" region. */ And I think it's in this patch that you missed a conversion to the alias, in the prototype for check_memory_region_flags() (looks like it gets fixed up later in the series). diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 0f81bf0407be..8765b334477d 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1466,7 +1466,7 @@ static void kvm_replace_memslot(struct kvm *kvm, } } -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem) +static int check_memory_region_flags(const struct kvm_user_mem_region *mem) { u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES; @@ -4514,6 +4514,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm) return fd; } +#define SANITY_CHECK_MEM_REGION_FIELD(field) \ +do { \ + BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \ + offsetof(struct kvm_userspace_memory_region, field)); \ + BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \ + sizeof_field(struct kvm_userspace_memory_region, field)); \ +} while (0) + +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field) \ +do { \ + BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \ + offsetof(struct kvm_userspace_memory_region_ext, field)); \ + BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \ + sizeof_field(struct kvm_userspace_memory_region_ext, field)); \ +} while (0) + +static void kvm_sanity_check_user_mem_region_alias(void) +{ + SANITY_CHECK_MEM_REGION_FIELD(slot); + SANITY_CHECK_MEM_REGION_FIELD(flags); + SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr); + SANITY_CHECK_MEM_REGION_FIELD(memory_size); + SANITY_CHECK_MEM_REGION_FIELD(userspace_addr); + SANITY_CHECK_MEM_REGION_EXT_FIELD(private_offset); + SANITY_CHECK_MEM_REGION_EXT_FIELD(private_fd); +} + static long kvm_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) { @@ -4541,6 +4568,8 @@ static long kvm_vm_ioctl(struct file *filp, unsigned long size; u32 flags; + kvm_sanity_check_user_mem_region_alias(); + memset(&mem, 0, sizeof(mem)); r = -EFAULT; > +struct kvm_user_mem_region { > + __u32 slot; > + __u32 flags; > + __u64 guest_phys_addr; > + __u64 memory_size; > + __u64 userspace_addr; > + __u64 private_offset; > + __u32 private_fd; > + __u32 pad1; > + __u64 pad2[14]; > +}; > +#endif > + > /* > * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace, > * other bits are reserved for kvm internal use which are defined in > @@ -110,6 +133,7 @@ struct kvm_userspace_memory_region { > */ > #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) > #define KVM_MEM_READONLY (1UL << 1) > +#define KVM_MEM_PRIVATE (1UL << 2) Hmm, KVM_MEM_PRIVATE is technically wrong now that a "private" memslot maps private and/or shared memory. Strictly speaking, we don't actually need a new flag. Valid file descriptors must be >=0, so the logic for specifying a memslot that can be converted between private and shared could be that "(int)private_fd < 0" means "not convertible", i.e. derive the flag from private_fd. And looking at the two KVM consumers of the flag, via kvm_slot_is_private(), they're both wrong. Both kvm_faultin_pfn() and kvm_mmu_max_mapping_level() should operate on the _fault_, not the slot. So it would actually be a positive to not have an easy way to query if a slot supports conversion. > /* for KVM_IRQ_LINE */ > struct kvm_irq_level { ... > + if (flags & KVM_MEM_PRIVATE) { An added bonus of dropping KVM_MEM_PRIVATE is that these checks go away. > + r = -EINVAL; > + goto out; > + } > + > + size = sizeof(struct kvm_userspace_memory_region); > + > + if (copy_from_user(&mem, argp, size)) > + goto out; > + > + r = -EINVAL; > + if ((flags ^ mem.flags) & KVM_MEM_PRIVATE) > goto out; > > - r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem); > + r = kvm_vm_ioctl_set_memory_region(kvm, &mem); > break; > } > case KVM_GET_DIRTY_LOG: { > -- > 2.25.1 >
On Fri, Jun 17, 2022, Sean Christopherson wrote: > > @@ -110,6 +133,7 @@ struct kvm_userspace_memory_region { > > */ > > #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) > > #define KVM_MEM_READONLY (1UL << 1) > > +#define KVM_MEM_PRIVATE (1UL << 2) > > Hmm, KVM_MEM_PRIVATE is technically wrong now that a "private" memslot maps private > and/or shared memory. Strictly speaking, we don't actually need a new flag. Valid > file descriptors must be >=0, so the logic for specifying a memslot that can be > converted between private and shared could be that "(int)private_fd < 0" means > "not convertible", i.e. derive the flag from private_fd. > > And looking at the two KVM consumers of the flag, via kvm_slot_is_private(), they're > both wrong. Both kvm_faultin_pfn() and kvm_mmu_max_mapping_level() should operate > on the _fault_, not the slot. So it would actually be a positive to not have an easy > way to query if a slot supports conversion. I take that back, the usage in kvm_faultin_pfn() is correct, but the names ends up being confusing because it suggests that it always faults in a private pfn. diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index b6d75016e48c..e1008f00609d 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -4045,7 +4045,7 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) return RET_PF_EMULATE; } - if (fault->is_private) { + if (kvm_slot_can_be_private(slot)) { r = kvm_faultin_pfn_private(vcpu, fault); if (r != RET_PF_CONTINUE) return r == RET_PF_FIXED ? RET_PF_CONTINUE : r; diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 31f704c83099..c5126190fb71 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -583,9 +583,9 @@ struct kvm_memory_slot { struct kvm *kvm; }; -static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot) +static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot) { - return slot && (slot->flags & KVM_MEM_PRIVATE); + return slot && !!slot->private_file; } static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
On Fri, Jun 17, 2022 at 08:52:15PM +0000, Sean Christopherson wrote: > On Thu, May 19, 2022, Chao Peng wrote: > > @@ -653,12 +662,12 @@ struct kvm_irq_routing_table { > > }; > > #endif > > > > -#ifndef KVM_PRIVATE_MEM_SLOTS > > -#define KVM_PRIVATE_MEM_SLOTS 0 > > +#ifndef KVM_INTERNAL_MEM_SLOTS > > +#define KVM_INTERNAL_MEM_SLOTS 0 > > #endif > > This rename belongs in a separate patch. Will separate it out, thanks. > > > #define KVM_MEM_SLOTS_NUM SHRT_MAX > > -#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_PRIVATE_MEM_SLOTS) > > +#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS) > > > > #ifndef __KVM_VCPU_MULTIPLE_ADDRESS_SPACE > > static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu) > > @@ -1087,9 +1096,9 @@ enum kvm_mr_change { > > }; > > > > int kvm_set_memory_region(struct kvm *kvm, > > - const struct kvm_userspace_memory_region *mem); > > + const struct kvm_user_mem_region *mem); > > int __kvm_set_memory_region(struct kvm *kvm, > > - const struct kvm_userspace_memory_region *mem); > > + const struct kvm_user_mem_region *mem); > > void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot); > > void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen); > > int kvm_arch_prepare_memory_region(struct kvm *kvm, > > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h > > index e10d131edd80..28cacd3656d4 100644 > > --- a/include/uapi/linux/kvm.h > > +++ b/include/uapi/linux/kvm.h > > @@ -103,6 +103,29 @@ struct kvm_userspace_memory_region { > > __u64 userspace_addr; /* start of the userspace allocated memory */ > > }; > > > > +struct kvm_userspace_memory_region_ext { > > + struct kvm_userspace_memory_region region; > > + __u64 private_offset; > > + __u32 private_fd; > > + __u32 pad1; > > + __u64 pad2[14]; > > +}; > > + > > +#ifdef __KERNEL__ > > +/* Internal helper, the layout must match above user visible structures */ > > It's worth explicity calling out which structureso this aliases. And rather than > add a comment about the layout needing to match that, enforce it in code. I > personally wouldn't bother with an expolicit comment about the layout, IMO that's > a fairly obvious implication of aliasing. > > /* > * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext > * that "unpacks" kvm_userspace_memory_region so that KVM can directly access > * all fields from the top-level "extended" region. > */ > Thanks. > > And I think it's in this patch that you missed a conversion to the alias, in the > prototype for check_memory_region_flags() (looks like it gets fixed up later in > the series). > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 0f81bf0407be..8765b334477d 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -1466,7 +1466,7 @@ static void kvm_replace_memslot(struct kvm *kvm, > } > } > > -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem) > +static int check_memory_region_flags(const struct kvm_user_mem_region *mem) > { > u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES; > > @@ -4514,6 +4514,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm) > return fd; > } > > +#define SANITY_CHECK_MEM_REGION_FIELD(field) \ > +do { \ > + BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \ > + offsetof(struct kvm_userspace_memory_region, field)); \ > + BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \ > + sizeof_field(struct kvm_userspace_memory_region, field)); \ > +} while (0) > + > +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field) \ > +do { \ > + BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \ > + offsetof(struct kvm_userspace_memory_region_ext, field)); \ > + BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \ > + sizeof_field(struct kvm_userspace_memory_region_ext, field)); \ > +} while (0) > + > +static void kvm_sanity_check_user_mem_region_alias(void) > +{ > + SANITY_CHECK_MEM_REGION_FIELD(slot); > + SANITY_CHECK_MEM_REGION_FIELD(flags); > + SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr); > + SANITY_CHECK_MEM_REGION_FIELD(memory_size); > + SANITY_CHECK_MEM_REGION_FIELD(userspace_addr); > + SANITY_CHECK_MEM_REGION_EXT_FIELD(private_offset); > + SANITY_CHECK_MEM_REGION_EXT_FIELD(private_fd); > +} > + > static long kvm_vm_ioctl(struct file *filp, > unsigned int ioctl, unsigned long arg) > { > @@ -4541,6 +4568,8 @@ static long kvm_vm_ioctl(struct file *filp, > unsigned long size; > u32 flags; > > + kvm_sanity_check_user_mem_region_alias(); > + > memset(&mem, 0, sizeof(mem)); > > r = -EFAULT; > > > +struct kvm_user_mem_region { > > + __u32 slot; > > + __u32 flags; > > + __u64 guest_phys_addr; > > + __u64 memory_size; > > + __u64 userspace_addr; > > + __u64 private_offset; > > + __u32 private_fd; > > + __u32 pad1; > > + __u64 pad2[14]; > > +}; > > +#endif > > + > > /* > > * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace, > > * other bits are reserved for kvm internal use which are defined in > > @@ -110,6 +133,7 @@ struct kvm_userspace_memory_region { > > */ > > #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) > > #define KVM_MEM_READONLY (1UL << 1) > > +#define KVM_MEM_PRIVATE (1UL << 2) > > Hmm, KVM_MEM_PRIVATE is technically wrong now that a "private" memslot maps private > and/or shared memory. Strictly speaking, we don't actually need a new flag. Valid > file descriptors must be >=0, so the logic for specifying a memslot that can be > converted between private and shared could be that "(int)private_fd < 0" means > "not convertible", i.e. derive the flag from private_fd. I think a flag is still needed, the problem is private_fd can be safely accessed only when this flag is set, e.g. without this flag, we can't copy_from_user these new fields since they don't exist for previous kvm_userspace_memory_region callers. > > And looking at the two KVM consumers of the flag, via kvm_slot_is_private(), they're > both wrong. Both kvm_faultin_pfn() and kvm_mmu_max_mapping_level() should operate > on the _fault_, not the slot. So it would actually be a positive to not have an easy > way to query if a slot supports conversion. > > > /* for KVM_IRQ_LINE */ > > struct kvm_irq_level { > > ... > > > + if (flags & KVM_MEM_PRIVATE) { > > An added bonus of dropping KVM_MEM_PRIVATE is that these checks go away. > > > + r = -EINVAL; > > + goto out; > > + } > > + > > + size = sizeof(struct kvm_userspace_memory_region); > > + > > + if (copy_from_user(&mem, argp, size)) > > + goto out; > > + > > + r = -EINVAL; > > + if ((flags ^ mem.flags) & KVM_MEM_PRIVATE) > > goto out; > > > > - r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem); > > + r = kvm_vm_ioctl_set_memory_region(kvm, &mem); > > break; > > } > > case KVM_GET_DIRTY_LOG: { > > -- > > 2.25.1 > >
On Fri, Jun 17, 2022 at 09:27:25PM +0000, Sean Christopherson wrote: > On Fri, Jun 17, 2022, Sean Christopherson wrote: > > > @@ -110,6 +133,7 @@ struct kvm_userspace_memory_region { > > > */ > > > #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) > > > #define KVM_MEM_READONLY (1UL << 1) > > > +#define KVM_MEM_PRIVATE (1UL << 2) > > > > Hmm, KVM_MEM_PRIVATE is technically wrong now that a "private" memslot maps private > > and/or shared memory. Strictly speaking, we don't actually need a new flag. Valid > > file descriptors must be >=0, so the logic for specifying a memslot that can be > > converted between private and shared could be that "(int)private_fd < 0" means > > "not convertible", i.e. derive the flag from private_fd. > > > > And looking at the two KVM consumers of the flag, via kvm_slot_is_private(), they're > > both wrong. Both kvm_faultin_pfn() and kvm_mmu_max_mapping_level() should operate > > on the _fault_, not the slot. So it would actually be a positive to not have an easy > > way to query if a slot supports conversion. > > I take that back, the usage in kvm_faultin_pfn() is correct, but the names ends > up being confusing because it suggests that it always faults in a private pfn. Make sense, will change the naming, thanks. > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index b6d75016e48c..e1008f00609d 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -4045,7 +4045,7 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) > return RET_PF_EMULATE; > } > > - if (fault->is_private) { > + if (kvm_slot_can_be_private(slot)) { > r = kvm_faultin_pfn_private(vcpu, fault); > if (r != RET_PF_CONTINUE) > return r == RET_PF_FIXED ? RET_PF_CONTINUE : r; > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index 31f704c83099..c5126190fb71 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -583,9 +583,9 @@ struct kvm_memory_slot { > struct kvm *kvm; > }; > > -static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot) > +static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot) > { > - return slot && (slot->flags & KVM_MEM_PRIVATE); > + return slot && !!slot->private_file; > } > > static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
On Fri, May 20, 2022 at 06:31:02PM +0000, Sean Christopherson wrote: > On Fri, May 20, 2022, Andy Lutomirski wrote: > > The alternative would be to have some kind of separate table or bitmap (part > > of the memslot?) that tells KVM whether a GPA should map to the fd. > > > > What do you all think? > > My original proposal was to have expolicit shared vs. private memslots, and punch > holes in KVM's memslots on conversion, but due to the way KVM (and userspace) > handle memslot updates, conversions would be painfully slow. That's how we ended > up with the current propsoal. > > But a dedicated KVM ioctl() to add/remove shared ranges would be easy to implement > and wouldn't necessarily even need to interact with the memslots. It could be a > consumer of memslots, e.g. if we wanted to disallow registering regions without an > associated memslot, but I think we'd want to avoid even that because things will > get messy during memslot updates, e.g. if dirty logging is toggled or a shared > memory region is temporarily removed then we wouldn't want to destroy the tracking. > > I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray > should be far more efficient. > > One benefit to explicitly tracking this in KVM is that it might be useful for > software-only protected VMs, e.g. KVM could mark a region in the XArray as "pending" > based on guest hypercalls to share/unshare memory, and then complete the transaction > when userspace invokes the ioctl() to complete the share/unshare. Another upside to implementing a KVM ioctl is basically the reverse of the discussion around avoiding double-allocations: *supporting* double-allocations. One thing I noticed while testing SNP+UPM support is a fairly dramatic slow-down with how it handles OVMF, which does some really nasty stuff with DMA where it takes 1 or 2 pages and flips them between shared/private on every transaction. Obviously that's not ideal and should be fixed directly at some point, but it's something that exists in the wild and might not be the only such instance where we need to deal with that sort of usage pattern. With the current implementation, one option I had to address this was to disable hole-punching in QEMU when doing shared->private conversions: Boot time from 1GB guest: SNP: 32s SNP+UPM: 1m43s SNP+UPM (disable shared discard): 1m08s Of course, we don't have the option of disabling discard/hole-punching for private memory to see if we get similar gains there, since that also doubles as the interface for doing private->shared conversions. A separate KVM ioctl to decouple these 2 things would allow for that, and allow for a way for userspace to implement things like batched/lazy-discard of previously-converted pages to deal with cases like these. Another motivator for these separate ioctl is that, since we're considering 'out-of-band' interactions with private memfd where userspace might erroneously/inadvertently do things like double allocations, another thing it might do is pre-allocating pages in the private memfd prior to associating the memfd with a private memslot. Since the notifiers aren't registered until that point, any associated callbacks that would normally need to be done as part of those fallocate() notification would be missed unless we do something like 'replay' all the notifications once the private memslot is registered and associating with a memfile notifier. But that seems a bit ugly, and I'm not sure how well that would work. This also seems to hint at this additional 'conversion' state being something that should be owned and managed directly by KVM rather than hooking into the allocations. It would also nicely solve the question of how to handle in-place encryption, since unlike userspace, KVM is perfectly capable of copying data from shared->private prior to conversion / guest start, and disallowing such things afterward. Would just need an extra flag basically.
On Thu, Jun 23, 2022 at 05:59:49PM -0500, Michael Roth wrote: > On Fri, May 20, 2022 at 06:31:02PM +0000, Sean Christopherson wrote: > > On Fri, May 20, 2022, Andy Lutomirski wrote: > > > The alternative would be to have some kind of separate table or bitmap (part > > > of the memslot?) that tells KVM whether a GPA should map to the fd. > > > > > > What do you all think? > > > > My original proposal was to have expolicit shared vs. private memslots, and punch > > holes in KVM's memslots on conversion, but due to the way KVM (and userspace) > > handle memslot updates, conversions would be painfully slow. That's how we ended > > up with the current propsoal. > > > > But a dedicated KVM ioctl() to add/remove shared ranges would be easy to implement > > and wouldn't necessarily even need to interact with the memslots. It could be a > > consumer of memslots, e.g. if we wanted to disallow registering regions without an > > associated memslot, but I think we'd want to avoid even that because things will > > get messy during memslot updates, e.g. if dirty logging is toggled or a shared > > memory region is temporarily removed then we wouldn't want to destroy the tracking. > > > > I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray > > should be far more efficient. > > > > One benefit to explicitly tracking this in KVM is that it might be useful for > > software-only protected VMs, e.g. KVM could mark a region in the XArray as "pending" > > based on guest hypercalls to share/unshare memory, and then complete the transaction > > when userspace invokes the ioctl() to complete the share/unshare. > > Another upside to implementing a KVM ioctl is basically the reverse of the > discussion around avoiding double-allocations: *supporting* double-allocations. > > One thing I noticed while testing SNP+UPM support is a fairly dramatic > slow-down with how it handles OVMF, which does some really nasty stuff > with DMA where it takes 1 or 2 pages and flips them between > shared/private on every transaction. Obviously that's not ideal and > should be fixed directly at some point, but it's something that exists in the > wild and might not be the only such instance where we need to deal with that > sort of usage pattern. > > With the current implementation, one option I had to address this was to > disable hole-punching in QEMU when doing shared->private conversions: > > Boot time from 1GB guest: > SNP: 32s > SNP+UPM: 1m43s > SNP+UPM (disable shared discard): 1m08s > > Of course, we don't have the option of disabling discard/hole-punching > for private memory to see if we get similar gains there, since that also > doubles as the interface for doing private->shared conversions. Private should be the same, minus time consumed for private memory, the data should be close to SNP case. You can't try that in current version due to we rely on the existence of the private page to tell a page is private. > A separate > KVM ioctl to decouple these 2 things would allow for that, and allow for a > way for userspace to implement things like batched/lazy-discard of > previously-converted pages to deal with cases like these. The planned ioctl includes two responsibilities: - Mark the range as private/shared - Zap the existing SLPT mapping for the range Whether doing the hole-punching or not on the fd is unrelated to this ioctl, userspace has freedom to do that or not. Since we don't reply on the fact that private memoy should have been allocated, we can support lazy faulting and don't need explicit fallocate(). That means, whether the memory is discarded or not in the memory backing store is not required by KVM, but be a userspace option. > > Another motivator for these separate ioctl is that, since we're considering > 'out-of-band' interactions with private memfd where userspace might > erroneously/inadvertently do things like double allocations, another thing it > might do is pre-allocating pages in the private memfd prior to associating > the memfd with a private memslot. Since the notifiers aren't registered until > that point, any associated callbacks that would normally need to be done as > part of those fallocate() notification would be missed unless we do something > like 'replay' all the notifications once the private memslot is registered and > associating with a memfile notifier. But that seems a bit ugly, and I'm not > sure how well that would work. This also seems to hint at this additional > 'conversion' state being something that should be owned and managed directly > by KVM rather than hooking into the allocations. Right, once we move the private/shared state into KVM then we don't rely on those callbacks so the 'replay' thing is unneeded. fallocate() notification is useless for sure, invalidate() is likely still needed, just like the invalidate for mmu_notifier to bump the mmu_seq and do the zap. > > It would also nicely solve the question of how to handle in-place > encryption, since unlike userspace, KVM is perfectly capable of copying > data from shared->private prior to conversion / guest start, and > disallowing such things afterward. Would just need an extra flag basically. Agree it's possible to do additional copy during the conversion but I'm not so confident this is urgent and the right API. Currently TDX does not have this need. Maybe as the first step just add the conversion itself. Adding additional feature like this can always be possible whenever we are clear. Thanks, Chao
On Fri, Jun 24, 2022 at 04:54:26PM +0800, Chao Peng wrote: > On Thu, Jun 23, 2022 at 05:59:49PM -0500, Michael Roth wrote: > > On Fri, May 20, 2022 at 06:31:02PM +0000, Sean Christopherson wrote: > > > On Fri, May 20, 2022, Andy Lutomirski wrote: > > > > The alternative would be to have some kind of separate table or bitmap (part > > > > of the memslot?) that tells KVM whether a GPA should map to the fd. > > > > > > > > What do you all think? > > > > > > My original proposal was to have expolicit shared vs. private memslots, and punch > > > holes in KVM's memslots on conversion, but due to the way KVM (and userspace) > > > handle memslot updates, conversions would be painfully slow. That's how we ended > > > up with the current propsoal. > > > > > > But a dedicated KVM ioctl() to add/remove shared ranges would be easy to implement > > > and wouldn't necessarily even need to interact with the memslots. It could be a > > > consumer of memslots, e.g. if we wanted to disallow registering regions without an > > > associated memslot, but I think we'd want to avoid even that because things will > > > get messy during memslot updates, e.g. if dirty logging is toggled or a shared > > > memory region is temporarily removed then we wouldn't want to destroy the tracking. > > > > > > I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray > > > should be far more efficient. > > > > > > One benefit to explicitly tracking this in KVM is that it might be useful for > > > software-only protected VMs, e.g. KVM could mark a region in the XArray as "pending" > > > based on guest hypercalls to share/unshare memory, and then complete the transaction > > > when userspace invokes the ioctl() to complete the share/unshare. > > > > Another upside to implementing a KVM ioctl is basically the reverse of the > > discussion around avoiding double-allocations: *supporting* double-allocations. > > > > One thing I noticed while testing SNP+UPM support is a fairly dramatic > > slow-down with how it handles OVMF, which does some really nasty stuff > > with DMA where it takes 1 or 2 pages and flips them between > > shared/private on every transaction. Obviously that's not ideal and > > should be fixed directly at some point, but it's something that exists in the > > wild and might not be the only such instance where we need to deal with that > > sort of usage pattern. > > > > With the current implementation, one option I had to address this was to > > disable hole-punching in QEMU when doing shared->private conversions: > > > > Boot time from 1GB guest: > > SNP: 32s > > SNP+UPM: 1m43s > > SNP+UPM (disable shared discard): 1m08s > > > > Of course, we don't have the option of disabling discard/hole-punching > > for private memory to see if we get similar gains there, since that also > > doubles as the interface for doing private->shared conversions. > > Private should be the same, minus time consumed for private memory, the > data should be close to SNP case. You can't try that in current version > due to we rely on the existence of the private page to tell a page is > private. > > > A separate > > KVM ioctl to decouple these 2 things would allow for that, and allow for a > > way for userspace to implement things like batched/lazy-discard of > > previously-converted pages to deal with cases like these. > > The planned ioctl includes two responsibilities: > - Mark the range as private/shared > - Zap the existing SLPT mapping for the range > > Whether doing the hole-punching or not on the fd is unrelated to this > ioctl, userspace has freedom to do that or not. Since we don't reply on > the fact that private memoy should have been allocated, we can support > lazy faulting and don't need explicit fallocate(). That means, whether > the memory is discarded or not in the memory backing store is not > required by KVM, but be a userspace option. Nice, that sounds promising. > > > > > Another motivator for these separate ioctl is that, since we're considering > > 'out-of-band' interactions with private memfd where userspace might > > erroneously/inadvertently do things like double allocations, another thing it > > might do is pre-allocating pages in the private memfd prior to associating > > the memfd with a private memslot. Since the notifiers aren't registered until > > that point, any associated callbacks that would normally need to be done as > > part of those fallocate() notification would be missed unless we do something > > like 'replay' all the notifications once the private memslot is registered and > > associating with a memfile notifier. But that seems a bit ugly, and I'm not > > sure how well that would work. This also seems to hint at this additional > > 'conversion' state being something that should be owned and managed directly > > by KVM rather than hooking into the allocations. > > Right, once we move the private/shared state into KVM then we don't rely > on those callbacks so the 'replay' thing is unneeded. fallocate() > notification is useless for sure, invalidate() is likely still needed, > just like the invalidate for mmu_notifier to bump the mmu_seq and do the > zap. Ok, yah, makes sense that we'd still up needing the invalidation hooks. > > > > > It would also nicely solve the question of how to handle in-place > > encryption, since unlike userspace, KVM is perfectly capable of copying > > data from shared->private prior to conversion / guest start, and > > disallowing such things afterward. Would just need an extra flag basically. > > Agree it's possible to do additional copy during the conversion but I'm > not so confident this is urgent and the right API. Currently TDX does > not have this need. Maybe as the first step just add the conversion > itself. Adding additional feature like this can always be possible > whenever we are clear. That seems fair. In the meantime we can adopt the approach proposed by Sean and Vishal[1] and handle it directly in the relevant SNP KVM ioctls. If we end up keeping that approach we'll probably want to make sure these KVM-driven 'implicit' conversions are documented in the KVM/SNP API so that userspace can account for it in it's view of what's private/shared. In this case at least it's pretty obvious, just thinking of when other archs and VMMs utilizing this more. Thanks! -Mike [1] https://lore.kernel.org/kvm/20220524205646.1798325-4-vannapurve@google.com/T/#m1e9bb782b1bea66c36ae7c4c9f4f0c35c2d7e338 > > Thanks, > Chao
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 23baf7fce038..b959445b64cc 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -1311,7 +1311,7 @@ yet and must be cleared on entry. :Capability: KVM_CAP_USER_MEMORY :Architectures: all :Type: vm ioctl -:Parameters: struct kvm_userspace_memory_region (in) +:Parameters: struct kvm_userspace_memory_region(_ext) (in) :Returns: 0 on success, -1 on error :: @@ -1324,9 +1324,18 @@ yet and must be cleared on entry. __u64 userspace_addr; /* start of the userspace allocated memory */ }; + struct kvm_userspace_memory_region_ext { + struct kvm_userspace_memory_region region; + __u64 private_offset; + __u32 private_fd; + __u32 pad1; + __u64 pad2[14]; +}; + /* for kvm_memory_region::flags */ #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) #define KVM_MEM_READONLY (1UL << 1) + #define KVM_MEM_PRIVATE (1UL << 2) This ioctl allows the user to create, modify or delete a guest physical memory slot. Bits 0-15 of "slot" specify the slot id and this value @@ -1357,12 +1366,27 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr be identical. This allows large pages in the guest to be backed by large pages in the host. -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and -KVM_MEM_READONLY. The former can be set to instruct KVM to keep track of -writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to -use it. The latter can be set, if KVM_CAP_READONLY_MEM capability allows it, -to make a new slot read-only. In this case, writes to this memory will be -posted to userspace as KVM_EXIT_MMIO exits. +kvm_userspace_memory_region_ext includes all the kvm_userspace_memory_region +fields. It also includes additional fields for some specific features. See +below description of flags field for more information. It's recommended to use +kvm_userspace_memory_region_ext in new userspace code. + +The flags field supports below flags: + +- KVM_MEM_LOG_DIRTY_PAGES can be set to instruct KVM to keep track of writes to + memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to use it. + +- KVM_MEM_READONLY can be set, if KVM_CAP_READONLY_MEM capability allows it, to + make a new slot read-only. In this case, writes to this memory will be posted + to userspace as KVM_EXIT_MMIO exits. + +- KVM_MEM_PRIVATE can be set to indicate a new slot has private memory backed by + a file descirptor(fd) and the content of the private memory is invisible to + userspace. In this case, userspace should use private_fd/private_offset in + kvm_userspace_memory_region_ext to instruct KVM to provide private memory to + guest. Userspace should guarantee not to map the same pfn indicated by + private_fd/private_offset to different gfns with multiple memslots. Failed to + do this may result undefined behavior. When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of the memory region are automatically reflected into the guest. For example, an diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h index 717716cc51c5..45a978c805bc 100644 --- a/arch/mips/include/asm/kvm_host.h +++ b/arch/mips/include/asm/kvm_host.h @@ -85,7 +85,7 @@ #define KVM_MAX_VCPUS 16 /* memory slots that does not exposed to userspace */ -#define KVM_PRIVATE_MEM_SLOTS 0 +#define KVM_INTERNAL_MEM_SLOTS 0 #define KVM_HALT_POLL_NS_DEFAULT 500000 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index c59fea4bdb6e..3f5e81ef77c8 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -53,7 +53,7 @@ #define KVM_MAX_VCPU_IDS (KVM_MAX_VCPUS * KVM_VCPU_ID_RATIO) /* memory slots that are not exposed to userspace */ -#define KVM_PRIVATE_MEM_SLOTS 3 +#define KVM_INTERNAL_MEM_SLOTS 3 #define KVM_HALT_POLL_NS_DEFAULT 200000 diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index e3cbd7706136..1f160801e2a7 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -48,6 +48,8 @@ config KVM select SRCU select INTERVAL_TREE select HAVE_KVM_PM_NOTIFIER if PM + select HAVE_KVM_PRIVATE_MEM if X86_64 + select MEMFILE_NOTIFIER if HAVE_KVM_PRIVATE_MEM help Support hosting fully virtualized guest machines using hardware virtualization extensions. You will need a fairly recent diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 8ee8c91fa762..d873ae56b01a 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -11910,7 +11910,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, } for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { - struct kvm_userspace_memory_region m; + struct kvm_user_mem_region m; m.slot = id | (i << 16); m.flags = 0; diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index f94f72bbd2d3..3fd168972ecd 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -44,6 +44,7 @@ #include <asm/kvm_host.h> #include <linux/kvm_dirty_ring.h> +#include <linux/memfile_notifier.h> #ifndef KVM_MAX_VCPU_IDS #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS @@ -573,8 +574,16 @@ struct kvm_memory_slot { u32 flags; short id; u16 as_id; + struct file *private_file; + loff_t private_offset; + struct memfile_notifier notifier; }; +static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot) +{ + return slot && (slot->flags & KVM_MEM_PRIVATE); +} + static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot) { return slot->flags & KVM_MEM_LOG_DIRTY_PAGES; @@ -653,12 +662,12 @@ struct kvm_irq_routing_table { }; #endif -#ifndef KVM_PRIVATE_MEM_SLOTS -#define KVM_PRIVATE_MEM_SLOTS 0 +#ifndef KVM_INTERNAL_MEM_SLOTS +#define KVM_INTERNAL_MEM_SLOTS 0 #endif #define KVM_MEM_SLOTS_NUM SHRT_MAX -#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_PRIVATE_MEM_SLOTS) +#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS) #ifndef __KVM_VCPU_MULTIPLE_ADDRESS_SPACE static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu) @@ -1087,9 +1096,9 @@ enum kvm_mr_change { }; int kvm_set_memory_region(struct kvm *kvm, - const struct kvm_userspace_memory_region *mem); + const struct kvm_user_mem_region *mem); int __kvm_set_memory_region(struct kvm *kvm, - const struct kvm_userspace_memory_region *mem); + const struct kvm_user_mem_region *mem); void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot); void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen); int kvm_arch_prepare_memory_region(struct kvm *kvm, diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index e10d131edd80..28cacd3656d4 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -103,6 +103,29 @@ struct kvm_userspace_memory_region { __u64 userspace_addr; /* start of the userspace allocated memory */ }; +struct kvm_userspace_memory_region_ext { + struct kvm_userspace_memory_region region; + __u64 private_offset; + __u32 private_fd; + __u32 pad1; + __u64 pad2[14]; +}; + +#ifdef __KERNEL__ +/* Internal helper, the layout must match above user visible structures */ +struct kvm_user_mem_region { + __u32 slot; + __u32 flags; + __u64 guest_phys_addr; + __u64 memory_size; + __u64 userspace_addr; + __u64 private_offset; + __u32 private_fd; + __u32 pad1; + __u64 pad2[14]; +}; +#endif + /* * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace, * other bits are reserved for kvm internal use which are defined in @@ -110,6 +133,7 @@ struct kvm_userspace_memory_region { */ #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) #define KVM_MEM_READONLY (1UL << 1) +#define KVM_MEM_PRIVATE (1UL << 2) /* for KVM_IRQ_LINE */ struct kvm_irq_level { diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig index a8c5c9f06b3c..ccaff13cc5b8 100644 --- a/virt/kvm/Kconfig +++ b/virt/kvm/Kconfig @@ -72,3 +72,6 @@ config KVM_XFER_TO_GUEST_WORK config HAVE_KVM_PM_NOTIFIER bool + +config HAVE_KVM_PRIVATE_MEM + bool diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index e089db822c12..db9d39a2d3a6 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1830,7 +1830,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id, * Must be called holding kvm->slots_lock for write. */ int __kvm_set_memory_region(struct kvm *kvm, - const struct kvm_userspace_memory_region *mem) + const struct kvm_user_mem_region *mem) { struct kvm_memory_slot *old, *new; struct kvm_memslots *slots; @@ -1934,7 +1934,7 @@ int __kvm_set_memory_region(struct kvm *kvm, EXPORT_SYMBOL_GPL(__kvm_set_memory_region); int kvm_set_memory_region(struct kvm *kvm, - const struct kvm_userspace_memory_region *mem) + const struct kvm_user_mem_region *mem) { int r; @@ -1946,7 +1946,7 @@ int kvm_set_memory_region(struct kvm *kvm, EXPORT_SYMBOL_GPL(kvm_set_memory_region); static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm, - struct kvm_userspace_memory_region *mem) + struct kvm_user_mem_region *mem) { if ((u16)mem->slot >= KVM_USER_MEM_SLOTS) return -EINVAL; @@ -4500,14 +4500,33 @@ static long kvm_vm_ioctl(struct file *filp, break; } case KVM_SET_USER_MEMORY_REGION: { - struct kvm_userspace_memory_region kvm_userspace_mem; + struct kvm_user_mem_region mem; + unsigned long size; + u32 flags; + + memset(&mem, 0, sizeof(mem)); r = -EFAULT; - if (copy_from_user(&kvm_userspace_mem, argp, - sizeof(kvm_userspace_mem))) + + if (get_user(flags, + (u32 __user *)(argp + offsetof(typeof(mem), flags)))) + goto out; + + if (flags & KVM_MEM_PRIVATE) { + r = -EINVAL; + goto out; + } + + size = sizeof(struct kvm_userspace_memory_region); + + if (copy_from_user(&mem, argp, size)) + goto out; + + r = -EINVAL; + if ((flags ^ mem.flags) & KVM_MEM_PRIVATE) goto out; - r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem); + r = kvm_vm_ioctl_set_memory_region(kvm, &mem); break; } case KVM_GET_DIRTY_LOG: {