diff mbox series

[v6,4/8] KVM: Extend the memslot to support fd-based private memory

Message ID 20220519153713.819591-5-chao.p.peng@linux.intel.com (mailing list archive)
State New, archived
Headers show
Series KVM: mm: fd-based approach for supporting KVM guest private memory | expand

Commit Message

Chao Peng May 19, 2022, 3:37 p.m. UTC
Extend the memslot definition to provide guest private memory through a
file descriptor(fd) instead of userspace_addr(hva). Such guest private
memory(fd) may never be mapped into userspace so no userspace_addr(hva)
can be used. Instead add another two new fields
(private_fd/private_offset), plus the existing memory_size to represent
the private memory range. Such memslot can still have the existing
userspace_addr(hva). When use, a single memslot can maintain both
private memory through private fd(private_fd/private_offset) and shared
memory through hva(userspace_addr). A GPA is considered private by KVM
if the memslot has private fd and that corresponding page in the private
fd is populated, otherwise, it's shared.

Since there is no userspace mapping for private fd so we cannot
rely on get_user_pages() to get the pfn in KVM, instead we add a new
memfile_notifier in the memslot and rely on it to get pfn by interacting
its callbacks from memory backing store with the fd/offset.

This new extension is indicated by a new flag KVM_MEM_PRIVATE. At
compile time, a new config HAVE_KVM_PRIVATE_MEM is added and right now
it is selected on X86_64 for Intel TDX usage.

To make KVM easy, internally we use a binary compatible struct
kvm_user_mem_region to handle both the normal and the '_ext' variants.

Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 Documentation/virt/kvm/api.rst   | 38 ++++++++++++++++++++++++++------
 arch/mips/include/asm/kvm_host.h |  2 +-
 arch/x86/include/asm/kvm_host.h  |  2 +-
 arch/x86/kvm/Kconfig             |  2 ++
 arch/x86/kvm/x86.c               |  2 +-
 include/linux/kvm_host.h         | 19 +++++++++++-----
 include/uapi/linux/kvm.h         | 24 ++++++++++++++++++++
 virt/kvm/Kconfig                 |  3 +++
 virt/kvm/kvm_main.c              | 33 +++++++++++++++++++++------
 9 files changed, 103 insertions(+), 22 deletions(-)

Comments

Andy Lutomirski May 20, 2022, 5:57 p.m. UTC | #1
On 5/19/22 08:37, Chao Peng wrote:
> Extend the memslot definition to provide guest private memory through a
> file descriptor(fd) instead of userspace_addr(hva). Such guest private
> memory(fd) may never be mapped into userspace so no userspace_addr(hva)
> can be used. Instead add another two new fields
> (private_fd/private_offset), plus the existing memory_size to represent
> the private memory range. Such memslot can still have the existing
> userspace_addr(hva). When use, a single memslot can maintain both
> private memory through private fd(private_fd/private_offset) and shared
> memory through hva(userspace_addr). A GPA is considered private by KVM
> if the memslot has private fd and that corresponding page in the private
> fd is populated, otherwise, it's shared.
> 


So this is a strange API and, IMO, a layering violation.  I want to make 
sure that we're all actually on board with making this a permanent part 
of the Linux API.  Specifically, we end up with a multiplexing situation 
as you have described. For a given GPA, there are *two* possible host 
backings: an fd-backed one (from the fd, which is private for now might 
might end up potentially shared depending on future extensions) and a 
VMA-backed one.  The selection of which one backs the address is made 
internally by whatever backs the fd.

This, IMO, a clear layering violation.  Normally, an fd has an 
associated address space, and pages in that address space can have 
contents, can be holes that appear to contain all zeros, or could have 
holes that are inaccessible.  If you try to access a hole, you get 
whatever is in the hole.

But now, with this patchset, the fd is more of an overlay and you get 
*something else* if you try to access through the hole.

This results in operations on the fd bubbling up to the KVM mapping in 
what is, IMO, a strange way.  If the user punches a hole, KVM has to 
modify its mappings such that the GPA goes to whatever VMA may be there. 
  (And update the RMP, the hypervisor's tables, or whatever else might 
actually control privateness.)  Conversely, if the user does fallocate 
to fill a hole, the guest mapping *to an unrelated page* has to be 
zapped so that the fd's page shows up.  And the RMP needs updating, etc.

I am lukewarm on this for a few reasons.

1. This is weird.  AFAIK nothing else works like this.  Obviously this 
is subjecting, but "weird" and "layering violation" sometimes translate 
to "problematic locking".

2. fd-backed private memory can't have normal holes.  If I make a memfd, 
punch a hole in it, and mmap(MAP_SHARED) it, I end up with a page that 
reads as zero.  If I write to it, the page gets allocated.  But with 
this new mechanism, if I punch a hole and put it in a memslot, reads and 
writes go somewhere else.  So what if I actually wanted lazily allocated 
private zeros?

2b. For a hypothetical future extension in which an fd can also have 
shared pages (for conversion, for example, or simply because the fd 
backing might actually be more efficient than indirecting through VMAs 
and therefore get used for shared memory or entirely-non-confidential 
VMs), lazy fd-backed zeros sound genuinely useful.

3. TDX hardware capability is not fully exposed.  TDX can have a private 
page and a shared page at GPAs that differ only by the private bit. 
Sure, no one plans to use this today, but baking this into the user ABI 
throws away half the potential address space.

3b. Any software solution that works like TDX (which IMO seems like an 
eminently reasonable design to me) has the same issue.


The alternative would be to have some kind of separate table or bitmap 
(part of the memslot?) that tells KVM whether a GPA should map to the fd.

What do you all think?
Sean Christopherson May 20, 2022, 6:31 p.m. UTC | #2
On Fri, May 20, 2022, Andy Lutomirski wrote:
> The alternative would be to have some kind of separate table or bitmap (part
> of the memslot?) that tells KVM whether a GPA should map to the fd.
> 
> What do you all think?

My original proposal was to have expolicit shared vs. private memslots, and punch
holes in KVM's memslots on conversion, but due to the way KVM (and userspace)
handle memslot updates, conversions would be painfully slow.  That's how we ended
up with the current propsoal.

But a dedicated KVM ioctl() to add/remove shared ranges would be easy to implement
and wouldn't necessarily even need to interact with the memslots.  It could be a
consumer of memslots, e.g. if we wanted to disallow registering regions without an
associated memslot, but I think we'd want to avoid even that because things will
get messy during memslot updates, e.g. if dirty logging is toggled or a shared
memory region is temporarily removed then we wouldn't want to destroy the tracking.

I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray
should be far more efficient.

One benefit to explicitly tracking this in KVM is that it might be useful for
software-only protected VMs, e.g. KVM could mark a region in the XArray as "pending"
based on guest hypercalls to share/unshare memory, and then complete the transaction
when userspace invokes the ioctl() to complete the share/unshare.
Andy Lutomirski May 22, 2022, 4:03 a.m. UTC | #3
On Fri, May 20, 2022, at 11:31 AM, Sean Christopherson wrote:

> But a dedicated KVM ioctl() to add/remove shared ranges would be easy 
> to implement
> and wouldn't necessarily even need to interact with the memslots.  It 
> could be a
> consumer of memslots, e.g. if we wanted to disallow registering regions 
> without an
> associated memslot, but I think we'd want to avoid even that because 
> things will
> get messy during memslot updates, e.g. if dirty logging is toggled or a 
> shared
> memory region is temporarily removed then we wouldn't want to destroy 
> the tracking.
>
> I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray
> should be far more efficient.
>
> One benefit to explicitly tracking this in KVM is that it might be 
> useful for
> software-only protected VMs, e.g. KVM could mark a region in the XArray 
> as "pending"
> based on guest hypercalls to share/unshare memory, and then complete 
> the transaction
> when userspace invokes the ioctl() to complete the share/unshare.

That makes sense.

If KVM goes this route, perhaps there the allowed states for a GPA should include private, shared, and also private-and-shared.  Then anyone who wanted to use the same masked GPA for shared and private on TDX could do so if they wanted to.
Chao Peng May 23, 2022, 1:21 p.m. UTC | #4
On Fri, May 20, 2022 at 06:31:02PM +0000, Sean Christopherson wrote:
> On Fri, May 20, 2022, Andy Lutomirski wrote:
> > The alternative would be to have some kind of separate table or bitmap (part
> > of the memslot?) that tells KVM whether a GPA should map to the fd.
> > 
> > What do you all think?
> 
> My original proposal was to have expolicit shared vs. private memslots, and punch
> holes in KVM's memslots on conversion, but due to the way KVM (and userspace)
> handle memslot updates, conversions would be painfully slow.  That's how we ended
> up with the current propsoal.
> 
> But a dedicated KVM ioctl() to add/remove shared ranges would be easy to implement
> and wouldn't necessarily even need to interact with the memslots.  It could be a
> consumer of memslots, e.g. if we wanted to disallow registering regions without an
> associated memslot, but I think we'd want to avoid even that because things will
> get messy during memslot updates, e.g. if dirty logging is toggled or a shared
> memory region is temporarily removed then we wouldn't want to destroy the tracking.

Even we don't tight that to memslots, that info can only be effective
for private memslot, right? Setting this ioctl to memory ranges defined
in a traditional non-private memslots just makes no sense, I guess we can
comment that in the API document.

> 
> I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray
> should be far more efficient.

What about the mis-behaved guest? I don't want to design for the worst
case, but people may raise concern on the attack from such guest.

> 
> One benefit to explicitly tracking this in KVM is that it might be useful for
> software-only protected VMs, e.g. KVM could mark a region in the XArray as "pending"
> based on guest hypercalls to share/unshare memory, and then complete the transaction
> when userspace invokes the ioctl() to complete the share/unshare.

OK, then this can be another field of states/flags/attributes. Let me
dig up certain level of details:

First, introduce below KVM ioctl

KVM_SET_MEMORY_ATTR

struct kvm_memory_attr {
	__u64 addr;	/* page aligned */
	__u64 size;	/* page aligned */
#define KVM_MEMORY_ATTR_SHARED		(1 << 0)
#define KVM_MEMORY_ATTR_PRIVATE		(1 << 1)
	__u64 flags;
}

Second, check the KVM maintained guest memory attributes in page fault
handler (instead of checking memory existence in private fd)

Third, the memfile_notifier_ops (populate/invalidate) will be removed
from current code, the old mapping zapping can be directly handled in
this new KVM ioctl().
 
Thought?

Since this info is stored in KVM, which I think is reasonable. But for
other potential memfile_notifier users like VFIO, some KVM-to-VFIO APIs
might be needed depends on the implementaion.

It is also possible to maintain this info purely in userspace. The only
trick bit is implicit conversion support that has to be checked in KVM
page fault handler and is in the fast path.

Thanks,
Chao
Sean Christopherson May 23, 2022, 3:22 p.m. UTC | #5
On Mon, May 23, 2022, Chao Peng wrote:
> On Fri, May 20, 2022 at 06:31:02PM +0000, Sean Christopherson wrote:
> > On Fri, May 20, 2022, Andy Lutomirski wrote:
> > > The alternative would be to have some kind of separate table or bitmap (part
> > > of the memslot?) that tells KVM whether a GPA should map to the fd.
> > > 
> > > What do you all think?
> > 
> > My original proposal was to have expolicit shared vs. private memslots, and punch
> > holes in KVM's memslots on conversion, but due to the way KVM (and userspace)
> > handle memslot updates, conversions would be painfully slow.  That's how we ended
> > up with the current propsoal.
> > 
> > But a dedicated KVM ioctl() to add/remove shared ranges would be easy to implement
> > and wouldn't necessarily even need to interact with the memslots.  It could be a
> > consumer of memslots, e.g. if we wanted to disallow registering regions without an
> > associated memslot, but I think we'd want to avoid even that because things will
> > get messy during memslot updates, e.g. if dirty logging is toggled or a shared
> > memory region is temporarily removed then we wouldn't want to destroy the tracking.
> 
> Even we don't tight that to memslots, that info can only be effective
> for private memslot, right? Setting this ioctl to memory ranges defined
> in a traditional non-private memslots just makes no sense, I guess we can
> comment that in the API document.

Hrm, applying it universally would be funky, e.g. emulated MMIO would need to be
declared "shared".  But, applying it selectively would arguably be worse, e.g.
letting userspace map memory into the guest as shared for a region that's registered
as private...

On option to that mess would be to make memory shared by default, and so userspace
must declare regions that are private.  Then there's no weirdness with emulated MMIO
or "legacy" memslots.

On page fault, KVM does a lookup to see if the GPA is shared or private.  If the
GPA is private, but there is no memslot or the memslot doesn't have a private fd,
KVM exits to userspace.  If there's a memslot with a private fd, the shared/private
flag is used to resolve the 

And to handle the ioctl(), KVM can use kvm_zap_gfn_range(), which will bump the
notifier sequence, i.e. force the page fault to retry if the GPA may have been
(un)registered between checking the type and acquiring mmu_lock.

> > I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray
> > should be far more efficient.
> 
> What about the mis-behaved guest? I don't want to design for the worst
> case, but people may raise concern on the attack from such guest.

That's why cgroups exist.  E.g. a malicious/broken L1 can similarly abuse nested
EPT/NPT to generate a large number of shadow page tables.

> > One benefit to explicitly tracking this in KVM is that it might be useful for
> > software-only protected VMs, e.g. KVM could mark a region in the XArray as "pending"
> > based on guest hypercalls to share/unshare memory, and then complete the transaction
> > when userspace invokes the ioctl() to complete the share/unshare.
> 
> OK, then this can be another field of states/flags/attributes. Let me
> dig up certain level of details:
> 
> First, introduce below KVM ioctl
> 
> KVM_SET_MEMORY_ATTR

Actually, if the semantics are that userspace declares memory as private, then we
can reuse KVM_MEMORY_ENCRYPT_REG_REGION and KVM_MEMORY_ENCRYPT_UNREG_REGION.  It'd
be a little gross because we'd need to slightly redefine the semantics for TDX, SNP,
and software-protected VM types, e.g. the ioctls() currently require a pre-exisitng
memslot.  But I think it'd work...

I'll think more on this...
Chao Peng May 30, 2022, 1:26 p.m. UTC | #6
On Mon, May 23, 2022 at 03:22:32PM +0000, Sean Christopherson wrote:
> On Mon, May 23, 2022, Chao Peng wrote:
> > On Fri, May 20, 2022 at 06:31:02PM +0000, Sean Christopherson wrote:
> > > On Fri, May 20, 2022, Andy Lutomirski wrote:
> > > > The alternative would be to have some kind of separate table or bitmap (part
> > > > of the memslot?) that tells KVM whether a GPA should map to the fd.
> > > > 
> > > > What do you all think?
> > > 
> > > My original proposal was to have expolicit shared vs. private memslots, and punch
> > > holes in KVM's memslots on conversion, but due to the way KVM (and userspace)
> > > handle memslot updates, conversions would be painfully slow.  That's how we ended
> > > up with the current propsoal.
> > > 
> > > But a dedicated KVM ioctl() to add/remove shared ranges would be easy to implement
> > > and wouldn't necessarily even need to interact with the memslots.  It could be a
> > > consumer of memslots, e.g. if we wanted to disallow registering regions without an
> > > associated memslot, but I think we'd want to avoid even that because things will
> > > get messy during memslot updates, e.g. if dirty logging is toggled or a shared
> > > memory region is temporarily removed then we wouldn't want to destroy the tracking.
> > 
> > Even we don't tight that to memslots, that info can only be effective
> > for private memslot, right? Setting this ioctl to memory ranges defined
> > in a traditional non-private memslots just makes no sense, I guess we can
> > comment that in the API document.
> 
> Hrm, applying it universally would be funky, e.g. emulated MMIO would need to be
> declared "shared".  But, applying it selectively would arguably be worse, e.g.
> letting userspace map memory into the guest as shared for a region that's registered
> as private...
> 
> On option to that mess would be to make memory shared by default, and so userspace
> must declare regions that are private.  Then there's no weirdness with emulated MMIO
> or "legacy" memslots.
> 
> On page fault, KVM does a lookup to see if the GPA is shared or private.  If the
> GPA is private, but there is no memslot or the memslot doesn't have a private fd,
> KVM exits to userspace.  If there's a memslot with a private fd, the shared/private
> flag is used to resolve the 
> 
> And to handle the ioctl(), KVM can use kvm_zap_gfn_range(), which will bump the
> notifier sequence, i.e. force the page fault to retry if the GPA may have been
> (un)registered between checking the type and acquiring mmu_lock.

Yeah, that makes sense.

> 
> > > I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray
> > > should be far more efficient.
> > 
> > What about the mis-behaved guest? I don't want to design for the worst
> > case, but people may raise concern on the attack from such guest.
> 
> That's why cgroups exist.  E.g. a malicious/broken L1 can similarly abuse nested
> EPT/NPT to generate a large number of shadow page tables.

I havn't seen we had that in KVM. Is there any plan/discussion to add that?

> 
> > > One benefit to explicitly tracking this in KVM is that it might be useful for
> > > software-only protected VMs, e.g. KVM could mark a region in the XArray as "pending"
> > > based on guest hypercalls to share/unshare memory, and then complete the transaction
> > > when userspace invokes the ioctl() to complete the share/unshare.
> > 
> > OK, then this can be another field of states/flags/attributes. Let me
> > dig up certain level of details:
> > 
> > First, introduce below KVM ioctl
> > 
> > KVM_SET_MEMORY_ATTR
> 
> Actually, if the semantics are that userspace declares memory as private, then we
> can reuse KVM_MEMORY_ENCRYPT_REG_REGION and KVM_MEMORY_ENCRYPT_UNREG_REGION.  It'd
> be a little gross because we'd need to slightly redefine the semantics for TDX, SNP,
> and software-protected VM types, e.g. the ioctls() currently require a pre-exisitng
> memslot.  But I think it'd work...

These existing ioctls looks good for TDX and probably SNP as well. For
softrware-protected VM types, it may not be enough. Maybe for the first
step we can reuse this for all hardware based solutions and invent new
interface when software-protected solution gets really supported.

There is semantics difference for fd-based private memory. Current above
two ioctls() use userspace addreess(hva) while for fd-based it should be
fd+offset, and probably it's better to use gpa in this case. Then we
will need change existing semantics and break backward-compatibility.

Chao

> 
> I'll think more on this...
Sean Christopherson June 10, 2022, 4:14 p.m. UTC | #7
On Mon, May 30, 2022, Chao Peng wrote:
> On Mon, May 23, 2022 at 03:22:32PM +0000, Sean Christopherson wrote:
> > Actually, if the semantics are that userspace declares memory as private, then we
> > can reuse KVM_MEMORY_ENCRYPT_REG_REGION and KVM_MEMORY_ENCRYPT_UNREG_REGION.  It'd
> > be a little gross because we'd need to slightly redefine the semantics for TDX, SNP,
> > and software-protected VM types, e.g. the ioctls() currently require a pre-exisitng
> > memslot.  But I think it'd work...
> 
> These existing ioctls looks good for TDX and probably SNP as well. For
> softrware-protected VM types, it may not be enough. Maybe for the first
> step we can reuse this for all hardware based solutions and invent new
> interface when software-protected solution gets really supported.
> 
> There is semantics difference for fd-based private memory. Current above
> two ioctls() use userspace addreess(hva) while for fd-based it should be
> fd+offset, and probably it's better to use gpa in this case. Then we
> will need change existing semantics and break backward-compatibility.

My thought was to keep the existing semantics for VMs with type==0, i.e. SEV and
SEV-ES VMs.  It's a bit gross, but the pinning behavior is a dead end for SNP and
TDX, so it effectively needs to be deprecated anyways.  I'm definitely not opposed
to a new ioctl if Paolo or others think this is too awful, but burning an ioctl
for this seems wasteful.

Then generic KVM can do something like:

	case KVM_MEMORY_ENCRYPT_REG_REGION:
	case KVM_MEMORY_ENCRYPT_UNREG_REGION:
		struct kvm_enc_region region;

		if (!kvm_arch_vm_supports_private_memslots(kvm))
			goto arch_vm_ioctl;

		r = -EFAULT;
		if (copy_from_user(&region, argp, sizeof(region)))
			goto out;

		r = kvm_set_encrypted_region(ioctl, &region);
		break;
	default:
arch_vm_ioctl:
		r = kvm_arch_vm_ioctl(filp, ioctl, arg);


where common KVM provides

  __weak void kvm_arch_vm_supports_private_memslots(struct kvm *kvm)
  {
	return false;
  }

and x86 overrides that to

  bool kvm_arch_vm_supports_private_memslots(struct kvm *kvm)
  {
  	/* I can't remember what we decided on calling type '0' VMs. */
	return !!kvm->vm_type;
  }

and if someone ever wants to enable private memslot for SEV/SEV-ES guests we can
always add a capability or even a new VM type.

pKVM on arm can then obviously implement kvm_arch_vm_supports_private_memslots()
to grab whatever identifies a pKVM VM.
Chao Peng June 14, 2022, 6:45 a.m. UTC | #8
On Fri, Jun 10, 2022 at 04:14:21PM +0000, Sean Christopherson wrote:
> On Mon, May 30, 2022, Chao Peng wrote:
> > On Mon, May 23, 2022 at 03:22:32PM +0000, Sean Christopherson wrote:
> > > Actually, if the semantics are that userspace declares memory as private, then we
> > > can reuse KVM_MEMORY_ENCRYPT_REG_REGION and KVM_MEMORY_ENCRYPT_UNREG_REGION.  It'd
> > > be a little gross because we'd need to slightly redefine the semantics for TDX, SNP,
> > > and software-protected VM types, e.g. the ioctls() currently require a pre-exisitng
> > > memslot.  But I think it'd work...
> > 
> > These existing ioctls looks good for TDX and probably SNP as well. For
> > softrware-protected VM types, it may not be enough. Maybe for the first
> > step we can reuse this for all hardware based solutions and invent new
> > interface when software-protected solution gets really supported.
> > 
> > There is semantics difference for fd-based private memory. Current above
> > two ioctls() use userspace addreess(hva) while for fd-based it should be
> > fd+offset, and probably it's better to use gpa in this case. Then we
> > will need change existing semantics and break backward-compatibility.
> 
> My thought was to keep the existing semantics for VMs with type==0, i.e. SEV and
> SEV-ES VMs.  It's a bit gross, but the pinning behavior is a dead end for SNP and
> TDX, so it effectively needs to be deprecated anyways. 

Yes agreed.

> I'm definitely not opposed
> to a new ioctl if Paolo or others think this is too awful, but burning an ioctl
> for this seems wasteful.

Yes, I also feel confortable if it's acceptable to reuse kvm_enc_region
to pass _gpa_ range for this new type.

> 
> Then generic KVM can do something like:
> 
> 	case KVM_MEMORY_ENCRYPT_REG_REGION:
> 	case KVM_MEMORY_ENCRYPT_UNREG_REGION:
> 		struct kvm_enc_region region;
> 
> 		if (!kvm_arch_vm_supports_private_memslots(kvm))
> 			goto arch_vm_ioctl;
> 
> 		r = -EFAULT;
> 		if (copy_from_user(&region, argp, sizeof(region)))
> 			goto out;
> 
> 		r = kvm_set_encrypted_region(ioctl, &region);
> 		break;
> 	default:
> arch_vm_ioctl:
> 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
> 
> 
> where common KVM provides
> 
>   __weak void kvm_arch_vm_supports_private_memslots(struct kvm *kvm)
>   {
> 	return false;
>   }

I already had kvm_arch_private_mem_supported() introduced in patch-07
so that can be reused.

> 
> and x86 overrides that to
> 
>   bool kvm_arch_vm_supports_private_memslots(struct kvm *kvm)
>   {
>   	/* I can't remember what we decided on calling type '0' VMs. */
> 	return !!kvm->vm_type;
>   }
> 
> and if someone ever wants to enable private memslot for SEV/SEV-ES guests we can
> always add a capability or even a new VM type.
> 
> pKVM on arm can then obviously implement kvm_arch_vm_supports_private_memslots()
> to grab whatever identifies a pKVM VM.
Sean Christopherson June 17, 2022, 8:52 p.m. UTC | #9
On Thu, May 19, 2022, Chao Peng wrote:
> @@ -653,12 +662,12 @@ struct kvm_irq_routing_table {
>  };
>  #endif
>  
> -#ifndef KVM_PRIVATE_MEM_SLOTS
> -#define KVM_PRIVATE_MEM_SLOTS 0
> +#ifndef KVM_INTERNAL_MEM_SLOTS
> +#define KVM_INTERNAL_MEM_SLOTS 0
>  #endif

This rename belongs in a separate patch.

>  #define KVM_MEM_SLOTS_NUM SHRT_MAX
> -#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_PRIVATE_MEM_SLOTS)
> +#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS)
>  
>  #ifndef __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
>  static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
> @@ -1087,9 +1096,9 @@ enum kvm_mr_change {
>  };
>  
>  int kvm_set_memory_region(struct kvm *kvm,
> -			  const struct kvm_userspace_memory_region *mem);
> +			  const struct kvm_user_mem_region *mem);
>  int __kvm_set_memory_region(struct kvm *kvm,
> -			    const struct kvm_userspace_memory_region *mem);
> +			    const struct kvm_user_mem_region *mem);
>  void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
>  void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
>  int kvm_arch_prepare_memory_region(struct kvm *kvm,
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index e10d131edd80..28cacd3656d4 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -103,6 +103,29 @@ struct kvm_userspace_memory_region {
>  	__u64 userspace_addr; /* start of the userspace allocated memory */
>  };
>  
> +struct kvm_userspace_memory_region_ext {
> +	struct kvm_userspace_memory_region region;
> +	__u64 private_offset;
> +	__u32 private_fd;
> +	__u32 pad1;
> +	__u64 pad2[14];
> +};
> +
> +#ifdef __KERNEL__
> +/* Internal helper, the layout must match above user visible structures */

It's worth explicity calling out which structureso this aliases.  And rather than
add a comment about the layout needing to match that, enforce it in code. I
personally wouldn't bother with an expolicit comment about the layout, IMO that's
a fairly obvious implication of aliasing.

/*
 * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
 * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
 * all fields from the top-level "extended" region.
 */


And I think it's in this patch that you missed a conversion to the alias, in the
prototype for check_memory_region_flags() (looks like it gets fixed up later in
the series).

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0f81bf0407be..8765b334477d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1466,7 +1466,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
        }
 }

-static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
+static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
 {
        u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;

@@ -4514,6 +4514,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
        return fd;
 }

+#define SANITY_CHECK_MEM_REGION_FIELD(field)                                   \
+do {                                                                           \
+       BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=             \
+                    offsetof(struct kvm_userspace_memory_region, field));      \
+       BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=         \
+                    sizeof_field(struct kvm_userspace_memory_region, field));  \
+} while (0)
+
+#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field)                                       \
+do {                                                                                   \
+       BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=                     \
+                    offsetof(struct kvm_userspace_memory_region_ext, field));          \
+       BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=                 \
+                    sizeof_field(struct kvm_userspace_memory_region_ext, field));      \
+} while (0)
+
+static void kvm_sanity_check_user_mem_region_alias(void)
+{
+       SANITY_CHECK_MEM_REGION_FIELD(slot);
+       SANITY_CHECK_MEM_REGION_FIELD(flags);
+       SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
+       SANITY_CHECK_MEM_REGION_FIELD(memory_size);
+       SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
+       SANITY_CHECK_MEM_REGION_EXT_FIELD(private_offset);
+       SANITY_CHECK_MEM_REGION_EXT_FIELD(private_fd);
+}
+
 static long kvm_vm_ioctl(struct file *filp,
                           unsigned int ioctl, unsigned long arg)
 {
@@ -4541,6 +4568,8 @@ static long kvm_vm_ioctl(struct file *filp,
                unsigned long size;
                u32 flags;

+               kvm_sanity_check_user_mem_region_alias();
+
                memset(&mem, 0, sizeof(mem));

                r = -EFAULT;

> +struct kvm_user_mem_region {
> +	__u32 slot;
> +	__u32 flags;
> +	__u64 guest_phys_addr;
> +	__u64 memory_size;
> +	__u64 userspace_addr;
> +	__u64 private_offset;
> +	__u32 private_fd;
> +	__u32 pad1;
> +	__u64 pad2[14];
> +};
> +#endif
> +
>  /*
>   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
>   * other bits are reserved for kvm internal use which are defined in
> @@ -110,6 +133,7 @@ struct kvm_userspace_memory_region {
>   */
>  #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
>  #define KVM_MEM_READONLY	(1UL << 1)
> +#define KVM_MEM_PRIVATE		(1UL << 2)

Hmm, KVM_MEM_PRIVATE is technically wrong now that a "private" memslot maps private
and/or shared memory.  Strictly speaking, we don't actually need a new flag.  Valid
file descriptors must be >=0, so the logic for specifying a memslot that can be
converted between private and shared could be that "(int)private_fd < 0" means
"not convertible", i.e. derive the flag from private_fd.

And looking at the two KVM consumers of the flag, via kvm_slot_is_private(), they're
both wrong.  Both kvm_faultin_pfn() and kvm_mmu_max_mapping_level() should operate
on the _fault_, not the slot.  So it would actually be a positive to not have an easy
way to query if a slot supports conversion.

>  /* for KVM_IRQ_LINE */
>  struct kvm_irq_level {

...

> +		if (flags & KVM_MEM_PRIVATE) {

An added bonus of dropping KVM_MEM_PRIVATE is that these checks go away.

> +			r = -EINVAL;
> +			goto out;
> +		}
> +
> +		size = sizeof(struct kvm_userspace_memory_region);
> +
> +		if (copy_from_user(&mem, argp, size))
> +			goto out;
> +
> +		r = -EINVAL;
> +		if ((flags ^ mem.flags) & KVM_MEM_PRIVATE)
>  			goto out;
>  
> -		r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> +		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
>  		break;
>  	}
>  	case KVM_GET_DIRTY_LOG: {
> -- 
> 2.25.1
>
Sean Christopherson June 17, 2022, 9:27 p.m. UTC | #10
On Fri, Jun 17, 2022, Sean Christopherson wrote:
> > @@ -110,6 +133,7 @@ struct kvm_userspace_memory_region {
> >   */
> >  #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
> >  #define KVM_MEM_READONLY	(1UL << 1)
> > +#define KVM_MEM_PRIVATE		(1UL << 2)
> 
> Hmm, KVM_MEM_PRIVATE is technically wrong now that a "private" memslot maps private
> and/or shared memory.  Strictly speaking, we don't actually need a new flag.  Valid
> file descriptors must be >=0, so the logic for specifying a memslot that can be
> converted between private and shared could be that "(int)private_fd < 0" means
> "not convertible", i.e. derive the flag from private_fd.
> 
> And looking at the two KVM consumers of the flag, via kvm_slot_is_private(), they're
> both wrong.  Both kvm_faultin_pfn() and kvm_mmu_max_mapping_level() should operate
> on the _fault_, not the slot.  So it would actually be a positive to not have an easy
> way to query if a slot supports conversion.

I take that back, the usage in kvm_faultin_pfn() is correct, but the names ends
up being confusing because it suggests that it always faults in a private pfn.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b6d75016e48c..e1008f00609d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4045,7 +4045,7 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
                        return RET_PF_EMULATE;
        }

-       if (fault->is_private) {
+       if (kvm_slot_can_be_private(slot)) {
                r = kvm_faultin_pfn_private(vcpu, fault);
                if (r != RET_PF_CONTINUE)
                        return r == RET_PF_FIXED ? RET_PF_CONTINUE : r;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 31f704c83099..c5126190fb71 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -583,9 +583,9 @@ struct kvm_memory_slot {
        struct kvm *kvm;
 };

-static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
+static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
 {
-       return slot && (slot->flags & KVM_MEM_PRIVATE);
+       return slot && !!slot->private_file;
 }

 static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
Chao Peng June 20, 2022, 2:08 p.m. UTC | #11
On Fri, Jun 17, 2022 at 08:52:15PM +0000, Sean Christopherson wrote:
> On Thu, May 19, 2022, Chao Peng wrote:
> > @@ -653,12 +662,12 @@ struct kvm_irq_routing_table {
> >  };
> >  #endif
> >  
> > -#ifndef KVM_PRIVATE_MEM_SLOTS
> > -#define KVM_PRIVATE_MEM_SLOTS 0
> > +#ifndef KVM_INTERNAL_MEM_SLOTS
> > +#define KVM_INTERNAL_MEM_SLOTS 0
> >  #endif
> 
> This rename belongs in a separate patch.

Will separate it out, thanks.

> 
> >  #define KVM_MEM_SLOTS_NUM SHRT_MAX
> > -#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_PRIVATE_MEM_SLOTS)
> > +#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS)
> >  
> >  #ifndef __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
> >  static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
> > @@ -1087,9 +1096,9 @@ enum kvm_mr_change {
> >  };
> >  
> >  int kvm_set_memory_region(struct kvm *kvm,
> > -			  const struct kvm_userspace_memory_region *mem);
> > +			  const struct kvm_user_mem_region *mem);
> >  int __kvm_set_memory_region(struct kvm *kvm,
> > -			    const struct kvm_userspace_memory_region *mem);
> > +			    const struct kvm_user_mem_region *mem);
> >  void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
> >  void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
> >  int kvm_arch_prepare_memory_region(struct kvm *kvm,
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index e10d131edd80..28cacd3656d4 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -103,6 +103,29 @@ struct kvm_userspace_memory_region {
> >  	__u64 userspace_addr; /* start of the userspace allocated memory */
> >  };
> >  
> > +struct kvm_userspace_memory_region_ext {
> > +	struct kvm_userspace_memory_region region;
> > +	__u64 private_offset;
> > +	__u32 private_fd;
> > +	__u32 pad1;
> > +	__u64 pad2[14];
> > +};
> > +
> > +#ifdef __KERNEL__
> > +/* Internal helper, the layout must match above user visible structures */
> 
> It's worth explicity calling out which structureso this aliases.  And rather than
> add a comment about the layout needing to match that, enforce it in code. I
> personally wouldn't bother with an expolicit comment about the layout, IMO that's
> a fairly obvious implication of aliasing.
> 
> /*
>  * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
>  * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
>  * all fields from the top-level "extended" region.
>  */
> 

Thanks.

> 
> And I think it's in this patch that you missed a conversion to the alias, in the
> prototype for check_memory_region_flags() (looks like it gets fixed up later in
> the series).
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 0f81bf0407be..8765b334477d 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1466,7 +1466,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
>         }
>  }
> 
> -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
> +static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
>  {
>         u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> 
> @@ -4514,6 +4514,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
>         return fd;
>  }
> 
> +#define SANITY_CHECK_MEM_REGION_FIELD(field)                                   \
> +do {                                                                           \
> +       BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=             \
> +                    offsetof(struct kvm_userspace_memory_region, field));      \
> +       BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=         \
> +                    sizeof_field(struct kvm_userspace_memory_region, field));  \
> +} while (0)
> +
> +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field)                                       \
> +do {                                                                                   \
> +       BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=                     \
> +                    offsetof(struct kvm_userspace_memory_region_ext, field));          \
> +       BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=                 \
> +                    sizeof_field(struct kvm_userspace_memory_region_ext, field));      \
> +} while (0)
> +
> +static void kvm_sanity_check_user_mem_region_alias(void)
> +{
> +       SANITY_CHECK_MEM_REGION_FIELD(slot);
> +       SANITY_CHECK_MEM_REGION_FIELD(flags);
> +       SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
> +       SANITY_CHECK_MEM_REGION_FIELD(memory_size);
> +       SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
> +       SANITY_CHECK_MEM_REGION_EXT_FIELD(private_offset);
> +       SANITY_CHECK_MEM_REGION_EXT_FIELD(private_fd);
> +}
> +
>  static long kvm_vm_ioctl(struct file *filp,
>                            unsigned int ioctl, unsigned long arg)
>  {
> @@ -4541,6 +4568,8 @@ static long kvm_vm_ioctl(struct file *filp,
>                 unsigned long size;
>                 u32 flags;
> 
> +               kvm_sanity_check_user_mem_region_alias();
> +
>                 memset(&mem, 0, sizeof(mem));
> 
>                 r = -EFAULT;
> 
> > +struct kvm_user_mem_region {
> > +	__u32 slot;
> > +	__u32 flags;
> > +	__u64 guest_phys_addr;
> > +	__u64 memory_size;
> > +	__u64 userspace_addr;
> > +	__u64 private_offset;
> > +	__u32 private_fd;
> > +	__u32 pad1;
> > +	__u64 pad2[14];
> > +};
> > +#endif
> > +
> >  /*
> >   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
> >   * other bits are reserved for kvm internal use which are defined in
> > @@ -110,6 +133,7 @@ struct kvm_userspace_memory_region {
> >   */
> >  #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
> >  #define KVM_MEM_READONLY	(1UL << 1)
> > +#define KVM_MEM_PRIVATE		(1UL << 2)
> 
> Hmm, KVM_MEM_PRIVATE is technically wrong now that a "private" memslot maps private
> and/or shared memory.  Strictly speaking, we don't actually need a new flag.  Valid
> file descriptors must be >=0, so the logic for specifying a memslot that can be
> converted between private and shared could be that "(int)private_fd < 0" means
> "not convertible", i.e. derive the flag from private_fd.

I think a flag is still needed, the problem is private_fd can be safely
accessed only when this flag is set, e.g. without this flag, we can't
copy_from_user these new fields since they don't exist for previous
kvm_userspace_memory_region callers.

> 
> And looking at the two KVM consumers of the flag, via kvm_slot_is_private(), they're
> both wrong.  Both kvm_faultin_pfn() and kvm_mmu_max_mapping_level() should operate
> on the _fault_, not the slot.  So it would actually be a positive to not have an easy
> way to query if a slot supports conversion.
> 
> >  /* for KVM_IRQ_LINE */
> >  struct kvm_irq_level {
> 
> ...
> 
> > +		if (flags & KVM_MEM_PRIVATE) {
> 
> An added bonus of dropping KVM_MEM_PRIVATE is that these checks go away.
> 
> > +			r = -EINVAL;
> > +			goto out;
> > +		}
> > +
> > +		size = sizeof(struct kvm_userspace_memory_region);
> > +
> > +		if (copy_from_user(&mem, argp, size))
> > +			goto out;
> > +
> > +		r = -EINVAL;
> > +		if ((flags ^ mem.flags) & KVM_MEM_PRIVATE)
> >  			goto out;
> >  
> > -		r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> > +		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> >  		break;
> >  	}
> >  	case KVM_GET_DIRTY_LOG: {
> > -- 
> > 2.25.1
> >
Chao Peng June 20, 2022, 2:09 p.m. UTC | #12
On Fri, Jun 17, 2022 at 09:27:25PM +0000, Sean Christopherson wrote:
> On Fri, Jun 17, 2022, Sean Christopherson wrote:
> > > @@ -110,6 +133,7 @@ struct kvm_userspace_memory_region {
> > >   */
> > >  #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
> > >  #define KVM_MEM_READONLY	(1UL << 1)
> > > +#define KVM_MEM_PRIVATE		(1UL << 2)
> > 
> > Hmm, KVM_MEM_PRIVATE is technically wrong now that a "private" memslot maps private
> > and/or shared memory.  Strictly speaking, we don't actually need a new flag.  Valid
> > file descriptors must be >=0, so the logic for specifying a memslot that can be
> > converted between private and shared could be that "(int)private_fd < 0" means
> > "not convertible", i.e. derive the flag from private_fd.
> > 
> > And looking at the two KVM consumers of the flag, via kvm_slot_is_private(), they're
> > both wrong.  Both kvm_faultin_pfn() and kvm_mmu_max_mapping_level() should operate
> > on the _fault_, not the slot.  So it would actually be a positive to not have an easy
> > way to query if a slot supports conversion.
> 
> I take that back, the usage in kvm_faultin_pfn() is correct, but the names ends
> up being confusing because it suggests that it always faults in a private pfn.

Make sense, will change the naming, thanks.

> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index b6d75016e48c..e1008f00609d 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4045,7 +4045,7 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>                         return RET_PF_EMULATE;
>         }
> 
> -       if (fault->is_private) {
> +       if (kvm_slot_can_be_private(slot)) {
>                 r = kvm_faultin_pfn_private(vcpu, fault);
>                 if (r != RET_PF_CONTINUE)
>                         return r == RET_PF_FIXED ? RET_PF_CONTINUE : r;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 31f704c83099..c5126190fb71 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -583,9 +583,9 @@ struct kvm_memory_slot {
>         struct kvm *kvm;
>  };
> 
> -static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
> +static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
>  {
> -       return slot && (slot->flags & KVM_MEM_PRIVATE);
> +       return slot && !!slot->private_file;
>  }
> 
>  static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
Michael Roth June 23, 2022, 10:59 p.m. UTC | #13
On Fri, May 20, 2022 at 06:31:02PM +0000, Sean Christopherson wrote:
> On Fri, May 20, 2022, Andy Lutomirski wrote:
> > The alternative would be to have some kind of separate table or bitmap (part
> > of the memslot?) that tells KVM whether a GPA should map to the fd.
> > 
> > What do you all think?
> 
> My original proposal was to have expolicit shared vs. private memslots, and punch
> holes in KVM's memslots on conversion, but due to the way KVM (and userspace)
> handle memslot updates, conversions would be painfully slow.  That's how we ended
> up with the current propsoal.
> 
> But a dedicated KVM ioctl() to add/remove shared ranges would be easy to implement
> and wouldn't necessarily even need to interact with the memslots.  It could be a
> consumer of memslots, e.g. if we wanted to disallow registering regions without an
> associated memslot, but I think we'd want to avoid even that because things will
> get messy during memslot updates, e.g. if dirty logging is toggled or a shared
> memory region is temporarily removed then we wouldn't want to destroy the tracking.
> 
> I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray
> should be far more efficient.
> 
> One benefit to explicitly tracking this in KVM is that it might be useful for
> software-only protected VMs, e.g. KVM could mark a region in the XArray as "pending"
> based on guest hypercalls to share/unshare memory, and then complete the transaction
> when userspace invokes the ioctl() to complete the share/unshare.

Another upside to implementing a KVM ioctl is basically the reverse of the
discussion around avoiding double-allocations: *supporting* double-allocations.

One thing I noticed while testing SNP+UPM support is a fairly dramatic
slow-down with how it handles OVMF, which does some really nasty stuff
with DMA where it takes 1 or 2 pages and flips them between
shared/private on every transaction. Obviously that's not ideal and
should be fixed directly at some point, but it's something that exists in the
wild and might not be the only such instance where we need to deal with that
sort of usage pattern. 

With the current implementation, one option I had to address this was to
disable hole-punching in QEMU when doing shared->private conversions:

Boot time from 1GB guest:
                               SNP:   32s
                           SNP+UPM: 1m43s
  SNP+UPM (disable shared discard): 1m08s

Of course, we don't have the option of disabling discard/hole-punching
for private memory to see if we get similar gains there, since that also
doubles as the interface for doing private->shared conversions. A separate
KVM ioctl to decouple these 2 things would allow for that, and allow for a
way for userspace to implement things like batched/lazy-discard of
previously-converted pages to deal with cases like these.

Another motivator for these separate ioctl is that, since we're considering
'out-of-band' interactions with private memfd where userspace might
erroneously/inadvertently do things like double allocations, another thing it
might do is pre-allocating pages in the private memfd prior to associating
the memfd with a private memslot. Since the notifiers aren't registered until
that point, any associated callbacks that would normally need to be done as
part of those fallocate() notification would be missed unless we do something
like 'replay' all the notifications once the private memslot is registered and
associating with a memfile notifier. But that seems a bit ugly, and I'm not
sure how well that would work. This also seems to hint at this additional
'conversion' state being something that should be owned and managed directly
by KVM rather than hooking into the allocations.

It would also nicely solve the question of how to handle in-place
encryption, since unlike userspace, KVM is perfectly capable of copying
data from shared->private prior to conversion / guest start, and
disallowing such things afterward. Would just need an extra flag basically.
Chao Peng June 24, 2022, 8:54 a.m. UTC | #14
On Thu, Jun 23, 2022 at 05:59:49PM -0500, Michael Roth wrote:
> On Fri, May 20, 2022 at 06:31:02PM +0000, Sean Christopherson wrote:
> > On Fri, May 20, 2022, Andy Lutomirski wrote:
> > > The alternative would be to have some kind of separate table or bitmap (part
> > > of the memslot?) that tells KVM whether a GPA should map to the fd.
> > > 
> > > What do you all think?
> > 
> > My original proposal was to have expolicit shared vs. private memslots, and punch
> > holes in KVM's memslots on conversion, but due to the way KVM (and userspace)
> > handle memslot updates, conversions would be painfully slow.  That's how we ended
> > up with the current propsoal.
> > 
> > But a dedicated KVM ioctl() to add/remove shared ranges would be easy to implement
> > and wouldn't necessarily even need to interact with the memslots.  It could be a
> > consumer of memslots, e.g. if we wanted to disallow registering regions without an
> > associated memslot, but I think we'd want to avoid even that because things will
> > get messy during memslot updates, e.g. if dirty logging is toggled or a shared
> > memory region is temporarily removed then we wouldn't want to destroy the tracking.
> > 
> > I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray
> > should be far more efficient.
> > 
> > One benefit to explicitly tracking this in KVM is that it might be useful for
> > software-only protected VMs, e.g. KVM could mark a region in the XArray as "pending"
> > based on guest hypercalls to share/unshare memory, and then complete the transaction
> > when userspace invokes the ioctl() to complete the share/unshare.
> 
> Another upside to implementing a KVM ioctl is basically the reverse of the
> discussion around avoiding double-allocations: *supporting* double-allocations.
> 
> One thing I noticed while testing SNP+UPM support is a fairly dramatic
> slow-down with how it handles OVMF, which does some really nasty stuff
> with DMA where it takes 1 or 2 pages and flips them between
> shared/private on every transaction. Obviously that's not ideal and
> should be fixed directly at some point, but it's something that exists in the
> wild and might not be the only such instance where we need to deal with that
> sort of usage pattern. 
> 
> With the current implementation, one option I had to address this was to
> disable hole-punching in QEMU when doing shared->private conversions:
> 
> Boot time from 1GB guest:
>                                SNP:   32s
>                            SNP+UPM: 1m43s
>   SNP+UPM (disable shared discard): 1m08s
> 
> Of course, we don't have the option of disabling discard/hole-punching
> for private memory to see if we get similar gains there, since that also
> doubles as the interface for doing private->shared conversions.

Private should be the same, minus time consumed for private memory, the
data should be close to SNP case. You can't try that in current version
due to we rely on the existence of the private page to tell a page is
private.

> A separate
> KVM ioctl to decouple these 2 things would allow for that, and allow for a
> way for userspace to implement things like batched/lazy-discard of
> previously-converted pages to deal with cases like these.

The planned ioctl includes two responsibilities:
  - Mark the range as private/shared
  - Zap the existing SLPT mapping for the range

Whether doing the hole-punching or not on the fd is unrelated to this
ioctl, userspace has freedom to do that or not. Since we don't reply on
the fact that private memoy should have been allocated, we can support
lazy faulting and don't need explicit fallocate(). That means, whether
the memory is discarded or not in the memory backing store is not
required by KVM, but be a userspace option.

> 
> Another motivator for these separate ioctl is that, since we're considering
> 'out-of-band' interactions with private memfd where userspace might
> erroneously/inadvertently do things like double allocations, another thing it
> might do is pre-allocating pages in the private memfd prior to associating
> the memfd with a private memslot. Since the notifiers aren't registered until
> that point, any associated callbacks that would normally need to be done as
> part of those fallocate() notification would be missed unless we do something
> like 'replay' all the notifications once the private memslot is registered and
> associating with a memfile notifier. But that seems a bit ugly, and I'm not
> sure how well that would work. This also seems to hint at this additional
> 'conversion' state being something that should be owned and managed directly
> by KVM rather than hooking into the allocations.

Right, once we move the private/shared state into KVM then we don't rely
on those callbacks so the 'replay' thing is unneeded. fallocate()
notification is useless for sure, invalidate() is likely still needed,
just like the invalidate for mmu_notifier to bump the mmu_seq and do the
zap.

> 
> It would also nicely solve the question of how to handle in-place
> encryption, since unlike userspace, KVM is perfectly capable of copying
> data from shared->private prior to conversion / guest start, and
> disallowing such things afterward. Would just need an extra flag basically.

Agree it's possible to do additional copy during the conversion but I'm
not so confident this is urgent and the right API. Currently TDX does
not have this need. Maybe as the first step just add the conversion
itself. Adding additional feature like this can always be possible
whenever we are clear.

Thanks,
Chao
Michael Roth June 24, 2022, 1:01 p.m. UTC | #15
On Fri, Jun 24, 2022 at 04:54:26PM +0800, Chao Peng wrote:
> On Thu, Jun 23, 2022 at 05:59:49PM -0500, Michael Roth wrote:
> > On Fri, May 20, 2022 at 06:31:02PM +0000, Sean Christopherson wrote:
> > > On Fri, May 20, 2022, Andy Lutomirski wrote:
> > > > The alternative would be to have some kind of separate table or bitmap (part
> > > > of the memslot?) that tells KVM whether a GPA should map to the fd.
> > > > 
> > > > What do you all think?
> > > 
> > > My original proposal was to have expolicit shared vs. private memslots, and punch
> > > holes in KVM's memslots on conversion, but due to the way KVM (and userspace)
> > > handle memslot updates, conversions would be painfully slow.  That's how we ended
> > > up with the current propsoal.
> > > 
> > > But a dedicated KVM ioctl() to add/remove shared ranges would be easy to implement
> > > and wouldn't necessarily even need to interact with the memslots.  It could be a
> > > consumer of memslots, e.g. if we wanted to disallow registering regions without an
> > > associated memslot, but I think we'd want to avoid even that because things will
> > > get messy during memslot updates, e.g. if dirty logging is toggled or a shared
> > > memory region is temporarily removed then we wouldn't want to destroy the tracking.
> > > 
> > > I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray
> > > should be far more efficient.
> > > 
> > > One benefit to explicitly tracking this in KVM is that it might be useful for
> > > software-only protected VMs, e.g. KVM could mark a region in the XArray as "pending"
> > > based on guest hypercalls to share/unshare memory, and then complete the transaction
> > > when userspace invokes the ioctl() to complete the share/unshare.
> > 
> > Another upside to implementing a KVM ioctl is basically the reverse of the
> > discussion around avoiding double-allocations: *supporting* double-allocations.
> > 
> > One thing I noticed while testing SNP+UPM support is a fairly dramatic
> > slow-down with how it handles OVMF, which does some really nasty stuff
> > with DMA where it takes 1 or 2 pages and flips them between
> > shared/private on every transaction. Obviously that's not ideal and
> > should be fixed directly at some point, but it's something that exists in the
> > wild and might not be the only such instance where we need to deal with that
> > sort of usage pattern. 
> > 
> > With the current implementation, one option I had to address this was to
> > disable hole-punching in QEMU when doing shared->private conversions:
> > 
> > Boot time from 1GB guest:
> >                                SNP:   32s
> >                            SNP+UPM: 1m43s
> >   SNP+UPM (disable shared discard): 1m08s
> > 
> > Of course, we don't have the option of disabling discard/hole-punching
> > for private memory to see if we get similar gains there, since that also
> > doubles as the interface for doing private->shared conversions.
> 
> Private should be the same, minus time consumed for private memory, the
> data should be close to SNP case. You can't try that in current version
> due to we rely on the existence of the private page to tell a page is
> private.
> 
> > A separate
> > KVM ioctl to decouple these 2 things would allow for that, and allow for a
> > way for userspace to implement things like batched/lazy-discard of
> > previously-converted pages to deal with cases like these.
> 
> The planned ioctl includes two responsibilities:
>   - Mark the range as private/shared
>   - Zap the existing SLPT mapping for the range
> 
> Whether doing the hole-punching or not on the fd is unrelated to this
> ioctl, userspace has freedom to do that or not. Since we don't reply on
> the fact that private memoy should have been allocated, we can support
> lazy faulting and don't need explicit fallocate(). That means, whether
> the memory is discarded or not in the memory backing store is not
> required by KVM, but be a userspace option.

Nice, that sounds promising.

> 
> > 
> > Another motivator for these separate ioctl is that, since we're considering
> > 'out-of-band' interactions with private memfd where userspace might
> > erroneously/inadvertently do things like double allocations, another thing it
> > might do is pre-allocating pages in the private memfd prior to associating
> > the memfd with a private memslot. Since the notifiers aren't registered until
> > that point, any associated callbacks that would normally need to be done as
> > part of those fallocate() notification would be missed unless we do something
> > like 'replay' all the notifications once the private memslot is registered and
> > associating with a memfile notifier. But that seems a bit ugly, and I'm not
> > sure how well that would work. This also seems to hint at this additional
> > 'conversion' state being something that should be owned and managed directly
> > by KVM rather than hooking into the allocations.
> 
> Right, once we move the private/shared state into KVM then we don't rely
> on those callbacks so the 'replay' thing is unneeded. fallocate()
> notification is useless for sure, invalidate() is likely still needed,
> just like the invalidate for mmu_notifier to bump the mmu_seq and do the
> zap.

Ok, yah, makes sense that we'd still up needing the invalidation hooks.

> 
> > 
> > It would also nicely solve the question of how to handle in-place
> > encryption, since unlike userspace, KVM is perfectly capable of copying
> > data from shared->private prior to conversion / guest start, and
> > disallowing such things afterward. Would just need an extra flag basically.
> 
> Agree it's possible to do additional copy during the conversion but I'm
> not so confident this is urgent and the right API. Currently TDX does
> not have this need. Maybe as the first step just add the conversion
> itself. Adding additional feature like this can always be possible
> whenever we are clear.

That seems fair. In the meantime we can adopt the approach proposed by
Sean and Vishal[1] and handle it directly in the relevant SNP KVM ioctls.

If we end up keeping that approach we'll probably want to make sure these
KVM-driven 'implicit' conversions are documented in the KVM/SNP API so that
userspace can account for it in it's view of what's private/shared. In this
case at least it's pretty obvious, just thinking of when other archs and
VMMs utilizing this more.

Thanks!

-Mike

[1] https://lore.kernel.org/kvm/20220524205646.1798325-4-vannapurve@google.com/T/#m1e9bb782b1bea66c36ae7c4c9f4f0c35c2d7e338

> 
> Thanks,
> Chao
diff mbox series

Patch

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 23baf7fce038..b959445b64cc 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1311,7 +1311,7 @@  yet and must be cleared on entry.
 :Capability: KVM_CAP_USER_MEMORY
 :Architectures: all
 :Type: vm ioctl
-:Parameters: struct kvm_userspace_memory_region (in)
+:Parameters: struct kvm_userspace_memory_region(_ext) (in)
 :Returns: 0 on success, -1 on error
 
 ::
@@ -1324,9 +1324,18 @@  yet and must be cleared on entry.
 	__u64 userspace_addr; /* start of the userspace allocated memory */
   };
 
+  struct kvm_userspace_memory_region_ext {
+	struct kvm_userspace_memory_region region;
+	__u64 private_offset;
+	__u32 private_fd;
+	__u32 pad1;
+	__u64 pad2[14];
+};
+
   /* for kvm_memory_region::flags */
   #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
   #define KVM_MEM_READONLY	(1UL << 1)
+  #define KVM_MEM_PRIVATE		(1UL << 2)
 
 This ioctl allows the user to create, modify or delete a guest physical
 memory slot.  Bits 0-15 of "slot" specify the slot id and this value
@@ -1357,12 +1366,27 @@  It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
 be identical.  This allows large pages in the guest to be backed by large
 pages in the host.
 
-The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
-KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
-writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
-use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
-to make a new slot read-only.  In this case, writes to this memory will be
-posted to userspace as KVM_EXIT_MMIO exits.
+kvm_userspace_memory_region_ext includes all the kvm_userspace_memory_region
+fields. It also includes additional fields for some specific features. See
+below description of flags field for more information. It's recommended to use
+kvm_userspace_memory_region_ext in new userspace code.
+
+The flags field supports below flags:
+
+- KVM_MEM_LOG_DIRTY_PAGES can be set to instruct KVM to keep track of writes to
+  memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to use it.
+
+- KVM_MEM_READONLY can be set, if KVM_CAP_READONLY_MEM capability allows it, to
+  make a new slot read-only.  In this case, writes to this memory will be posted
+  to userspace as KVM_EXIT_MMIO exits.
+
+- KVM_MEM_PRIVATE can be set to indicate a new slot has private memory backed by
+  a file descirptor(fd) and the content of the private memory is invisible to
+  userspace. In this case, userspace should use private_fd/private_offset in
+  kvm_userspace_memory_region_ext to instruct KVM to provide private memory to
+  guest. Userspace should guarantee not to map the same pfn indicated by
+  private_fd/private_offset to different gfns with multiple memslots. Failed to
+  do this may result undefined behavior.
 
 When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
 the memory region are automatically reflected into the guest.  For example, an
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 717716cc51c5..45a978c805bc 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -85,7 +85,7 @@ 
 
 #define KVM_MAX_VCPUS		16
 /* memory slots that does not exposed to userspace */
-#define KVM_PRIVATE_MEM_SLOTS	0
+#define KVM_INTERNAL_MEM_SLOTS	0
 
 #define KVM_HALT_POLL_NS_DEFAULT 500000
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c59fea4bdb6e..3f5e81ef77c8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -53,7 +53,7 @@ 
 #define KVM_MAX_VCPU_IDS (KVM_MAX_VCPUS * KVM_VCPU_ID_RATIO)
 
 /* memory slots that are not exposed to userspace */
-#define KVM_PRIVATE_MEM_SLOTS 3
+#define KVM_INTERNAL_MEM_SLOTS 3
 
 #define KVM_HALT_POLL_NS_DEFAULT 200000
 
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index e3cbd7706136..1f160801e2a7 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -48,6 +48,8 @@  config KVM
 	select SRCU
 	select INTERVAL_TREE
 	select HAVE_KVM_PM_NOTIFIER if PM
+	select HAVE_KVM_PRIVATE_MEM if X86_64
+	select MEMFILE_NOTIFIER if HAVE_KVM_PRIVATE_MEM
 	help
 	  Support hosting fully virtualized guest machines using hardware
 	  virtualization extensions.  You will need a fairly recent
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8ee8c91fa762..d873ae56b01a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11910,7 +11910,7 @@  void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
 	}
 
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
-		struct kvm_userspace_memory_region m;
+		struct kvm_user_mem_region m;
 
 		m.slot = id | (i << 16);
 		m.flags = 0;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f94f72bbd2d3..3fd168972ecd 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -44,6 +44,7 @@ 
 
 #include <asm/kvm_host.h>
 #include <linux/kvm_dirty_ring.h>
+#include <linux/memfile_notifier.h>
 
 #ifndef KVM_MAX_VCPU_IDS
 #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
@@ -573,8 +574,16 @@  struct kvm_memory_slot {
 	u32 flags;
 	short id;
 	u16 as_id;
+	struct file *private_file;
+	loff_t private_offset;
+	struct memfile_notifier notifier;
 };
 
+static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
+{
+	return slot && (slot->flags & KVM_MEM_PRIVATE);
+}
+
 static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
 {
 	return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
@@ -653,12 +662,12 @@  struct kvm_irq_routing_table {
 };
 #endif
 
-#ifndef KVM_PRIVATE_MEM_SLOTS
-#define KVM_PRIVATE_MEM_SLOTS 0
+#ifndef KVM_INTERNAL_MEM_SLOTS
+#define KVM_INTERNAL_MEM_SLOTS 0
 #endif
 
 #define KVM_MEM_SLOTS_NUM SHRT_MAX
-#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_PRIVATE_MEM_SLOTS)
+#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS)
 
 #ifndef __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
 static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
@@ -1087,9 +1096,9 @@  enum kvm_mr_change {
 };
 
 int kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem);
+			  const struct kvm_user_mem_region *mem);
 int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region *mem);
+			    const struct kvm_user_mem_region *mem);
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
 void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index e10d131edd80..28cacd3656d4 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -103,6 +103,29 @@  struct kvm_userspace_memory_region {
 	__u64 userspace_addr; /* start of the userspace allocated memory */
 };
 
+struct kvm_userspace_memory_region_ext {
+	struct kvm_userspace_memory_region region;
+	__u64 private_offset;
+	__u32 private_fd;
+	__u32 pad1;
+	__u64 pad2[14];
+};
+
+#ifdef __KERNEL__
+/* Internal helper, the layout must match above user visible structures */
+struct kvm_user_mem_region {
+	__u32 slot;
+	__u32 flags;
+	__u64 guest_phys_addr;
+	__u64 memory_size;
+	__u64 userspace_addr;
+	__u64 private_offset;
+	__u32 private_fd;
+	__u32 pad1;
+	__u64 pad2[14];
+};
+#endif
+
 /*
  * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
  * other bits are reserved for kvm internal use which are defined in
@@ -110,6 +133,7 @@  struct kvm_userspace_memory_region {
  */
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
+#define KVM_MEM_PRIVATE		(1UL << 2)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index a8c5c9f06b3c..ccaff13cc5b8 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -72,3 +72,6 @@  config KVM_XFER_TO_GUEST_WORK
 
 config HAVE_KVM_PM_NOTIFIER
        bool
+
+config HAVE_KVM_PRIVATE_MEM
+       bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e089db822c12..db9d39a2d3a6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1830,7 +1830,7 @@  static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
  * Must be called holding kvm->slots_lock for write.
  */
 int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region *mem)
+			    const struct kvm_user_mem_region *mem)
 {
 	struct kvm_memory_slot *old, *new;
 	struct kvm_memslots *slots;
@@ -1934,7 +1934,7 @@  int __kvm_set_memory_region(struct kvm *kvm,
 EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
 
 int kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem)
+			  const struct kvm_user_mem_region *mem)
 {
 	int r;
 
@@ -1946,7 +1946,7 @@  int kvm_set_memory_region(struct kvm *kvm,
 EXPORT_SYMBOL_GPL(kvm_set_memory_region);
 
 static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
-					  struct kvm_userspace_memory_region *mem)
+					  struct kvm_user_mem_region *mem)
 {
 	if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
 		return -EINVAL;
@@ -4500,14 +4500,33 @@  static long kvm_vm_ioctl(struct file *filp,
 		break;
 	}
 	case KVM_SET_USER_MEMORY_REGION: {
-		struct kvm_userspace_memory_region kvm_userspace_mem;
+		struct kvm_user_mem_region mem;
+		unsigned long size;
+		u32 flags;
+
+		memset(&mem, 0, sizeof(mem));
 
 		r = -EFAULT;
-		if (copy_from_user(&kvm_userspace_mem, argp,
-						sizeof(kvm_userspace_mem)))
+
+		if (get_user(flags,
+			(u32 __user *)(argp + offsetof(typeof(mem), flags))))
+			goto out;
+
+		if (flags & KVM_MEM_PRIVATE) {
+			r = -EINVAL;
+			goto out;
+		}
+
+		size = sizeof(struct kvm_userspace_memory_region);
+
+		if (copy_from_user(&mem, argp, size))
+			goto out;
+
+		r = -EINVAL;
+		if ((flags ^ mem.flags) & KVM_MEM_PRIVATE)
 			goto out;
 
-		r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
+		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
 		break;
 	}
 	case KVM_GET_DIRTY_LOG: {