diff mbox series

[v6,3/8] mm/memfd: Introduce MFD_INACCESSIBLE flag

Message ID 20220519153713.819591-4-chao.p.peng@linux.intel.com (mailing list archive)
State New, archived
Headers show
Series KVM: mm: fd-based approach for supporting KVM guest private memory | expand

Commit Message

Chao Peng May 19, 2022, 3:37 p.m. UTC
Introduce a new memfd_create() flag indicating the content of the
created memfd is inaccessible from userspace through ordinary MMU
access (e.g., read/write/mmap). However, the file content can be
accessed via a different mechanism (e.g. KVM MMU) indirectly.

It provides semantics required for KVM guest private memory support
that a file descriptor with this flag set is going to be used as the
source of guest memory in confidential computing environments such
as Intel TDX/AMD SEV but may not be accessible from host userspace.

The flag can not coexist with MFD_ALLOW_SEALING, future sealing is
also impossible for a memfd created with this flag.

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/uapi/linux/memfd.h |  1 +
 mm/memfd.c                 | 15 ++++++++++++++-
 2 files changed, 15 insertions(+), 1 deletion(-)

Comments

Vishal Annapurve May 31, 2022, 7:15 p.m. UTC | #1
On Thu, May 19, 2022 at 8:41 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> Introduce a new memfd_create() flag indicating the content of the
> created memfd is inaccessible from userspace through ordinary MMU
> access (e.g., read/write/mmap). However, the file content can be
> accessed via a different mechanism (e.g. KVM MMU) indirectly.
>

SEV, TDX, pkvm and software-only VMs seem to have usecases to set up
initial guest boot memory with the needed blobs.
TDX already supports a KVM IOCTL to transfer contents to private
memory using the TDX module but rest of the implementations will need
to invent
a way to do this.

Is there a plan to support a common implementation for either allowing
initial write access from userspace to private fd or adding a KVM
IOCTL to transfer contents to such a file,
as part of this series through future revisions?

Regards,
Vishal
Chao Peng June 1, 2022, 10:17 a.m. UTC | #2
On Tue, May 31, 2022 at 12:15:00PM -0700, Vishal Annapurve wrote:
> On Thu, May 19, 2022 at 8:41 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > Introduce a new memfd_create() flag indicating the content of the
> > created memfd is inaccessible from userspace through ordinary MMU
> > access (e.g., read/write/mmap). However, the file content can be
> > accessed via a different mechanism (e.g. KVM MMU) indirectly.
> >
> 
> SEV, TDX, pkvm and software-only VMs seem to have usecases to set up
> initial guest boot memory with the needed blobs.
> TDX already supports a KVM IOCTL to transfer contents to private
> memory using the TDX module but rest of the implementations will need
> to invent
> a way to do this.

There are some discussions in https://lkml.org/lkml/2022/5/9/1292
already. I somehow agree with Sean. TDX is using an dedicated ioctl to
copy guest boot memory to private fd so the rest can do that similarly.
The concern is the performance (extra memcpy) but it's trivial since the
initial guest payload is usually optimized in size.

> 
> Is there a plan to support a common implementation for either allowing
> initial write access from userspace to private fd or adding a KVM
> IOCTL to transfer contents to such a file,
> as part of this series through future revisions?

Indeed, adding pre-boot private memory populating on current design
isn't impossible, but there are still some opens, e.g. how to expose
private fd to userspace for access, pKVM and CC usages may have
different requirements. Before that's well-studied I would tend to not
add that and instead use an ioctl to copy. Whether we need a generic
ioctl or feature-specific ioctl, I don't have strong opinion here.
Current TDX uses a feature-specific ioctl so it's not covered in this
series.

Chao
> 
> Regards,
> Vishal
Gupta, Pankaj June 1, 2022, 12:11 p.m. UTC | #3
>>> Introduce a new memfd_create() flag indicating the content of the
>>> created memfd is inaccessible from userspace through ordinary MMU
>>> access (e.g., read/write/mmap). However, the file content can be
>>> accessed via a different mechanism (e.g. KVM MMU) indirectly.
>>>
>>
>> SEV, TDX, pkvm and software-only VMs seem to have usecases to set up
>> initial guest boot memory with the needed blobs.
>> TDX already supports a KVM IOCTL to transfer contents to private
>> memory using the TDX module but rest of the implementations will need
>> to invent
>> a way to do this.
> 
> There are some discussions in https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.org%2Flkml%2F2022%2F5%2F9%2F1292&amp;data=05%7C01%7Cpankaj.gupta%40amd.com%7Cb81ef334e2dd44c6143308da43b87d17%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637896756895977587%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=oQbM2Hj7GlhJTwnTM%2FPnwsfJlmTL7JR9ULBysAqm6V8%3D&amp;reserved=0
> already. I somehow agree with Sean. TDX is using an dedicated ioctl to
> copy guest boot memory to private fd so the rest can do that similarly.
> The concern is the performance (extra memcpy) but it's trivial since the
> initial guest payload is usually optimized in size.
> 
>>
>> Is there a plan to support a common implementation for either allowing
>> initial write access from userspace to private fd or adding a KVM
>> IOCTL to transfer contents to such a file,
>> as part of this series through future revisions?
> 
> Indeed, adding pre-boot private memory populating on current design
> isn't impossible, but there are still some opens, e.g. how to expose
> private fd to userspace for access, pKVM and CC usages may have
> different requirements. Before that's well-studied I would tend to not
> add that and instead use an ioctl to copy. Whether we need a generic
> ioctl or feature-specific ioctl, I don't have strong opinion here.
> Current TDX uses a feature-specific ioctl so it's not covered in this
> series.

Common function or ioctl to populate preboot private memory actually 
makes sense.

Sorry, did not follow much of TDX code yet, Is it possible to filter out
the current TDX specific ioctl to common function so that it can be used 
by other technologies?

Thanks,
Pankaj
Chao Peng June 2, 2022, 10:07 a.m. UTC | #4
On Wed, Jun 01, 2022 at 02:11:42PM +0200, Gupta, Pankaj wrote:
> 
> > > > Introduce a new memfd_create() flag indicating the content of the
> > > > created memfd is inaccessible from userspace through ordinary MMU
> > > > access (e.g., read/write/mmap). However, the file content can be
> > > > accessed via a different mechanism (e.g. KVM MMU) indirectly.
> > > > 
> > > 
> > > SEV, TDX, pkvm and software-only VMs seem to have usecases to set up
> > > initial guest boot memory with the needed blobs.
> > > TDX already supports a KVM IOCTL to transfer contents to private
> > > memory using the TDX module but rest of the implementations will need
> > > to invent
> > > a way to do this.
> > 
> > There are some discussions in https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.org%2Flkml%2F2022%2F5%2F9%2F1292&amp;data=05%7C01%7Cpankaj.gupta%40amd.com%7Cb81ef334e2dd44c6143308da43b87d17%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637896756895977587%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=oQbM2Hj7GlhJTwnTM%2FPnwsfJlmTL7JR9ULBysAqm6V8%3D&amp;reserved=0
> > already. I somehow agree with Sean. TDX is using an dedicated ioctl to
> > copy guest boot memory to private fd so the rest can do that similarly.
> > The concern is the performance (extra memcpy) but it's trivial since the
> > initial guest payload is usually optimized in size.
> > 
> > > 
> > > Is there a plan to support a common implementation for either allowing
> > > initial write access from userspace to private fd or adding a KVM
> > > IOCTL to transfer contents to such a file,
> > > as part of this series through future revisions?
> > 
> > Indeed, adding pre-boot private memory populating on current design
> > isn't impossible, but there are still some opens, e.g. how to expose
> > private fd to userspace for access, pKVM and CC usages may have
> > different requirements. Before that's well-studied I would tend to not
> > add that and instead use an ioctl to copy. Whether we need a generic
> > ioctl or feature-specific ioctl, I don't have strong opinion here.
> > Current TDX uses a feature-specific ioctl so it's not covered in this
> > series.
> 
> Common function or ioctl to populate preboot private memory actually makes
> sense.
> 
> Sorry, did not follow much of TDX code yet, Is it possible to filter out
> the current TDX specific ioctl to common function so that it can be used by
> other technologies?

TDX code is here:
https://patchwork.kernel.org/project/kvm/patch/70ed041fd47c1f7571aa259450b3f9244edda48d.1651774250.git.isaku.yamahata@intel.com/

AFAICS It might be possible to filter that out to a common function. But
would like to hear from Paolo/Sean for their opinion.

Chao
> 
> Thanks,
> Pankaj
Sean Christopherson June 14, 2022, 8:23 p.m. UTC | #5
On Thu, Jun 02, 2022, Chao Peng wrote:
> On Wed, Jun 01, 2022 at 02:11:42PM +0200, Gupta, Pankaj wrote:
> > 
> > > > > Introduce a new memfd_create() flag indicating the content of the
> > > > > created memfd is inaccessible from userspace through ordinary MMU
> > > > > access (e.g., read/write/mmap). However, the file content can be
> > > > > accessed via a different mechanism (e.g. KVM MMU) indirectly.
> > > > > 
> > > > 
> > > > SEV, TDX, pkvm and software-only VMs seem to have usecases to set up
> > > > initial guest boot memory with the needed blobs.
> > > > TDX already supports a KVM IOCTL to transfer contents to private
> > > > memory using the TDX module but rest of the implementations will need
> > > > to invent
> > > > a way to do this.
> > > 
> > > There are some discussions in https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.org%2Flkml%2F2022%2F5%2F9%2F1292&amp;data=05%7C01%7Cpankaj.gupta%40amd.com%7Cb81ef334e2dd44c6143308da43b87d17%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637896756895977587%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=oQbM2Hj7GlhJTwnTM%2FPnwsfJlmTL7JR9ULBysAqm6V8%3D&amp;reserved=0
> > > already. I somehow agree with Sean. TDX is using an dedicated ioctl to
> > > copy guest boot memory to private fd so the rest can do that similarly.
> > > The concern is the performance (extra memcpy) but it's trivial since the
> > > initial guest payload is usually optimized in size.
> > > 
> > > > 
> > > > Is there a plan to support a common implementation for either allowing
> > > > initial write access from userspace to private fd or adding a KVM
> > > > IOCTL to transfer contents to such a file,
> > > > as part of this series through future revisions?
> > > 
> > > Indeed, adding pre-boot private memory populating on current design
> > > isn't impossible, but there are still some opens, e.g. how to expose
> > > private fd to userspace for access, pKVM and CC usages may have
> > > different requirements. Before that's well-studied I would tend to not
> > > add that and instead use an ioctl to copy. Whether we need a generic
> > > ioctl or feature-specific ioctl, I don't have strong opinion here.
> > > Current TDX uses a feature-specific ioctl so it's not covered in this
> > > series.
> > 
> > Common function or ioctl to populate preboot private memory actually makes
> > sense.
> > 
> > Sorry, did not follow much of TDX code yet, Is it possible to filter out
> > the current TDX specific ioctl to common function so that it can be used by
> > other technologies?
> 
> TDX code is here:
> https://patchwork.kernel.org/project/kvm/patch/70ed041fd47c1f7571aa259450b3f9244edda48d.1651774250.git.isaku.yamahata@intel.com/
> 
> AFAICS It might be possible to filter that out to a common function. But
> would like to hear from Paolo/Sean for their opinion.

Eh, I wouldn't put too much effort into creating a common helper, I would be very
surprised if TDX and SNP can share a meaningful amount of code that isn't already
shared, e.g. provided by MMU helpers.

The only part I truly care about sharing is whatever ioctl(s) get added, i.e. I
don't want to end up with two ioctls that do the same thing for TDX vs. SNP.
Chao Peng June 15, 2022, 8:53 a.m. UTC | #6
On Tue, Jun 14, 2022 at 08:23:46PM +0000, Sean Christopherson wrote:
> On Thu, Jun 02, 2022, Chao Peng wrote:
> > On Wed, Jun 01, 2022 at 02:11:42PM +0200, Gupta, Pankaj wrote:
> > > 
> > > > > > Introduce a new memfd_create() flag indicating the content of the
> > > > > > created memfd is inaccessible from userspace through ordinary MMU
> > > > > > access (e.g., read/write/mmap). However, the file content can be
> > > > > > accessed via a different mechanism (e.g. KVM MMU) indirectly.
> > > > > > 
> > > > > 
> > > > > SEV, TDX, pkvm and software-only VMs seem to have usecases to set up
> > > > > initial guest boot memory with the needed blobs.
> > > > > TDX already supports a KVM IOCTL to transfer contents to private
> > > > > memory using the TDX module but rest of the implementations will need
> > > > > to invent
> > > > > a way to do this.
> > > > 
> > > > There are some discussions in https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.org%2Flkml%2F2022%2F5%2F9%2F1292&amp;data=05%7C01%7Cpankaj.gupta%40amd.com%7Cb81ef334e2dd44c6143308da43b87d17%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637896756895977587%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=oQbM2Hj7GlhJTwnTM%2FPnwsfJlmTL7JR9ULBysAqm6V8%3D&amp;reserved=0
> > > > already. I somehow agree with Sean. TDX is using an dedicated ioctl to
> > > > copy guest boot memory to private fd so the rest can do that similarly.
> > > > The concern is the performance (extra memcpy) but it's trivial since the
> > > > initial guest payload is usually optimized in size.
> > > > 
> > > > > 
> > > > > Is there a plan to support a common implementation for either allowing
> > > > > initial write access from userspace to private fd or adding a KVM
> > > > > IOCTL to transfer contents to such a file,
> > > > > as part of this series through future revisions?
> > > > 
> > > > Indeed, adding pre-boot private memory populating on current design
> > > > isn't impossible, but there are still some opens, e.g. how to expose
> > > > private fd to userspace for access, pKVM and CC usages may have
> > > > different requirements. Before that's well-studied I would tend to not
> > > > add that and instead use an ioctl to copy. Whether we need a generic
> > > > ioctl or feature-specific ioctl, I don't have strong opinion here.
> > > > Current TDX uses a feature-specific ioctl so it's not covered in this
> > > > series.
> > > 
> > > Common function or ioctl to populate preboot private memory actually makes
> > > sense.
> > > 
> > > Sorry, did not follow much of TDX code yet, Is it possible to filter out
> > > the current TDX specific ioctl to common function so that it can be used by
> > > other technologies?
> > 
> > TDX code is here:
> > https://patchwork.kernel.org/project/kvm/patch/70ed041fd47c1f7571aa259450b3f9244edda48d.1651774250.git.isaku.yamahata@intel.com/
> > 
> > AFAICS It might be possible to filter that out to a common function. But
> > would like to hear from Paolo/Sean for their opinion.
> 
> Eh, I wouldn't put too much effort into creating a common helper, I would be very
> surprised if TDX and SNP can share a meaningful amount of code that isn't already
> shared, e.g. provided by MMU helpers.
> 
> The only part I truly care about sharing is whatever ioctl(s) get added, i.e. I
> don't want to end up with two ioctls that do the same thing for TDX vs. SNP.

OK, then that part would be better to be added in TDX or SNP series.

Chao
diff mbox series

Patch

diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
index 7a8a26751c23..48750474b904 100644
--- a/include/uapi/linux/memfd.h
+++ b/include/uapi/linux/memfd.h
@@ -8,6 +8,7 @@ 
 #define MFD_CLOEXEC		0x0001U
 #define MFD_ALLOW_SEALING	0x0002U
 #define MFD_HUGETLB		0x0004U
+#define MFD_INACCESSIBLE	0x0008U
 
 /*
  * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/mm/memfd.c b/mm/memfd.c
index 08f5f8304746..775541d53f1b 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -18,6 +18,7 @@ 
 #include <linux/hugetlb.h>
 #include <linux/shmem_fs.h>
 #include <linux/memfd.h>
+#include <linux/memfile_notifier.h>
 #include <uapi/linux/memfd.h>
 
 /*
@@ -261,7 +262,8 @@  long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
 #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
 #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
 
-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
+#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
+		       MFD_INACCESSIBLE)
 
 SYSCALL_DEFINE2(memfd_create,
 		const char __user *, uname,
@@ -283,6 +285,10 @@  SYSCALL_DEFINE2(memfd_create,
 			return -EINVAL;
 	}
 
+	/* Disallow sealing when MFD_INACCESSIBLE is set. */
+	if (flags & MFD_INACCESSIBLE && flags & MFD_ALLOW_SEALING)
+		return -EINVAL;
+
 	/* length includes terminating zero */
 	len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
 	if (len <= 0)
@@ -329,12 +335,19 @@  SYSCALL_DEFINE2(memfd_create,
 	if (flags & MFD_ALLOW_SEALING) {
 		file_seals = memfd_file_seals_ptr(file);
 		*file_seals &= ~F_SEAL_SEAL;
+	} else if (flags & MFD_INACCESSIBLE) {
+		error = memfile_node_set_flags(file,
+					       MEMFILE_F_USER_INACCESSIBLE);
+		if (error)
+			goto err_file;
 	}
 
 	fd_install(fd, file);
 	kfree(name);
 	return fd;
 
+err_file:
+	fput(file);
 err_fd:
 	put_unused_fd(fd);
 err_name: