
[3/6] mm: introduce secretmemfd system call to create "secret" memory areas

Message ID 20200720092435.17469-4-rppt@kernel.org (mailing list archive)
State New, archived
Series mm: introduce secretmemfd system call to create "secret" memory areas

Commit Message

Mike Rapoport July 20, 2020, 9:24 a.m. UTC
From: Mike Rapoport <rppt@linux.ibm.com>

Introduce "secretmemfd" system call with the ability to create memory areas
visible only in the context of the owning process and not mapped not only
to other processes but in the kernel page tables as well.

The user creates a file descriptor using the secretmemfd system call;
the flags supplied as a parameter to this system call define the desired
protection mode for the memory associated with that file descriptor.
Currently there are two protection modes:

* exclusive - the memory area is unmapped from the kernel direct map and it
              is present only in the page tables of the owning mm.
* uncached  - the memory area is present only in the page tables of the
              owning mm and it is mapped there as uncached.

For instance, the following example will create an uncached mapping (error
handling is omitted):

	fd = secretmemfd(SECRETMEM_UNCACHED);
	ftruncate(fd, MAP_SIZE);
	ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
		   fd, 0);
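
A slightly fuller sketch of the same flow, with error handling (as above,
secretmemfd() is shown as a direct call for brevity; until a libc wrapper
exists it would be invoked via syscall(2), and secret/secret_len stand in
for the caller's sensitive data):

	fd = secretmemfd(SECRETMEM_UNCACHED);
	if (fd < 0)
		err(1, "secretmemfd");

	if (ftruncate(fd, MAP_SIZE) < 0)
		err(1, "ftruncate");

	ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
		   fd, 0);
	if (ptr == MAP_FAILED)
		err(1, "mmap");

	/* the mapping is visible only to the owning mm; with
	 * SECRETMEM_UNCACHED it is also mapped uncached */
	memcpy(ptr, secret, secret_len);

	munmap(ptr, MAP_SIZE);
	close(fd);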

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
---
 include/uapi/linux/magic.h     |   1 +
 include/uapi/linux/secretmem.h |   9 ++
 mm/Kconfig                     |   4 +
 mm/Makefile                    |   1 +
 mm/secretmem.c                 | 263 +++++++++++++++++++++++++++++++++
 5 files changed, 278 insertions(+)
 create mode 100644 include/uapi/linux/secretmem.h
 create mode 100644 mm/secretmem.c

Comments

Arnd Bergmann July 20, 2020, 11:30 a.m. UTC | #1
On Mon, Jul 20, 2020 at 11:25 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> From: Mike Rapoport <rppt@linux.ibm.com>
>
> Introduce "secretmemfd" system call with the ability to create memory areas
> visible only in the context of the owning process and not mapped not only
> to other processes but in the kernel page tables as well.
>
> The user will create a file descriptor using the secretmemfd system call
> where flags supplied as a parameter to this system call will define the
> desired protection mode for the memory associated with that file
> descriptor. Currently there are two protection modes:
>
> * exclusive - the memory area is unmapped from the kernel direct map and it
>               is present only in the page tables of the owning mm.
> * uncached  - the memory area is present only in the page tables of the
>               owning mm and it is mapped there as uncached.
>
> For instance, the following example will create an uncached mapping (error
> handling is omitted):
>
>         fd = secretmemfd(SECRETMEM_UNCACHED);
>         ftruncate(fd, MAP_SIZE);
>         ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
>                    fd, 0);
>
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>

I wonder if this should be more closely related to dmabuf file
descriptors, which
are already used for a similar purpose: sharing access to secret memory areas
that are not visible to the OS but can be shared with hardware through device
drivers that can import a dmabuf file descriptor.

      Arnd
Mike Rapoport July 20, 2020, 2:20 p.m. UTC | #2
On Mon, Jul 20, 2020 at 01:30:13PM +0200, Arnd Bergmann wrote:
> On Mon, Jul 20, 2020 at 11:25 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > From: Mike Rapoport <rppt@linux.ibm.com>
> >
> > Introduce "secretmemfd" system call with the ability to create memory areas
> > visible only in the context of the owning process and not mapped not only
> > to other processes but in the kernel page tables as well.
> >
> > The user will create a file descriptor using the secretmemfd system call
> > where flags supplied as a parameter to this system call will define the
> > desired protection mode for the memory associated with that file
> > descriptor. Currently there are two protection modes:
> >
> > * exclusive - the memory area is unmapped from the kernel direct map and it
> >               is present only in the page tables of the owning mm.
> > * uncached  - the memory area is present only in the page tables of the
> >               owning mm and it is mapped there as uncached.
> >
> > For instance, the following example will create an uncached mapping (error
> > handling is omitted):
> >
> >         fd = secretmemfd(SECRETMEM_UNCACHED);
> >         ftruncate(fd, MAP_SIZE);
> >         ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
> >                    fd, 0);
> >
> > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> 
> I wonder if this should be more closely related to dmabuf file
> descriptors, which
> are already used for a similar purpose: sharing access to secret memory areas
> that are not visible to the OS but can be shared with hardware through device
> drivers that can import a dmabuf file descriptor.

TBH, I didn't think about dmabuf, but my understanding is that in this
case the memory areas are not visible to the OS because they live in
device memory rather than normal RAM, and when a dmabuf is backed by
normal RAM, the memory is visible to the OS.

Did I miss anything?


>       Arnd
Arnd Bergmann July 20, 2020, 2:34 p.m. UTC | #3
On Mon, Jul 20, 2020 at 4:21 PM Mike Rapoport <rppt@kernel.org> wrote:
> On Mon, Jul 20, 2020 at 01:30:13PM +0200, Arnd Bergmann wrote:
> > On Mon, Jul 20, 2020 at 11:25 AM Mike Rapoport <rppt@kernel.org> wrote:
> > >
> > > From: Mike Rapoport <rppt@linux.ibm.com>
> > >
> > > Introduce "secretmemfd" system call with the ability to create memory areas
> > > visible only in the context of the owning process and not mapped not only
> > > to other processes but in the kernel page tables as well.
> > >
> > > The user will create a file descriptor using the secretmemfd system call
> > > where flags supplied as a parameter to this system call will define the
> > > desired protection mode for the memory associated with that file
> > > descriptor. Currently there are two protection modes:
> > >
> > > * exclusive - the memory area is unmapped from the kernel direct map and it
> > >               is present only in the page tables of the owning mm.
> > > * uncached  - the memory area is present only in the page tables of the
> > >               owning mm and it is mapped there as uncached.
> > >
> > > For instance, the following example will create an uncached mapping (error
> > > handling is omitted):
> > >
> > >         fd = secretmemfd(SECRETMEM_UNCACHED);
> > >         ftruncate(fd, MAP_SIZE);
> > >         ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
> > >                    fd, 0);
> > >
> > > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> >
> > I wonder if this should be more closely related to dmabuf file
> > descriptors, which
> > are already used for a similar purpose: sharing access to secret memory areas
> > that are not visible to the OS but can be shared with hardware through device
> > drivers that can import a dmabuf file descriptor.
>
> TBH, I didn't think about dmabuf, but my undestanding is that is this
> case memory areas are not visible to the OS because they are on device
> memory rather than normal RAM and when dmabuf is backed by the normal
> RAM, the memory is visible to the OS.

No, dmabuf is normally about normal RAM that is shared between multiple
devices, the idea is that you can have one driver allocate a buffer in RAM
and export it to user space through a file descriptor. The application can then
go and mmap() it or pass it into one or more other drivers.

This can be used e.g. for sharing a buffer between a video codec and the
gpu, or between a crypto engine and another device that accesses
unencrypted data while software can only observe the encrypted version.
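
As a rough userspace sketch of that flow (how buf_fd and size are obtained
from the exporting driver is device specific and only assumed here; error
handling omitted):

	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <linux/dma-buf.h>

	/* buf_fd: a dma-buf fd handed out by some exporting driver */
	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
		       buf_fd, 0);

	/* bracket CPU access with the sync ioctl so caches stay coherent */
	struct dma_buf_sync sync = {
		.flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_RW,
	};
	ioctl(buf_fd, DMA_BUF_IOCTL_SYNC, &sync);

	/* ... CPU reads and writes through p ... */

	sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_RW;
	ioctl(buf_fd, DMA_BUF_IOCTL_SYNC, &sync);

	/* the same buf_fd can then be passed to another importing driver */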

       Arnd
James Bottomley July 20, 2020, 3:51 p.m. UTC | #4
On Mon, 2020-07-20 at 13:30 +0200, Arnd Bergmann wrote:
> On Mon, Jul 20, 2020 at 11:25 AM Mike Rapoport <rppt@kernel.org>
> wrote:
> > 
> > From: Mike Rapoport <rppt@linux.ibm.com>
> > 
> > Introduce "secretmemfd" system call with the ability to create
> > memory areas visible only in the context of the owning process and
> > not mapped not only to other processes but in the kernel page
> > tables as well.
> > 
> > The user will create a file descriptor using the secretmemfd system
> > call where flags supplied as a parameter to this system call will
> > define the desired protection mode for the memory associated with
> > that file descriptor. Currently there are two protection modes:
> > 
> > * exclusive - the memory area is unmapped from the kernel direct
> > map and it
> >               is present only in the page tables of the owning mm.
> > * uncached  - the memory area is present only in the page tables of
> > the
> >               owning mm and it is mapped there as uncached.
> > 
> > For instance, the following example will create an uncached mapping
> > (error handling is omitted):
> > 
> >         fd = secretmemfd(SECRETMEM_UNCACHED);
> >         ftruncate(fd, MAP_SIZE);
> >         ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
> > MAP_SHARED,
> >                    fd, 0);
> > 
> > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> 
> I wonder if this should be more closely related to dmabuf file
> descriptors, which are already used for a similar purpose: sharing
> access to secret memory areas that are not visible to the OS but can
> be shared with hardware through device drivers that can import a
> dmabuf file descriptor.

I'll assume you mean the dmabuf userspace API?  Because the kernel API
is completely device exchange specific and wholly inappropriate for
this use case.

The user space API of dmabuf uses a pseudo-filesystem.  So you mount
the dmabuf file type (and by "you" I mean root because an ordinary user
doesn't have sufficient privilege).  This is basically because every
dmabuf is usable by any user who has permissions.  This really isn't
the initial interface we want for secret memory because secret regions
are supposed to be per process and not shared (at least we don't want
other tenants to see who's using what).

Once you have the fd, you can seek to find the size, mmap, poll and
ioctl it.  The ioctls are all to do with memory synchronization (as
you'd expect from a device backed region) and the mmap is handled by
the dma_buf_ops, which is device specific.  Sizing is missing because
that's reported by the device, not settable by the user.

What we want is the ability to get an fd, set the properties and the
size and mmap it.  This is pretty much a 100% overlap with the memfd
API and not much overlap with the dmabuf one, which is why I don't
think the interface is very well suited.
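
For comparison, a minimal sketch of that existing memfd flow (size is a
placeholder and the particular seals are only illustrative; error handling
omitted):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int fd = memfd_create("secret", MFD_CLOEXEC | MFD_ALLOW_SEALING);

	ftruncate(fd, size);			/* the user sets the size */

	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
		       fd, 0);

	/* "properties" via seals, e.g. forbid resizing from here on */
	fcntl(fd, F_ADD_SEALS, F_SEAL_GROW | F_SEAL_SHRINK);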

James
Mike Rapoport July 20, 2020, 5:46 p.m. UTC | #5
On Mon, Jul 20, 2020 at 04:34:12PM +0200, Arnd Bergmann wrote:
> On Mon, Jul 20, 2020 at 4:21 PM Mike Rapoport <rppt@kernel.org> wrote:
> > On Mon, Jul 20, 2020 at 01:30:13PM +0200, Arnd Bergmann wrote:
> > > On Mon, Jul 20, 2020 at 11:25 AM Mike Rapoport <rppt@kernel.org> wrote:
> > > >
> > > > From: Mike Rapoport <rppt@linux.ibm.com>
> > > >
> > > > Introduce "secretmemfd" system call with the ability to create memory areas
> > > > visible only in the context of the owning process and not mapped not only
> > > > to other processes but in the kernel page tables as well.
> > > >
> > > > The user will create a file descriptor using the secretmemfd system call
> > > > where flags supplied as a parameter to this system call will define the
> > > > desired protection mode for the memory associated with that file
> > > > descriptor. Currently there are two protection modes:
> > > >
> > > > * exclusive - the memory area is unmapped from the kernel direct map and it
> > > >               is present only in the page tables of the owning mm.
> > > > * uncached  - the memory area is present only in the page tables of the
> > > >               owning mm and it is mapped there as uncached.
> > > >
> > > > For instance, the following example will create an uncached mapping (error
> > > > handling is omitted):
> > > >
> > > >         fd = secretmemfd(SECRETMEM_UNCACHED);
> > > >         ftruncate(fd, MAP_SIZE);
> > > >         ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
> > > >                    fd, 0);
> > > >
> > > > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> > >
> > > I wonder if this should be more closely related to dmabuf file
> > > descriptors, which
> > > are already used for a similar purpose: sharing access to secret memory areas
> > > that are not visible to the OS but can be shared with hardware through device
> > > drivers that can import a dmabuf file descriptor.
> >
> > TBH, I didn't think about dmabuf, but my undestanding is that is this
> > case memory areas are not visible to the OS because they are on device
> > memory rather than normal RAM and when dmabuf is backed by the normal
> > RAM, the memory is visible to the OS.
> 
> No, dmabuf is normally about normal RAM that is shared between multiple
> devices, the idea is that you can have one driver allocate a buffer in RAM
> and export it to user space through a file descriptor. The application can then
> go and mmap() it or pass it into one or more other drivers.
> 
> This can be used e.g. for sharing a buffer between a video codec and the
> gpu, or between a crypto engine and another device that accesses
> unencrypted data while software can only observe the encrypted version.

For our use case, sharing is optional on one side and there are no
devices involved on the other.

As James pointed out, there is no match for the userspace API, and if a
use case emerges that requires integrating secretmem with dma-buf, we'll
deal with it then.

>        Arnd
Arnd Bergmann July 20, 2020, 6:08 p.m. UTC | #6
On Mon, Jul 20, 2020 at 5:52 PM James Bottomley <jejb@linux.ibm.com> wrote:
> On Mon, 2020-07-20 at 13:30 +0200, Arnd Bergmann wrote:
>
> I'll assume you mean the dmabuf userspace API?  Because the kernel API
> is completely device exchange specific and wholly inappropriate for
> this use case.
>
> The user space API of dmabuf uses a pseudo-filesystem.  So you mount
> the dmabuf file type (and by "you" I mean root because an ordinary user
> doesn't have sufficient privilege).  This is basically because every
> dmabuf is usable by any user who has permissions.  This really isn't
> the initial interface we want for secret memory because secret regions
> are supposed to be per process and not shared (at least we don't want
> other tenants to see who's using what).
>
> Once you have the fd, you can seek to find the size, mmap, poll and
> ioctl it.  The ioctls are all to do with memory synchronization (as
> you'd expect from a device backed region) and the mmap is handled by
> the dma_buf_ops, which is device specific.  Sizing is missing because
> that's reported by the device not settable by the user.

I was mainly talking about the in-kernel interface that is used for
sharing a buffer with hardware. Aside from the limited ioctls, anything
in the kernel can decide how it wants to export a dma_buf by
calling dma_buf_export()/dma_buf_fd(), which is roughly what the
new syscall does as well. Using dma_buf vs the proposed
implementation for this is not a big difference in complexity.

The one thing that a dma_buf does is that it allows devices to
do DMA on it. This is either something that can turn out to be
useful later, or it is not. From the description, it sounded like
the sharing might be useful, since we already have known use
cases in which "secret" data is exchanged with a trusted execution
environment using the dma-buf interface.
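
Roughly, the in-kernel export path looks like this (a sketch only; the
my_* names are illustrative and the mandatory dma_buf_ops callbacks are
elided):

	#include <linux/dma-buf.h>
	#include <linux/err.h>
	#include <linux/fcntl.h>

	static const struct dma_buf_ops my_dmabuf_ops = {
		/* mandatory: .map_dma_buf, .unmap_dma_buf, .release (elided) */
	};

	static int my_export_as_fd(void *priv, size_t size)
	{
		DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
		struct dma_buf *dmabuf;

		exp_info.ops   = &my_dmabuf_ops;
		exp_info.size  = size;
		exp_info.flags = O_RDWR;
		exp_info.priv  = priv;

		dmabuf = dma_buf_export(&exp_info);
		if (IS_ERR(dmabuf))
			return PTR_ERR(dmabuf);

		/* hand the buffer to userspace as a plain file descriptor */
		return dma_buf_fd(dmabuf, O_CLOEXEC);
	}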

If there is no way the data stored in this new secret memory area
would relate to secret data in a TEE or some other hardware
device, then I agree that dma-buf has no value.

> What we want is the ability to get an fd, set the properties and the
> size and mmap it.  This is pretty much a 100% overlap with the memfd
> API and not much overlap with the dmabuf one, which is why I don't
> think the interface is very well suited.

Does that mean you are suggesting to use additional flags on
memfd_create() instead of a new system call?

      Arnd
James Bottomley July 20, 2020, 7:16 p.m. UTC | #7
On Mon, 2020-07-20 at 20:08 +0200, Arnd Bergmann wrote:
> On Mon, Jul 20, 2020 at 5:52 PM James Bottomley <jejb@linux.ibm.com>
> wrote:
> > On Mon, 2020-07-20 at 13:30 +0200, Arnd Bergmann wrote:
> > 
> > I'll assume you mean the dmabuf userspace API?  Because the kernel
> > API is completely device exchange specific and wholly inappropriate
> > for this use case.
> > 
> > The user space API of dmabuf uses a pseudo-filesystem.  So you
> > mount the dmabuf file type (and by "you" I mean root because an
> > ordinary user doesn't have sufficient privilege).  This is
> > basically because every dmabuf is usable by any user who has
> > permissions.  This really isn't the initial interface we want for
> > secret memory because secret regions are supposed to be per process
> > and not shared (at least we don't want other tenants to see who's
> > using what).
> > 
> > Once you have the fd, you can seek to find the size, mmap, poll and
> > ioctl it.  The ioctls are all to do with memory synchronization (as
> > you'd expect from a device backed region) and the mmap is handled
> > by the dma_buf_ops, which is device specific.  Sizing is missing
> > because that's reported by the device not settable by the user.
> 
> I was mainly talking about the in-kernel interface that is used for
> sharing a buffer with hardware. Aside from the limited ioctls,
> anything in the kernel can decide on how it wants to export a dma_buf
> by calling dma_buf_export()/dma_buf_fd(), which is roughly what the
> new syscall does as well. Using dma_buf vs the proposed
> implementation for this is not a big difference in complexity.

I have thought about it, but haven't got much further:  We can't couple
to SGX without a huge break in the current simple userspace API (it
becomes complex because you'd have to enter the enclave each time you
want to use the memory, or put the whole process in the enclave, which
is a bit of a nightmare for simplicity), and we could only couple it to
SEV if the memory encryption engine responded to PCID as well as
ASID, which it doesn't.

> The one thing that a dma_buf does is that it allows devices to
> do DMA on it. This is either something that can turn out to be
> useful later, or it is not. From the description, it sounded like
> the sharing might be useful, since we already have known use
> cases in which "secret" data is exchanged with a trusted execution
> environment using the dma-buf interface.

The current use case for private keys is that you take an encrypted
file (which would be the DMA coupled part) and you decrypt the contents
into the secret memory.  There might possibly be a DMA component later
where an HSM-like device DMAs a key directly into your secret memory to
avoid exposure, but I wouldn't anticipate any need for anything beyond
the usual page cache API for that case (effectively this would behave
like an ordinary page cache page except that only the current process
would be able to touch the contents).
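
A rough sketch of that flow (decrypt_into(), KEY_SIZE, blob and the file
name are placeholders; error handling omitted):

	int sfd = secretmemfd(SECRETMEM_EXCLUSIVE);
	ftruncate(sfd, KEY_SIZE);
	unsigned char *key = mmap(NULL, KEY_SIZE, PROT_READ | PROT_WRITE,
				  MAP_SHARED, sfd, 0);

	/* the encrypted blob travels through the ordinary page cache */
	int efd = open("key.enc", O_RDONLY);
	read(efd, blob, sizeof(blob));

	/* decrypt_into() stands in for whatever crypto library is used;
	 * the plaintext lands directly in the secret mapping and never
	 * touches memory that is present in the kernel direct map */
	decrypt_into(key, KEY_SIZE, blob, sizeof(blob));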

> If there is no way the data stored in this new secret memory area
> would relate to secret data in a TEE or some other hardware
> device, then I agree that dma-buf has no value.

Never say never, but current TEE designs tend to require full
confidentiality for the entire execution.  What we're probing is
whether we can improve security by providing an API that requires less
than full confidentiality for the process.  I think if the API proves
useful then we will get HW support for it, but it likely won't be in the
form of today's TEEs.

> > What we want is the ability to get an fd, set the properties and
> > the size and mmap it.  This is pretty much a 100% overlap with the
> > memfd API and not much overlap with the dmabuf one, which is why I
> > don't think the interface is very well suited.
> 
> Does that mean you are suggesting to use additional flags on
> memfd_create() instead of a new system call?

Well, that was what the previous patch did.  I'm agnostic on the
mechanism for obtaining the fd: new syscall as this patch does or
extension to memfd like the old one did.  All I was saying is that once
you have the fd, the API you use on it is the same as the memfd API.

James
Arnd Bergmann July 20, 2020, 8:05 p.m. UTC | #8
On Mon, Jul 20, 2020 at 9:16 PM James Bottomley <jejb@linux.ibm.com> wrote:
> On Mon, 2020-07-20 at 20:08 +0200, Arnd Bergmann wrote:
> > On Mon, Jul 20, 2020 at 5:52 PM James Bottomley <jejb@linux.ibm.com>
> >
> > If there is no way the data stored in this new secret memory area
> > would relate to secret data in a TEE or some other hardware
> > device, then I agree that dma-buf has no value.
>
> Never say never, but current TEE designs tend to require full
> confidentiality for the entire execution.  What we're probing is
> whether we can improve security by doing an API that requires less than
> full confidentiality for the process.  I think if the API proves useful
> then we will get HW support for it, but it likely won't be in the
> current TEE of today form.

As I understand it, you normally have two kinds of buffers for the TEE:
one that may be allocated by Linux but is owned by the TEE itself
and not accessible by any process, and one that is used for
communication between the TEE and a user process.

The sharing would clearly work only for the second type: data that
a process wants to share with the TEE but as little else as possible.

A hypothetical example might be a process that passes encrypted
data to the TEE (which holds the key) for decryption, receives
decrypted data and then consumes that data in its own address
space. An electronic voting system (I know, evil example) might
receive encrypted ballots and sum them up this way without itself
having the secret key or anything else being able to observe
intermediate results.

> > > What we want is the ability to get an fd, set the properties and
> > > the size and mmap it.  This is pretty much a 100% overlap with the
> > > memfd API and not much overlap with the dmabuf one, which is why I
> > > don't think the interface is very well suited.
> >
> > Does that mean you are suggesting to use additional flags on
> > memfd_create() instead of a new system call?
>
> Well, that was what the previous patch did.  I'm agnostic on the
> mechanism for obtaining the fd: new syscall as this patch does or
> extension to memfd like the old one did.  All I was saying is that once
> you have the fd, the API you use on it is the same as the memfd API.

Ok.

I suppose we could even retrofit dma-buf underneath the
secretmemfd implementation if it ends up being useful later on,

      Arnd
Michael Kerrisk (man-pages) July 21, 2020, 10:59 a.m. UTC | #9
Hi Mike,

On Mon, 20 Jul 2020 at 11:26, Mike Rapoport <rppt@kernel.org> wrote:
>
> From: Mike Rapoport <rppt@linux.ibm.com>
>
> Introduce "secretmemfd" system call with the ability to create memory areas
> visible only in the context of the owning process and not mapped not only
> to other processes but in the kernel page tables as well.
>
> The user will create a file descriptor using the secretmemfd system call

Without wanting to start a bikeshed discussion, the more common
convention in recently added system calls is to use an underscore in
names that consist of multiple clearly distinct words. See many
examples in  https://man7.org/linux/man-pages/man2/syscalls.2.html.

Thus, I'd suggest at least secret_memfd().

Also, I wonder whether memfd_secret() might not be even better.
There's plenty of precedent for the naming style where related APIs
share a common prefix [1].

Thanks,

Michael

[1] Some examples:

       epoll_create(2)
       epoll_create1(2)
       epoll_ctl(2)
       epoll_pwait(2)
       epoll_wait(2)

       mq_getsetattr(2)
       mq_notify(2)
       mq_open(2)
       mq_timedreceive(2)
       mq_timedsend(2)
       mq_unlink(2)

       sched_get_affinity(2)
       sched_get_priority_max(2)
       sched_get_priority_min(2)
       sched_getaffinity(2)
       sched_getattr(2)
       sched_getparam(2)
       sched_getscheduler(2)
       sched_rr_get_interval(2)
       sched_set_affinity(2)
       sched_setaffinity(2)
       sched_setattr(2)
       sched_setparam(2)
       sched_setscheduler(2)
       sched_yield(2)

       timer_create(2)
       timer_delete(2)
       timer_getoverrun(2)
       timer_gettime(2)
       timer_settime(2)

       timerfd_create(2)
       timerfd_gettime(2)
       timerfd_settime(2)

Patch

diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index f3956fc11de6..35687dcb1a42 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -97,5 +97,6 @@ 
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define Z3FOLD_MAGIC		0x33
 #define PPC_CMM_MAGIC		0xc7571590
+#define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/include/uapi/linux/secretmem.h b/include/uapi/linux/secretmem.h
new file mode 100644
index 000000000000..cef7a59f7492
--- /dev/null
+++ b/include/uapi/linux/secretmem.h
@@ -0,0 +1,9 @@ 
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_SECRETMEM_H
+#define _UAPI_LINUX_SECRETMEM_H
+
+/* secretmem operation modes */
+#define SECRETMEM_EXCLUSIVE	0x1
+#define SECRETMEM_UNCACHED	0x2
+
+#endif /* _UAPI_LINUX_SECRETMEM_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index f2104cc0d35c..c5aa948214f9 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -872,4 +872,8 @@  config ARCH_HAS_HUGEPD
 config MAPPING_DIRTY_HELPERS
         bool
 
+config SECRETMEM
+        def_bool ARCH_HAS_SET_DIRECT_MAP
+	select GENERIC_ALLOCATOR
+
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 6e9d46b2efc9..c2aa7a393b73 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -121,3 +121,4 @@  obj-$(CONFIG_MEMFD_CREATE) += memfd.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
+obj-$(CONFIG_SECRETMEM) += secretmem.o
diff --git a/mm/secretmem.c b/mm/secretmem.c
new file mode 100644
index 000000000000..2f65219baf80
--- /dev/null
+++ b/mm/secretmem.c
@@ -0,0 +1,263 @@ 
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/mount.h>
+#include <linux/memfd.h>
+#include <linux/bitops.h>
+#include <linux/printk.h>
+#include <linux/pagemap.h>
+#include <linux/syscalls.h>
+#include <linux/pseudo_fs.h>
+#include <linux/set_memory.h>
+#include <linux/sched/signal.h>
+
+#include <uapi/linux/secretmem.h>
+#include <uapi/linux/magic.h>
+
+#include <asm/tlbflush.h>
+
+#include "internal.h"
+
+#undef pr_fmt
+#define pr_fmt(fmt) "secretmem: " fmt
+
+#define SECRETMEM_MODE_MASK	(SECRETMEM_EXCLUSIVE | SECRETMEM_UNCACHED)
+#define SECRETMEM_FLAGS_MASK	SECRETMEM_MODE_MASK
+
+struct secretmem_ctx {
+	unsigned int mode;
+};
+
+static struct page *secretmem_alloc_page(gfp_t gfp)
+{
+	/*
+	 * FIXME: use a cache of large pages to reduce the direct map
+	 * fragmentation
+	 */
+	return alloc_page(gfp);
+}
+
+static vm_fault_t secretmem_fault(struct vm_fault *vmf)
+{
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	pgoff_t offset = vmf->pgoff;
+	unsigned long addr;
+	struct page *page;
+	int ret = 0;
+
+	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
+		return vmf_error(-EINVAL);
+
+	page = find_get_entry(mapping, offset);
+	if (!page) {
+		page = secretmem_alloc_page(vmf->gfp_mask);
+		if (!page)
+			return vmf_error(-ENOMEM);
+
+		ret = add_to_page_cache(page, mapping, offset, vmf->gfp_mask);
+		if (unlikely(ret))
+			goto err_put_page;
+
+		ret = set_direct_map_invalid_noflush(page);
+		if (ret)
+			goto err_del_page_cache;
+
+		addr = (unsigned long)page_address(page);
+		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+
+		__SetPageUptodate(page);
+
+		ret = VM_FAULT_LOCKED;
+	}
+
+	vmf->page = page;
+	return ret;
+
+err_del_page_cache:
+	delete_from_page_cache(page);
+err_put_page:
+	put_page(page);
+	return vmf_error(ret);
+}
+
+static const struct vm_operations_struct secretmem_vm_ops = {
+	.fault = secretmem_fault,
+};
+
+static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct secretmem_ctx *ctx = file->private_data;
+	unsigned long mode = ctx->mode;
+	unsigned long len = vma->vm_end - vma->vm_start;
+
+	if (!mode)
+		return -EINVAL;
+
+	if (mlock_future_check(vma->vm_mm, vma->vm_flags | VM_LOCKED, len))
+		return -EAGAIN;
+
+	switch (mode) {
+	case SECRETMEM_UNCACHED:
+		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+		fallthrough;
+	case SECRETMEM_EXCLUSIVE:
+		vma->vm_ops = &secretmem_vm_ops;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	vma->vm_flags |= VM_LOCKED;
+
+	return 0;
+}
+
+const struct file_operations secretmem_fops = {
+	.mmap		= secretmem_mmap,
+};
+
+static bool secretmem_isolate_page(struct page *page, isolate_mode_t mode)
+{
+	return false;
+}
+
+static int secretmem_migratepage(struct address_space *mapping,
+				 struct page *newpage, struct page *page,
+				 enum migrate_mode mode)
+{
+	return -EBUSY;
+}
+
+static void secretmem_freepage(struct page *page)
+{
+	set_direct_map_default_noflush(page);
+}
+
+static const struct address_space_operations secretmem_aops = {
+	.freepage	= secretmem_freepage,
+	.migratepage	= secretmem_migratepage,
+	.isolate_page	= secretmem_isolate_page,
+};
+
+static struct vfsmount *secretmem_mnt;
+
+static struct file *secretmem_file_create(unsigned long flags)
+{
+	struct file *file = ERR_PTR(-ENOMEM);
+	struct secretmem_ctx *ctx;
+	struct inode *inode;
+
+	inode = alloc_anon_inode(secretmem_mnt->mnt_sb);
+	if (IS_ERR(inode))
+		return ERR_CAST(inode);
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		goto err_free_inode;
+
+	file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
+				 O_RDWR, &secretmem_fops);
+	if (IS_ERR(file))
+		goto err_free_ctx;
+
+	mapping_set_unevictable(inode->i_mapping);
+
+	inode->i_mapping->private_data = ctx;
+	inode->i_mapping->a_ops = &secretmem_aops;
+
+	/* pretend we are a normal file with zero size */
+	inode->i_mode |= S_IFREG;
+	inode->i_size = 0;
+
+	file->private_data = ctx;
+
+	ctx->mode = flags & SECRETMEM_MODE_MASK;
+
+	return file;
+
+err_free_ctx:
+	kfree(ctx);
+err_free_inode:
+	iput(inode);
+	return file;
+}
+
+SYSCALL_DEFINE1(secretmemfd, unsigned long, flags)
+{
+	struct file *file;
+	unsigned int mode;
+	int fd, err;
+
+	/* make sure local flags do not conflict with global fcntl.h */
+	BUILD_BUG_ON(SECRETMEM_FLAGS_MASK & O_CLOEXEC);
+
+	if (flags & ~(SECRETMEM_FLAGS_MASK | O_CLOEXEC))
+		return -EINVAL;
+
+	/* modes are mutually exclusive, only one mode bit should be set */
+	mode = flags & SECRETMEM_FLAGS_MASK;
+	if (ffs(mode) != fls(mode))
+		return -EINVAL;
+
+	fd = get_unused_fd_flags(flags & O_CLOEXEC);
+	if (fd < 0)
+		return fd;
+
+	file = secretmem_file_create(flags);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		goto err_put_fd;
+	}
+
+	file->f_flags |= O_LARGEFILE;
+
+	fd_install(fd, file);
+	return fd;
+
+err_put_fd:
+	put_unused_fd(fd);
+	return err;
+}
+
+static void secretmem_evict_inode(struct inode *inode)
+{
+	struct secretmem_ctx *ctx = inode->i_private;
+
+	truncate_inode_pages_final(&inode->i_data);
+	clear_inode(inode);
+	kfree(ctx);
+}
+
+static const struct super_operations secretmem_super_ops = {
+	.evict_inode = secretmem_evict_inode,
+};
+
+static int secretmem_init_fs_context(struct fs_context *fc)
+{
+	struct pseudo_fs_context *ctx = init_pseudo(fc, SECRETMEM_MAGIC);
+
+	if (!ctx)
+		return -ENOMEM;
+	ctx->ops = &secretmem_super_ops;
+
+	return 0;
+}
+
+static struct file_system_type secretmem_fs = {
+	.name		= "secretmem",
+	.init_fs_context = secretmem_init_fs_context,
+	.kill_sb	= kill_anon_super,
+};
+
+static int secretmem_init(void)
+{
+	int ret = 0;
+
+	secretmem_mnt = kern_mount(&secretmem_fs);
+	if (IS_ERR(secretmem_mnt))
+		ret = PTR_ERR(secretmem_mnt);
+
+	return ret;
+}
+fs_initcall(secretmem_init);