diff mbox series

[RFC] mm: extend memfd with ability to create "secret" memory areas

Message ID 20200130162340.GA14232@rapoport-lnx (mailing list archive)
State New, archived
Headers show
Series [RFC] mm: extend memfd with ability to create "secret" memory areas | expand

Commit Message

Mike Rapoport Jan. 30, 2020, 4:23 p.m. UTC
Hi,

This is essentially a resend of my attempt to implement "secret" mappings
using a file descriptor [1]. 

I've done a couple of experiments with secret/exclusive/whatever
memory backed by a file-descriptor using a chardev and memfd_create
syscall. There is indeed no need for VM_ flag, but there are still places
that would require special care, e.g vm_normal_page(), madvise(DO_FORK), so
it won't be completely free of core mm modifications.

Below is a POC that implements extension to memfd_create() that allows
mapping of a "secret" memory. The "secrecy" mode should be explicitly set
using ioctl(), for now I've implemented exclusive and uncached mappings.

The POC primarily indented to illustrate a possible userspace API for
fd-based secret memory. The idea is that user will create a file
descriptor using a system call. The user than has to use ioctl() to define
the desired mode of operation and only when the mode is set it is possible
to mmap() the memory. I.e something like

	fd = memfd_create("secret", MFD_SECRET);
	ioctl(fd, MFD_SECRET_UNCACHED);
	ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
		   fd, 0);


The ioctl() allows a lot of flexibility in how the secrecy should be
defined. It could be either a request for a particular protection (e.g.
exclusive, uncached) or something like "secrecy level" from "a bit more
secret than normally" to "do your best even at the expense of performance".
The POC implements the first option and the modes are mutually exclusive
for now, but there is no fundamental reason they cannot be mixed.

I've chosen memfd over a chardev as it seem to play more neatly with
anon_inodes and would allow simple (ab)use of the page cache for tracking
pages allocated for the "secret" mappings as well as using
address_space_operations for e.g. page migration callbacks.

The POC implementation uses set_memory/pageattr APIs to manipulate the
direct map and does not address the direct map fragmentation issue.

Of course this is something that must be addressed, as well as
modifications to core mm to required keep the secret memory secret, but I'd
really like to focus on the userspace ABI first.

[1] https://lore.kernel.org/lkml/1572171452-7958-1-git-send-email-rppt@kernel.org/
[1] https://lore.kernel.org/lkml/20191205153400.GA25575@rapoport-lnx/

From 5ca6fb6fc3e68d7b27ef04faa19bed4e2813f7f9 Mon Sep 17 00:00:00 2001
From: Mike Rapoport <rppt@linux.ibm.com>
Date: Mon, 18 Nov 2019 09:32:22 +0200
Subject: [PATCH] mm: extend memfd with ability to create "secret" memory areas

Extend memfd_create() system call with the ability to create memory areas
visible only in the context of the owning process and not mapped not only
to other processes but in the kernel page tables as well.

The user will create a file descriptor using the memfd_create system call.
The user than has to use ioctl() to define the desired protection mode for
the memory associated with that file descriptor and only when the mode is
set it is possible to mmap() the memory. For instance, the following
exapmple will create an uncached mapping (error handling is omitted):

        fd = memfd_create("secret", MFD_SECRET);
        ioctl(fd, MFD_SECRET_UNCACHED);
	ftruncate(fd. MAP_SIZE);
        ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
		   fd, 0);

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
---
 include/linux/memfd.h      |   9 ++
 include/uapi/linux/magic.h |   1 +
 include/uapi/linux/memfd.h |   6 +
 mm/Kconfig                 |   4 +
 mm/Makefile                |   1 +
 mm/memfd.c                 |  10 +-
 mm/secretmem.c             | 244 +++++++++++++++++++++++++++++++++++++
 7 files changed, 273 insertions(+), 2 deletions(-)
 create mode 100644 mm/secretmem.c

Comments

Dave Hansen Feb. 6, 2020, 6:51 p.m. UTC | #1
On 1/30/20 8:23 AM, Mike Rapoport wrote:
>  include/linux/memfd.h      |   9 ++
>  include/uapi/linux/magic.h |   1 +
>  include/uapi/linux/memfd.h |   6 +
>  mm/Kconfig                 |   4 +
>  mm/Makefile                |   1 +
>  mm/memfd.c                 |  10 +-
>  mm/secretmem.c             | 244 +++++++++++++++++++++++++++++++++++++
>  7 files changed, 273 insertions(+), 2 deletions(-)

It seems pretty self-contained and relatively harmless.

But, how much work is it going to be to tell the rest of the kernel that
page_to_virt() doesn't work any more?  Do we need to make kmap() work on
these?

I guess fixing vm_normal_page() would fix a lot of that.

In general, my concern about creating little self-contained memory types
is that they will get popular and folks will start wanting more features
from them.  For instance, what if I want NUMA affinity, migration, or
large page mappings that are secret?

Can these pages work as guest memory?

Who would the first users of this thing be?
Mike Rapoport Feb. 8, 2020, 5:39 p.m. UTC | #2
On Thu, Feb 06, 2020 at 10:51:13AM -0800, Dave Hansen wrote:
> On 1/30/20 8:23 AM, Mike Rapoport wrote:
> >  include/linux/memfd.h      |   9 ++
> >  include/uapi/linux/magic.h |   1 +
> >  include/uapi/linux/memfd.h |   6 +
> >  mm/Kconfig                 |   4 +
> >  mm/Makefile                |   1 +
> >  mm/memfd.c                 |  10 +-
> >  mm/secretmem.c             | 244 +++++++++++++++++++++++++++++++++++++
> >  7 files changed, 273 insertions(+), 2 deletions(-)
> 
> It seems pretty self-contained and relatively harmless.
> 
> But, how much work is it going to be to tell the rest of the kernel that
> page_to_virt() doesn't work any more?

Why page_to_virt() won't work anymore? Or you refer to that the kernel code
won't be able to access the page contents?

> Do we need to make kmap() work on these?

I don't think we need to make kmap() work on these. The idea is to prevent
kernel from accessing such memory areas.
 
> I guess fixing vm_normal_page() would fix a lot of that.
> 
> In general, my concern about creating little self-contained memory types
> is that they will get popular and folks will start wanting more features
> from them.  For instance, what if I want NUMA affinity, migration, or
> large page mappings that are secret?

Sure, why not :)
Well, this is true for any feature: it may become popular, people will
start using it and it will add more complexity.

My goal is to design this thing keeping in mind that all the above (and
probably more) will be requested sooner or later.
 
> Can these pages work as guest memory?

Actually, this is one of the driving usecases. I believe that people that
use mem=X to limit kernel control of the memory and the manage the
remaining memory for the guests can switch to fd-based approach.
 
> Who would the first users of this thing be?

We were thinking about using such areas to store small secrets, e.g. with
openssl_malloc().

Another usecase is the VM memory.
Reshetova, Elena Feb. 10, 2020, 8:06 a.m. UTC | #3
> On Thu, Feb 06, 2020 at 10:51:13AM -0800, Dave Hansen wrote:
> > On 1/30/20 8:23 AM, Mike Rapoport wrote:
> > >  include/linux/memfd.h      |   9 ++
> > >  include/uapi/linux/magic.h |   1 +
> > >  include/uapi/linux/memfd.h |   6 +
> > >  mm/Kconfig                 |   4 +
> > >  mm/Makefile                |   1 +
> > >  mm/memfd.c                 |  10 +-
> > >  mm/secretmem.c             | 244 +++++++++++++++++++++++++++++++++++++
> > >  7 files changed, 273 insertions(+), 2 deletions(-)
> >
> > It seems pretty self-contained and relatively harmless.
> >
> > But, how much work is it going to be to tell the rest of the kernel that
> > page_to_virt() doesn't work any more?
> 
> Why page_to_virt() won't work anymore? Or you refer to that the kernel code
> won't be able to access the page contents?
> 
> > Do we need to make kmap() work on these?
> 
> I don't think we need to make kmap() work on these. The idea is to prevent
> kernel from accessing such memory areas.
> 
> > I guess fixing vm_normal_page() would fix a lot of that.
> >
> > In general, my concern about creating little self-contained memory types
> > is that they will get popular and folks will start wanting more features
> > from them.  For instance, what if I want NUMA affinity, migration, or
> > large page mappings that are secret?
> 
> Sure, why not :)
> Well, this is true for any feature: it may become popular, people will
> start using it and it will add more complexity.
> 
> My goal is to design this thing keeping in mind that all the above (and
> probably more) will be requested sooner or later.
> 
> > Can these pages work as guest memory?
> 
> Actually, this is one of the driving usecases. I believe that people that
> use mem=X to limit kernel control of the memory and the manage the
> remaining memory for the guests can switch to fd-based approach.
> 
> > Who would the first users of this thing be?
> 
> We were thinking about using such areas to store small secrets, e.g. with
> openssl_malloc().
> 

To elaborate more on this - openssl has "secure heap" feature [1], which 
is basically a mmap area with MLOCK_ONFAULT and MADV_DONTDUMP.
It is optional feature and can be used for storing things like RSA private keys
in a bit more secure memory area (vs. just normal allocation). It is fully 
transparent for userspace applications (hidden behind openssl API), but 
provides additional security when enabled. So, it seems like a natural candidate
for smth like securememory, which in addition to MLOCK_ONFAULT and 
MADV_DONTDUMP can provide further security guarantees like exclusive 
memory and no-caching. 

[1] https://www.openssl.org/docs/manmaster/man3/OPENSSL_secure_malloc.html

Best Regards,
Elena.
Rick Edgecombe Feb. 11, 2020, 7:52 p.m. UTC | #4
On Sat, 2020-02-08 at 19:39 +0200, Mike Rapoport wrote:
> > Do we need to make kmap() work on these?
> 
> I don't think we need to make kmap() work on these. The idea is to prevent
> kernel from accessing such memory areas.

For the VM use case, KVM does kmap() guest memory when emulating page table
operations in some cases:
arch/x86/kvm/paging_tmpl.h:FNAME(cmpxchg_gpte).
Jonathan Corbet Feb. 12, 2020, 9:10 p.m. UTC | #5
On Thu, 30 Jan 2020 18:23:41 +0200
Mike Rapoport <rppt@kernel.org> wrote:

> Hi,
> 
> This is essentially a resend of my attempt to implement "secret" mappings
> using a file descriptor [1]. 

So one little thing I was curious about as I read through the patch...

> +static int secretmem_check_limits(struct vm_fault *vmf)
> +{
> +	struct secretmem_state *state = vmf->vma->vm_file->private_data;
> +	struct inode *inode = file_inode(vmf->vma->vm_file);
> +	unsigned long limit;
> +
> +	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
> +		return -EINVAL;
> +
> +	limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> +	if (state->nr_pages + 1 >= limit)
> +		return -EPERM;
> +
> +	return 0;
> +}

If I'm not mistaken, this means each memfd can be RLIMIT_MEMLOCK in length,
with no global limit on the number of locked pages.  What's keeping me from
creating 1000 of these things and locking down lots of RAM?

Thanks,

jon
Mike Rapoport Feb. 16, 2020, 6:46 a.m. UTC | #6
On Wed, Feb 12, 2020 at 02:10:29PM -0700, Jonathan Corbet wrote:
> On Thu, 30 Jan 2020 18:23:41 +0200
> Mike Rapoport <rppt@kernel.org> wrote:
> 
> > Hi,
> > 
> > This is essentially a resend of my attempt to implement "secret" mappings
> > using a file descriptor [1]. 
> 
> So one little thing I was curious about as I read through the patch...
> 
> > +static int secretmem_check_limits(struct vm_fault *vmf)
> > +{
> > +	struct secretmem_state *state = vmf->vma->vm_file->private_data;
> > +	struct inode *inode = file_inode(vmf->vma->vm_file);
> > +	unsigned long limit;
> > +
> > +	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
> > +		return -EINVAL;
> > +
> > +	limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > +	if (state->nr_pages + 1 >= limit)
> > +		return -EPERM;
> > +
> > +	return 0;
> > +}
> 
> If I'm not mistaken, this means each memfd can be RLIMIT_MEMLOCK in length,
> with no global limit on the number of locked pages.  What's keeping me from
> creating 1000 of these things and locking down lots of RAM?

Indeed, it's possible to lock down RLIMIT_MEMLOCK * RLIMIT_NOFILE of RAM
with this implementation, thanks for catching this.

I'll surely update the resource limiting once we've settle on the API
selection :)
 
> Thanks,
> 
> jon
>
Andy Lutomirski Aug. 14, 2020, 5:46 p.m. UTC | #7
On Thu, Jan 30, 2020 at 8:23 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> Hi,
>
> This is essentially a resend of my attempt to implement "secret" mappings
> using a file descriptor [1].
>
> I've done a couple of experiments with secret/exclusive/whatever
> memory backed by a file-descriptor using a chardev and memfd_create
> syscall. There is indeed no need for VM_ flag, but there are still places
> that would require special care, e.g vm_normal_page(), madvise(DO_FORK), so
> it won't be completely free of core mm modifications.
>
> Below is a POC that implements extension to memfd_create() that allows
> mapping of a "secret" memory. The "secrecy" mode should be explicitly set
> using ioctl(), for now I've implemented exclusive and uncached mappings.

Hi-

Sorry for the extremely delayed response.

I like the general concept, and I like the exclusive concept.  While
it is certainly annoying for the kernel to manage non-direct-mapped
pages, I think it's the future.  But I have serious concerns about the
uncached part.  Here are some concerns.

If it's done at all, I think it should be MFD_SECRET_X86_UNCACHED.  I
think that uncached memory is outside the scope of things that can
reasonably be considered to be architecture-neutral.  (For example, on
x86, UC and WC have very different semantics, and UC has quite
different properties than WB for things like atomics.  Also, the
performance of UC is interesting at best, and the ways to even
moderately efficiently read from UC memory or write to UC memory are
highly x86-specific.)

I'm a little unconvinced about the security benefits.  As far as I
know, UC memory will not end up in cache by any means (unless
aliased), but it's going to be tough to do much with UC data with
anything resembling reasonable performance without derived values
getting cached.  It's likely entirely impossible to do it reliably
without asm.  But even with plain WB memory, getting it into L1 really
should not be that bad unless major new vulnerabilities are
discovered.  And there are other approaches that could be more
arch-neutral and more performant.  For example, there could be an
option to flush a few cache lines on schedule out.  This way a task
could work on some (exclusive but WB) secret memory and have the cache
lines flushed if anything interrupts it.  Combined with turning SMT
off, this could offer comparable protection with much less overhead.

UC also doesn't seem reliable on x86, sadly.  From asking around,
there are at least a handful of scenarios under which the kernel can
ask the CPU for UC but get WB anyway.  Apparently Xen hypervisors will
do this unless the domain has privileged MMIO access, and ESXi will do
it under some set of common circumstances.  So unless we probe somehow
or have fancy enumeration or administrative configuration, I'm not
sure we can even get predictable behavior if we hand userspace a
supposedly UC mapping.  Giving user code WB when it thinks it has UC
could end badly.

--Andy
Dave Hansen Aug. 14, 2020, 6:09 p.m. UTC | #8
On 8/14/20 10:46 AM, Andy Lutomirski wrote:
> I'm a little unconvinced about the security benefits.  As far as I
> know, UC memory will not end up in cache by any means (unless
> aliased), but it's going to be tough to do much with UC data with
> anything resembling reasonable performance without derived values
> getting cached.

I think this is much more in the category of raising the bar than
providing any absolute security guarantees.

Let's say you have a secret and you read it into some registers and then
spill them on the stack.  You've got two cached copies, one for the
primary data and another for the stack copy.  Secret areas don't get rid
of the stack copy, but they do get rid of the other one.  One cache copy
is better than two.  Bar raised. :)

There are also some stronger protections, less in the bar-raising
category.  On x86 at least, uncached accesses also crush speculation.
You can't, for instance, speculatively get wrong values if you're not
speculating in the first place.  I was thinking of things like Load
Value Injection[1].

I _believe_ there are also things like AES-NI that can get strong
protection from stuff like this.  They load encryption keys into (AVX)
registers and then can do encrypt/decrypt operations without the keys
leaving the registers.  If the key was loaded from a secret memory area
right into the registers, I think the protection from cache attacks
would be pretty strong.


1.
https://software.intel.com/security-software-guidance/insights/deep-dive-load-value-injection
Andy Lutomirski Aug. 26, 2020, 4:54 p.m. UTC | #9
On Fri, Aug 14, 2020 at 11:09 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 8/14/20 10:46 AM, Andy Lutomirski wrote:
> > I'm a little unconvinced about the security benefits.  As far as I
> > know, UC memory will not end up in cache by any means (unless
> > aliased), but it's going to be tough to do much with UC data with
> > anything resembling reasonable performance without derived values
> > getting cached.
>
> I think this is much more in the category of raising the bar than
> providing any absolute security guarantees.

The problem here is that we're raising the bar in a way that is
weirdly architecture dependent, *extremely* nonperformant, and may not
even accomplish what it's trying to accomplish.

>
> Let's say you have a secret and you read it into some registers and then
> spill them on the stack.  You've got two cached copies, one for the
> primary data and another for the stack copy.  Secret areas don't get rid
> of the stack copy, but they do get rid of the other one.  One cache copy
> is better than two.  Bar raised. :)

If we have two bars right next to each other and we raise one of them,
did we really accomplish much?  I admit that having a secret in its
own dedicated cache line seems like an easier target than a secret in
a cache line that may be quickly overwritten by something else.  But
even user registers right now aren't specially protected -- pt_regs
lives is cached and probably has a predictable location, especially if
you execve() a setuid program.

>
> There are also some stronger protections, less in the bar-raising
> category.  On x86 at least, uncached accesses also crush speculation.
> You can't, for instance, speculatively get wrong values if you're not
> speculating in the first place.  I was thinking of things like Load
> Value Injection[1].

This seems genuinely useful, but it doesn't really address the fact
that requesting UC memory via PAT apparently has a good chance of
getting WB anyway.

>
> I _believe_ there are also things like AES-NI that can get strong
> protection from stuff like this.  They load encryption keys into (AVX)
> registers and then can do encrypt/decrypt operations without the keys
> leaving the registers.  If the key was loaded from a secret memory area
> right into the registers, I think the protection from cache attacks
> would be pretty strong.
>

Except for context switches :)
>
> 1.
> https://software.intel.com/security-software-guidance/insights/deep-dive-load-value-injection
Florian Weimer Aug. 26, 2020, 7:01 p.m. UTC | #10
* Andy Lutomirski:

>> I _believe_ there are also things like AES-NI that can get strong
>> protection from stuff like this.  They load encryption keys into (AVX)
>> registers and then can do encrypt/decrypt operations without the keys
>> leaving the registers.  If the key was loaded from a secret memory area
>> right into the registers, I think the protection from cache attacks
>> would be pretty strong.
>
> Except for context switches :)

An rseq sequence could request that the AVX registers should be
cleared on context switch.  (I'm mostly kidding.)

I think the main issue is that we do not have a good established
programming model to actually use such features and completely avoid
making copies of secret data.
diff mbox series

Patch

diff --git a/include/linux/memfd.h b/include/linux/memfd.h
index 4f1600413f91..d3ca7285f51a 100644
--- a/include/linux/memfd.h
+++ b/include/linux/memfd.h
@@ -13,4 +13,13 @@  static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a)
 }
 #endif
 
+#ifdef CONFIG_MEMFD_SECRETMEM
+extern struct file *secretmem_file_create(const char *name, unsigned int flags);
+#else
+static inline struct file *secretmem_file_create(const char *name, unsigned int flags)
+{
+	return ERR_PTR(-EINVAL);
+}
+#endif
+
 #endif /* __LINUX_MEMFD_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 3ac436376d79..c0104e6da894 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -95,5 +95,6 @@ 
 #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
 #define Z3FOLD_MAGIC		0x33
 #define PPC_CMM_MAGIC		0xc7571590
+#define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
index 7a8a26751c23..3320a79b638d 100644
--- a/include/uapi/linux/memfd.h
+++ b/include/uapi/linux/memfd.h
@@ -8,6 +8,12 @@ 
 #define MFD_CLOEXEC		0x0001U
 #define MFD_ALLOW_SEALING	0x0002U
 #define MFD_HUGETLB		0x0004U
+#define MFD_SECRET		0x0008U
+
+/* ioctls for secret memory */
+#define MFD_SECRET_IOCTL '-'
+#define MFD_SECRET_EXCLUSIVE	_IOW(MFD_SECRET_IOCTL, 0x13, unsigned long)
+#define MFD_SECRET_UNCACHED	_IOW(MFD_SECRET_IOCTL, 0x14, unsigned long)
 
 /*
  * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/mm/Kconfig b/mm/Kconfig
index ab80933be65f..2a8956d9048d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -739,4 +739,8 @@  config ARCH_HAS_HUGEPD
 config MAPPING_DIRTY_HELPERS
         bool
 
+config MEMFD_SECRETMEM
+        def_bool MEMFD_CREATE && ARCH_HAS_SET_DIRECT_MAP
+
+
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 1937cc251883..9399e823ccdb 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -108,3 +108,4 @@  obj-$(CONFIG_ZONE_DEVICE) += memremap.o
 obj-$(CONFIG_HMM_MIRROR) += hmm.o
 obj-$(CONFIG_MEMFD_CREATE) += memfd.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
+obj-$(CONFIG_MEMFD_SECRETMEM) += secretmem.o
diff --git a/mm/memfd.c b/mm/memfd.c
index 2647c898990c..3e1cc37e0389 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -245,7 +245,8 @@  long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
 #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
 #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
 
-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
+#define MFD_SECRET_MASK (MFD_CLOEXEC | MFD_SECRET)
+#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_SECRET)
 
 SYSCALL_DEFINE2(memfd_create,
 		const char __user *, uname,
@@ -257,6 +258,9 @@  SYSCALL_DEFINE2(memfd_create,
 	char *name;
 	long len;
 
+	if (flags & ~(unsigned int)MFD_SECRET_MASK)
+		return -EINVAL;
+
 	if (!(flags & MFD_HUGETLB)) {
 		if (flags & ~(unsigned int)MFD_ALL_FLAGS)
 			return -EINVAL;
@@ -296,7 +300,9 @@  SYSCALL_DEFINE2(memfd_create,
 		goto err_name;
 	}
 
-	if (flags & MFD_HUGETLB) {
+	if (flags & MFD_SECRET) {
+		file = secretmem_file_create(name, flags);
+	} else if (flags & MFD_HUGETLB) {
 		struct user_struct *user = NULL;
 
 		file = hugetlb_file_setup(name, 0, VM_NORESERVE, &user,
diff --git a/mm/secretmem.c b/mm/secretmem.c
new file mode 100644
index 000000000000..ac67a67aa29c
--- /dev/null
+++ b/mm/secretmem.c
@@ -0,0 +1,244 @@ 
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/mount.h>
+#include <linux/memfd.h>
+#include <linux/printk.h>
+#include <linux/pagemap.h>
+#include <linux/pseudo_fs.h>
+#include <linux/set_memory.h>
+#include <linux/sched/signal.h>
+
+#include <uapi/linux/memfd.h>
+#include <uapi/linux/magic.h>
+
+#include <asm/tlbflush.h>
+
+#define SECRETMEM_EXCLUSIVE	0x1
+#define SECRETMEM_UNCACHED	0x2
+
+struct secretmem_state {
+	unsigned int mode;
+	unsigned long nr_pages;
+};
+
+static struct page *secretmem_alloc_page(gfp_t gfp)
+{
+	/*
+	 * FIXME: use a cache of large pages to reduce the direct map
+	 * fragmentation
+	 */
+	return alloc_page(gfp);
+}
+
+static int secretmem_check_limits(struct vm_fault *vmf)
+{
+	struct secretmem_state *state = vmf->vma->vm_file->private_data;
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	unsigned long limit;
+
+	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
+		return -EINVAL;
+
+	limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	if (state->nr_pages + 1 >= limit)
+		return -EPERM;
+
+	return 0;
+}
+
+static vm_fault_t secretmem_fault(struct vm_fault *vmf)
+{
+	struct secretmem_state *state = vmf->vma->vm_file->private_data;
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	pgoff_t offset = vmf->pgoff;
+	unsigned long addr;
+	struct page *page;
+	int ret;
+
+	ret = secretmem_check_limits(vmf);
+	if (ret)
+		return vmf_error(ret);
+
+	page = find_get_entry(mapping, offset);
+	if (!page) {
+		page = secretmem_alloc_page(vmf->gfp_mask);
+		if (!page)
+			return vmf_error(-ENOMEM);
+
+		ret = add_to_page_cache_lru(page, mapping, offset, vmf->gfp_mask);
+		if (unlikely(ret)) {
+			put_page(page);
+			return vmf_error(ret);
+		}
+
+		ret = set_direct_map_invalid_noflush(page);
+		if (ret) {
+			delete_from_page_cache(page);
+			return vmf_error(ret);
+		}
+
+		addr = (unsigned long)page_address(page);
+		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+
+		__SetPageUptodate(page);
+
+		state->nr_pages++;
+		ret = VM_FAULT_LOCKED;
+	}
+
+	vmf->page = page;
+	return ret;
+}
+
+static const struct vm_operations_struct secretmem_vm_ops = {
+	.fault = secretmem_fault,
+};
+
+static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct secretmem_state *state = file->private_data;
+	unsigned long mode = state->mode;
+
+	if (!mode)
+		return -EINVAL;
+
+	switch (mode) {
+	case SECRETMEM_UNCACHED:
+		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+		/* fallthrough */
+	case SECRETMEM_EXCLUSIVE:
+		vma->vm_ops = &secretmem_vm_ops;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static long secretmem_ioctl(struct file *file, unsigned cmd, unsigned long arg)
+{
+	struct secretmem_state *state = file->private_data;
+	unsigned long mode = state->mode;
+
+	if (mode)
+		return -EINVAL;
+
+	switch (cmd) {
+	case MFD_SECRET_EXCLUSIVE:
+		mode = SECRETMEM_EXCLUSIVE;
+		break;
+	case MFD_SECRET_UNCACHED:
+		mode = SECRETMEM_UNCACHED;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	state->mode = mode;
+
+	return 0;
+}
+
+static int secretmem_release(struct inode *inode, struct file *file)
+{
+	struct secretmem_state *state = file->private_data;
+
+	kfree(state);
+
+	return 0;
+}
+
+const struct file_operations secretmem_fops = {
+	.release	= secretmem_release,
+	.mmap		= secretmem_mmap,
+	.unlocked_ioctl = secretmem_ioctl,
+	.compat_ioctl	= secretmem_ioctl,
+};
+
+static bool secretmem_isolate_page(struct page *page, isolate_mode_t mode)
+{
+	return false;
+}
+
+static int secretmem_migratepage(struct address_space *mapping,
+				 struct page *newpage, struct page *page,
+				 enum migrate_mode mode)
+{
+	return -EBUSY;
+}
+
+static void secretmem_freepage(struct page *page)
+{
+	set_direct_map_default_noflush(page);
+}
+
+static const struct address_space_operations secretmem_aops = {
+	.freepage	= secretmem_freepage,
+	.migratepage	= secretmem_migratepage,
+	.isolate_page	= secretmem_isolate_page,
+};
+
+static struct vfsmount *secretmem_mnt;
+
+struct file *secretmem_file_create(const char *name, unsigned int flags)
+{
+	struct inode *inode = alloc_anon_inode(secretmem_mnt->mnt_sb);
+	struct file *file = ERR_PTR(-ENOMEM);
+	struct secretmem_state *state;
+
+	if (IS_ERR(inode))
+		return ERR_CAST(inode);
+
+	state = kzalloc(sizeof(*state), GFP_KERNEL);
+	if (!state)
+		goto err_free_inode;
+
+	file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
+				 O_RDWR, &secretmem_fops);
+	if (IS_ERR(file))
+		goto err_free_state;
+
+	mapping_set_unevictable(inode->i_mapping);
+
+	inode->i_mapping->private_data = state;
+	inode->i_mapping->a_ops = &secretmem_aops;
+
+	/* pretend we are a normal file with zero size */
+	inode->i_mode |= S_IFREG;
+	inode->i_size = 0;
+
+	file->private_data = state;
+
+	return file;
+
+err_free_state:
+	kfree(state);
+err_free_inode:
+	iput(inode);
+	return file;
+}
+
+static int secretmem_init_fs_context(struct fs_context *fc)
+{
+	return init_pseudo(fc, SECRETMEM_MAGIC) ? 0 : -ENOMEM;
+}
+
+static struct file_system_type secretmem_fs = {
+	.name		= "secretmem",
+	.init_fs_context = secretmem_init_fs_context,
+	.kill_sb	= kill_anon_super,
+};
+
+static int secretmem_init(void)
+{
+	int ret = 0;
+
+	secretmem_mnt = kern_mount(&secretmem_fs);
+	if (IS_ERR(secretmem_mnt))
+		ret = PTR_ERR(secretmem_mnt);
+
+	return ret;
+}
+fs_initcall(secretmem_init);