diff mbox series

[v17,07/10] mm: introduce memfd_secret system call to create "secret" memory areas

Message ID 20210208084920.2884-8-rppt@kernel.org (mailing list archive)
State New, archived
Headers show
Series mm: introduce memfd_secret system call to create "secret" memory areas | expand

Commit Message

Mike Rapoport Feb. 8, 2021, 8:49 a.m. UTC
From: Mike Rapoport <rppt@linux.ibm.com>

Introduce "memfd_secret" system call with the ability to create memory
areas visible only in the context of the owning process and not mapped not
only to other processes but in the kernel page tables as well.

The secretmem feature is off by default and the user must explicitly enable
it at the boot time.

Once secretmem is enabled, the user will be able to create a file
descriptor using the memfd_secret() system call. The memory areas created
by mmap() calls from this file descriptor will be unmapped from the kernel
direct map and they will be only mapped in the page table of the owning mm.

The file descriptor based memory has several advantages over the
"traditional" mm interfaces, such as mlock(), mprotect(), madvise(). It
paves the way for VMMs to remove the secret memory range from the process;
there may be situations where sharing is useful and file descriptor based
approach allows to seal the operations.

As secret memory implementation is not an extension of tmpfs or hugetlbfs,
usage of a dedicated system call rather than hooking new functionality into
memfd_create(2) emphasises that memfd_secret(2) has different semantics and
allows better upwards compatibility.

The secret memory remains accessible in the process context using uaccess
primitives, but it is not exposed to the kernel otherwise; secret memory
areas are removed from the direct map and functions in the
follow_page()/get_user_page() family will refuse to return a page that
belongs to the secret memory area.

Once there will be a use case that will require exposing secretmem to the
kernel it will be an opt-in request in the system call flags so that user
would have to decide what data can be exposed to the kernel.

Removing of the pages from the direct map may cause its fragmentation on
architectures that use large pages to map the physical memory which affects
the system performance. However, the original Kconfig text for
CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can
improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
("x86: add gbpages switches")) and the recent report [1] showed that "...
although 1G mappings are a good default choice, there is no compelling
evidence that it must be the only choice". Hence, it is sufficient to have
secretmem disabled by default with the ability of a system administrator to
enable it at boot time.

The secretmem mappings are locked in memory so they cannot exceed
RLIMIT_MEMLOCK. Since these mappings are already locked an attempt to
mlock() secretmem range would fail and mlockall() will ignore secretmem
mappings.

Pages in the secretmem regions are unevictable and unmovable to avoid
accidental exposure of the sensitive data via swap or during page
migration.

A page that was a part of the secret memory area is cleared when it is
freed to ensure the data is not exposed to the next user of that page.

The following example demonstrates creation of a secret mapping (error
handling is omitted):

	fd = memfd_secret(0);
	ftruncate(fd, MAP_SIZE);
	ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);

[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
---
 include/linux/secretmem.h  |  24 ++++
 include/uapi/linux/magic.h |   1 +
 kernel/sys_ni.c            |   2 +
 mm/Kconfig                 |   3 +
 mm/Makefile                |   1 +
 mm/gup.c                   |  10 ++
 mm/mlock.c                 |   3 +-
 mm/secretmem.c             | 246 +++++++++++++++++++++++++++++++++++++
 8 files changed, 289 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/secretmem.h
 create mode 100644 mm/secretmem.c

Comments

Michal Hocko Feb. 8, 2021, 10:49 a.m. UTC | #1
On Mon 08-02-21 10:49:17, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> Introduce "memfd_secret" system call with the ability to create memory
> areas visible only in the context of the owning process and not mapped not
> only to other processes but in the kernel page tables as well.
> 
> The secretmem feature is off by default and the user must explicitly enable
> it at the boot time.
> 
> Once secretmem is enabled, the user will be able to create a file
> descriptor using the memfd_secret() system call. The memory areas created
> by mmap() calls from this file descriptor will be unmapped from the kernel
> direct map and they will be only mapped in the page table of the owning mm.

Is this really true? I guess you meant to say that the memory will
visible only via page tables to anybody who can mmap the respective file
descriptor. There is nothing like an owning mm as the fd is inherently a
shareable resource and the ownership becomes a very vague and hard to
define term.

> The file descriptor based memory has several advantages over the
> "traditional" mm interfaces, such as mlock(), mprotect(), madvise(). It
> paves the way for VMMs to remove the secret memory range from the process;

I do not understand how it helps to remove the memory from the process
as the interface explicitly allows to add a memory that is removed from
all other processes via direct map.

> there may be situations where sharing is useful and file descriptor based
> approach allows to seal the operations.

It would be great to expand on this some more.

> As secret memory implementation is not an extension of tmpfs or hugetlbfs,
> usage of a dedicated system call rather than hooking new functionality into
> memfd_create(2) emphasises that memfd_secret(2) has different semantics and
> allows better upwards compatibility.

What is this supposed to mean? What are differences?

> The secret memory remains accessible in the process context using uaccess
> primitives, but it is not exposed to the kernel otherwise; secret memory
> areas are removed from the direct map and functions in the
> follow_page()/get_user_page() family will refuse to return a page that
> belongs to the secret memory area.
> 
> Once there will be a use case that will require exposing secretmem to the
> kernel it will be an opt-in request in the system call flags so that user
> would have to decide what data can be exposed to the kernel.
>
> Removing of the pages from the direct map may cause its fragmentation on
> architectures that use large pages to map the physical memory which affects
> the system performance. However, the original Kconfig text for
> CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can
> improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
> ("x86: add gbpages switches")) and the recent report [1] showed that "...
> although 1G mappings are a good default choice, there is no compelling
> evidence that it must be the only choice". Hence, it is sufficient to have
> secretmem disabled by default with the ability of a system administrator to
> enable it at boot time.

OK, this looks like a reasonable compromise for the initial
implementation. Documentation of the command line parameter should be
very explicit about this though.

> The secretmem mappings are locked in memory so they cannot exceed
> RLIMIT_MEMLOCK. Since these mappings are already locked an attempt to
> mlock() secretmem range would fail and mlockall() will ignore secretmem
> mappings.

What about munlock?

> Pages in the secretmem regions are unevictable and unmovable to avoid
> accidental exposure of the sensitive data via swap or during page
> migration.
> 
> A page that was a part of the secret memory area is cleared when it is
> freed to ensure the data is not exposed to the next user of that page.
> 
> The following example demonstrates creation of a secret mapping (error
> handling is omitted):
> 
> 	fd = memfd_secret(0);
> 	ftruncate(fd, MAP_SIZE);
> 	ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
> 		   MAP_SHARED, fd, 0);

Please also list usecases which you are aware of as well.

I am also missing some more information about the implementation. E.g.
does this memory live on an unevictable LRU and therefore participates
into stats. What about memcg accounting. What is the cross fork (CoW)/exec
behavior. How is the memory reflected in OOM situation? Is a shared
mapping enforced?

Anyway, thanks for improving the changelog. This is definitely much more
informative.

> [1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

I have only glanced through the implementation and it looks sane. I will
have a closer look later but this should be pretty simple with the
proposed semantic.
Mike Rapoport Feb. 8, 2021, 9:26 p.m. UTC | #2
On Mon, Feb 08, 2021 at 11:49:22AM +0100, Michal Hocko wrote:
> On Mon 08-02-21 10:49:17, Mike Rapoport wrote:
> > From: Mike Rapoport <rppt@linux.ibm.com>
> > 
> > Introduce "memfd_secret" system call with the ability to create memory
> > areas visible only in the context of the owning process and not mapped not
> > only to other processes but in the kernel page tables as well.
> > 
> > The secretmem feature is off by default and the user must explicitly enable
> > it at the boot time.
> > 
> > Once secretmem is enabled, the user will be able to create a file
> > descriptor using the memfd_secret() system call. The memory areas created
> > by mmap() calls from this file descriptor will be unmapped from the kernel
> > direct map and they will be only mapped in the page table of the owning mm.
> 
> Is this really true? I guess you meant to say that the memory will
> visible only via page tables to anybody who can mmap the respective file
> descriptor. There is nothing like an owning mm as the fd is inherently a
> shareable resource and the ownership becomes a very vague and hard to
> define term.

Hmm, it seems I've been dragging this paragraph from the very first
mmap(MAP_EXCLUSIVE) rfc and nobody (including myself) noticed the
inconsistency.
 
> > The file descriptor based memory has several advantages over the
> > "traditional" mm interfaces, such as mlock(), mprotect(), madvise(). It
> > paves the way for VMMs to remove the secret memory range from the process;
> 
> I do not understand how it helps to remove the memory from the process
> as the interface explicitly allows to add a memory that is removed from
> all other processes via direct map.

The current implementation does not help to remove the memory from the
process, but using fd-backed memory seems a better interface to remove
guest memory from host mappings than mmap. As Andy nicely put it:

"Getting fd-backed memory into a guest will take some possibly major work in
the kernel, but getting vma-backed memory into a guest without mapping it
in the host user address space seems much, much worse."
 
> > As secret memory implementation is not an extension of tmpfs or hugetlbfs,
> > usage of a dedicated system call rather than hooking new functionality into
> > memfd_create(2) emphasises that memfd_secret(2) has different semantics and
> > allows better upwards compatibility.
> 
> What is this supposed to mean? What are differences?

Well, the phrasing could be better indeed. That supposed to mean that
they differ in the semantics behind the file descriptor: memfd_create
implements sealing for shmem and hugetlbfs while memfd_secret implements
memory hidden from the kernel.
 
> > The secretmem mappings are locked in memory so they cannot exceed
> > RLIMIT_MEMLOCK. Since these mappings are already locked an attempt to
> > mlock() secretmem range would fail and mlockall() will ignore secretmem
> > mappings.
> 
> What about munlock?

Isn't this implied? ;-)
I'll add a sentence about it.
Michal Hocko Feb. 9, 2021, 8:47 a.m. UTC | #3
On Mon 08-02-21 23:26:05, Mike Rapoport wrote:
> On Mon, Feb 08, 2021 at 11:49:22AM +0100, Michal Hocko wrote:
> > On Mon 08-02-21 10:49:17, Mike Rapoport wrote:
[...]
> > > The file descriptor based memory has several advantages over the
> > > "traditional" mm interfaces, such as mlock(), mprotect(), madvise(). It
> > > paves the way for VMMs to remove the secret memory range from the process;
> > 
> > I do not understand how it helps to remove the memory from the process
> > as the interface explicitly allows to add a memory that is removed from
> > all other processes via direct map.
> 
> The current implementation does not help to remove the memory from the
> process, but using fd-backed memory seems a better interface to remove
> guest memory from host mappings than mmap. As Andy nicely put it:
> 
> "Getting fd-backed memory into a guest will take some possibly major work in
> the kernel, but getting vma-backed memory into a guest without mapping it
> in the host user address space seems much, much worse."

OK, so IIUC this means that the model is to hand over memory from host
to guest. I thought the guest would be under control of its address
space and therefore it operates on the VMAs. This would benefit from
an additional and more specific clarification.

> > > As secret memory implementation is not an extension of tmpfs or hugetlbfs,
> > > usage of a dedicated system call rather than hooking new functionality into
> > > memfd_create(2) emphasises that memfd_secret(2) has different semantics and
> > > allows better upwards compatibility.
> > 
> > What is this supposed to mean? What are differences?
> 
> Well, the phrasing could be better indeed. That supposed to mean that
> they differ in the semantics behind the file descriptor: memfd_create
> implements sealing for shmem and hugetlbfs while memfd_secret implements
> memory hidden from the kernel.

Right but why memfd_create model is not sufficient for the usecase?
Please note that I am arguing against. To be honest I do not really care
much. Using an existing scheme is usually preferable from my POV but
there might be real reasons why shmem as a backing "storage" is not
appropriate.
  
> > > The secretmem mappings are locked in memory so they cannot exceed
> > > RLIMIT_MEMLOCK. Since these mappings are already locked an attempt to
> > > mlock() secretmem range would fail and mlockall() will ignore secretmem
> > > mappings.
> > 
> > What about munlock?
> 
> Isn't this implied? ;-)

My bad here. I thought that munlock fails on vmas which are not mlocked
and I was curious about the behavior when mlockall() is followed by
munlock. But I do not see this being the case. So this should be ok.
Mike Rapoport Feb. 9, 2021, 9:09 a.m. UTC | #4
On Tue, Feb 09, 2021 at 09:47:08AM +0100, Michal Hocko wrote:
> On Mon 08-02-21 23:26:05, Mike Rapoport wrote:
> > On Mon, Feb 08, 2021 at 11:49:22AM +0100, Michal Hocko wrote:
> > > On Mon 08-02-21 10:49:17, Mike Rapoport wrote:
> [...]
> > > > The file descriptor based memory has several advantages over the
> > > > "traditional" mm interfaces, such as mlock(), mprotect(), madvise(). It
> > > > paves the way for VMMs to remove the secret memory range from the process;
> > > 
> > > I do not understand how it helps to remove the memory from the process
> > > as the interface explicitly allows to add a memory that is removed from
> > > all other processes via direct map.
> > 
> > The current implementation does not help to remove the memory from the
> > process, but using fd-backed memory seems a better interface to remove
> > guest memory from host mappings than mmap. As Andy nicely put it:
> > 
> > "Getting fd-backed memory into a guest will take some possibly major work in
> > the kernel, but getting vma-backed memory into a guest without mapping it
> > in the host user address space seems much, much worse."
> 
> OK, so IIUC this means that the model is to hand over memory from host
> to guest. I thought the guest would be under control of its address
> space and therefore it operates on the VMAs. This would benefit from
> an additional and more specific clarification.

How guest would operate on VMAs if the interface between host and guest is
virtual hardware?

If you mean qemu (or any other userspace part of VMM that uses KVM), so one
of the points Andy mentioned back than is to remove mappings of the guest
memory from the qemu process.
 
> > > > As secret memory implementation is not an extension of tmpfs or hugetlbfs,
> > > > usage of a dedicated system call rather than hooking new functionality into
> > > > memfd_create(2) emphasises that memfd_secret(2) has different semantics and
> > > > allows better upwards compatibility.
> > > 
> > > What is this supposed to mean? What are differences?
> > 
> > Well, the phrasing could be better indeed. That supposed to mean that
> > they differ in the semantics behind the file descriptor: memfd_create
> > implements sealing for shmem and hugetlbfs while memfd_secret implements
> > memory hidden from the kernel.
> 
> Right but why memfd_create model is not sufficient for the usecase?
> Please note that I am arguing against. To be honest I do not really care
> much. Using an existing scheme is usually preferable from my POV but
> there might be real reasons why shmem as a backing "storage" is not
> appropriate.
   
Citing my older email:

    I've hesitated whether to continue to use new flags to memfd_create() or to
    add a new system call and I've decided to use a new system call after I've
    started to look into man pages update. There would have been two completely
    independent descriptions and I think it would have been very confusing.
Michal Hocko Feb. 9, 2021, 1:17 p.m. UTC | #5
On Tue 09-02-21 11:09:38, Mike Rapoport wrote:
> On Tue, Feb 09, 2021 at 09:47:08AM +0100, Michal Hocko wrote:
> > On Mon 08-02-21 23:26:05, Mike Rapoport wrote:
> > > On Mon, Feb 08, 2021 at 11:49:22AM +0100, Michal Hocko wrote:
> > > > On Mon 08-02-21 10:49:17, Mike Rapoport wrote:
> > [...]
> > > > > The file descriptor based memory has several advantages over the
> > > > > "traditional" mm interfaces, such as mlock(), mprotect(), madvise(). It
> > > > > paves the way for VMMs to remove the secret memory range from the process;
> > > > 
> > > > I do not understand how it helps to remove the memory from the process
> > > > as the interface explicitly allows to add a memory that is removed from
> > > > all other processes via direct map.
> > > 
> > > The current implementation does not help to remove the memory from the
> > > process, but using fd-backed memory seems a better interface to remove
> > > guest memory from host mappings than mmap. As Andy nicely put it:
> > > 
> > > "Getting fd-backed memory into a guest will take some possibly major work in
> > > the kernel, but getting vma-backed memory into a guest without mapping it
> > > in the host user address space seems much, much worse."
> > 
> > OK, so IIUC this means that the model is to hand over memory from host
> > to guest. I thought the guest would be under control of its address
> > space and therefore it operates on the VMAs. This would benefit from
> > an additional and more specific clarification.
> 
> How guest would operate on VMAs if the interface between host and guest is
> virtual hardware?

I have to say that I am not really familiar with this area so my view
might be misleading or completely wrong. I thought that the HW address
ranges are mapped to the guest process and therefore have a VMA.

> If you mean qemu (or any other userspace part of VMM that uses KVM), so one
> of the points Andy mentioned back than is to remove mappings of the guest
> memory from the qemu process.
>  
> > > > > As secret memory implementation is not an extension of tmpfs or hugetlbfs,
> > > > > usage of a dedicated system call rather than hooking new functionality into
> > > > > memfd_create(2) emphasises that memfd_secret(2) has different semantics and
> > > > > allows better upwards compatibility.
> > > > 
> > > > What is this supposed to mean? What are differences?
> > > 
> > > Well, the phrasing could be better indeed. That supposed to mean that
> > > they differ in the semantics behind the file descriptor: memfd_create
> > > implements sealing for shmem and hugetlbfs while memfd_secret implements
> > > memory hidden from the kernel.
> > 
> > Right but why memfd_create model is not sufficient for the usecase?
> > Please note that I am arguing against. To be honest I do not really care
> > much. Using an existing scheme is usually preferable from my POV but
> > there might be real reasons why shmem as a backing "storage" is not
> > appropriate.
>    
> Citing my older email:
> 
>     I've hesitated whether to continue to use new flags to memfd_create() or to
>     add a new system call and I've decided to use a new system call after I've
>     started to look into man pages update. There would have been two completely
>     independent descriptions and I think it would have been very confusing.

Could you elaborate? Unmapping from the kernel address space can work
both for sealed or hugetlb memfds, no? Those features are completely
orthogonal AFAICS. With a dedicated syscall you will need to introduce
this functionality on top if that is required. Have you considered that?
I mean hugetlb pages are used to back guest memory very often. Is this
something that will be a secret memory usecase?

Please be really specific when giving arguments to back a new syscall
decision.
Mike Rapoport Feb. 11, 2021, 7:13 a.m. UTC | #6
On Tue, Feb 09, 2021 at 02:17:11PM +0100, Michal Hocko wrote:
> On Tue 09-02-21 11:09:38, Mike Rapoport wrote:
> > On Tue, Feb 09, 2021 at 09:47:08AM +0100, Michal Hocko wrote:
> > > 
> > > OK, so IIUC this means that the model is to hand over memory from host
> > > to guest. I thought the guest would be under control of its address
> > > space and therefore it operates on the VMAs. This would benefit from
> > > an additional and more specific clarification.
> > 
> > How guest would operate on VMAs if the interface between host and guest is
> > virtual hardware?
> 
> I have to say that I am not really familiar with this area so my view
> might be misleading or completely wrong. I thought that the HW address
> ranges are mapped to the guest process and therefore have a VMA.

There is a qemu process that currently has mappings of what guest sees as
its physical memory, but qemu is a part of hypervisor, i.e. host.
 
> > Citing my older email:
> > 
> >     I've hesitated whether to continue to use new flags to memfd_create() or to
> >     add a new system call and I've decided to use a new system call after I've
> >     started to look into man pages update. There would have been two completely
> >     independent descriptions and I think it would have been very confusing.
> 
> Could you elaborate? Unmapping from the kernel address space can work
> both for sealed or hugetlb memfds, no? Those features are completely
> orthogonal AFAICS. With a dedicated syscall you will need to introduce
> this functionality on top if that is required. Have you considered that?
> I mean hugetlb pages are used to back guest memory very often. Is this
> something that will be a secret memory usecase?
> 
> Please be really specific when giving arguments to back a new syscall
> decision.

Isn't "syscalls have completely independent description" specific enough?

We are talking about API here, not the implementation details whether
secretmem supports large pages or not.

The purpose of memfd_create() is to create a file-like access to memory.
The purpose of memfd_secret() is to create a way to access memory hidden
from the kernel.

I don't think overloading memfd_create() with the secretmem flags because
they happen to return a file descriptor will be better for users, but
rather will be more confusing.
Michal Hocko Feb. 11, 2021, 8:39 a.m. UTC | #7
On Thu 11-02-21 09:13:19, Mike Rapoport wrote:
> On Tue, Feb 09, 2021 at 02:17:11PM +0100, Michal Hocko wrote:
> > On Tue 09-02-21 11:09:38, Mike Rapoport wrote:
[...]
> > > Citing my older email:
> > > 
> > >     I've hesitated whether to continue to use new flags to memfd_create() or to
> > >     add a new system call and I've decided to use a new system call after I've
> > >     started to look into man pages update. There would have been two completely
> > >     independent descriptions and I think it would have been very confusing.
> > 
> > Could you elaborate? Unmapping from the kernel address space can work
> > both for sealed or hugetlb memfds, no? Those features are completely
> > orthogonal AFAICS. With a dedicated syscall you will need to introduce
> > this functionality on top if that is required. Have you considered that?
> > I mean hugetlb pages are used to back guest memory very often. Is this
> > something that will be a secret memory usecase?
> > 
> > Please be really specific when giving arguments to back a new syscall
> > decision.
> 
> Isn't "syscalls have completely independent description" specific enough?

No, it's not as you can see from questions I've had above. More on that
below.

> We are talking about API here, not the implementation details whether
> secretmem supports large pages or not.
> 
> The purpose of memfd_create() is to create a file-like access to memory.
> The purpose of memfd_secret() is to create a way to access memory hidden
> from the kernel.
> 
> I don't think overloading memfd_create() with the secretmem flags because
> they happen to return a file descriptor will be better for users, but
> rather will be more confusing.

This is quite a subjective conclusion. I could very well argue that it
would be much better to have a single syscall to get a fd backed memory
with spedific requirements (sealing, unmapping from the kernel address
space). Neither of us would be clearly right or wrong. A more important
point is a future extensibility and usability, though. So let's just
think of few usecases I have outlined above. Is it unrealistic to expect
that secret memory should be sealable? What about hugetlb? Because if
the answer is no then a new API is a clear win as the combination of
flags would never work and then we would just suffer from the syscall 
multiplexing without much gain. On the other hand if combination of the
functionality is to be expected then you will have to jam it into
memfd_create and copy the interface likely causing more confusion. See
what I mean?

I by no means do not insist one way or the other but from what I have
seen so far I have a feeling that the interface hasn't been thought
through enough. Sure you have landed with fd based approach and that
seems fair. But how to get that fd seems to still have some gaps IMHO.
David Hildenbrand Feb. 11, 2021, 9:01 a.m. UTC | #8
On 11.02.21 09:39, Michal Hocko wrote:
> On Thu 11-02-21 09:13:19, Mike Rapoport wrote:
>> On Tue, Feb 09, 2021 at 02:17:11PM +0100, Michal Hocko wrote:
>>> On Tue 09-02-21 11:09:38, Mike Rapoport wrote:
> [...]
>>>> Citing my older email:
>>>>
>>>>      I've hesitated whether to continue to use new flags to memfd_create() or to
>>>>      add a new system call and I've decided to use a new system call after I've
>>>>      started to look into man pages update. There would have been two completely
>>>>      independent descriptions and I think it would have been very confusing.
>>>
>>> Could you elaborate? Unmapping from the kernel address space can work
>>> both for sealed or hugetlb memfds, no? Those features are completely
>>> orthogonal AFAICS. With a dedicated syscall you will need to introduce
>>> this functionality on top if that is required. Have you considered that?
>>> I mean hugetlb pages are used to back guest memory very often. Is this
>>> something that will be a secret memory usecase?
>>>
>>> Please be really specific when giving arguments to back a new syscall
>>> decision.
>>
>> Isn't "syscalls have completely independent description" specific enough?
> 
> No, it's not as you can see from questions I've had above. More on that
> below.
> 
>> We are talking about API here, not the implementation details whether
>> secretmem supports large pages or not.
>>
>> The purpose of memfd_create() is to create a file-like access to memory.
>> The purpose of memfd_secret() is to create a way to access memory hidden
>> from the kernel.
>>
>> I don't think overloading memfd_create() with the secretmem flags because
>> they happen to return a file descriptor will be better for users, but
>> rather will be more confusing.
> 
> This is quite a subjective conclusion. I could very well argue that it
> would be much better to have a single syscall to get a fd backed memory
> with spedific requirements (sealing, unmapping from the kernel address
> space). Neither of us would be clearly right or wrong. A more important
> point is a future extensibility and usability, though. So let's just
> think of few usecases I have outlined above. Is it unrealistic to expect
> that secret memory should be sealable? What about hugetlb? Because if
> the answer is no then a new API is a clear win as the combination of
> flags would never work and then we would just suffer from the syscall
> multiplexing without much gain. On the other hand if combination of the
> functionality is to be expected then you will have to jam it into
> memfd_create and copy the interface likely causing more confusion. See
> what I mean?
> 
> I by no means do not insist one way or the other but from what I have
> seen so far I have a feeling that the interface hasn't been thought
> through enough. Sure you have landed with fd based approach and that
> seems fair. But how to get that fd seems to still have some gaps IMHO.
> 

I agree with Michal. This has been raised by different
people already, including on LWN (https://lwn.net/Articles/835342/).

I can follow Mike's reasoning (man page), and I am also fine if there is
a valid reason. However, IMHO the basic description seems to match quite good:

        memfd_create() creates an anonymous file and returns a file descriptor that refers to it.  The
        file behaves like a regular file, and so can be modified, truncated, memory-mapped, and so on.
        However,  unlike a regular file, it lives in RAM and has a volatile backing storage.  Once all
        references to the file are dropped, it is automatically released.  Anonymous  memory  is  used
        for  all  backing pages of the file.  Therefore, files created by memfd_create() have the same
        semantics as other anonymous memory allocations such as those allocated using mmap(2) with the
        MAP_ANONYMOUS flag.

AFAIKS, we would need MFD_SECRET and disallow
MFD_ALLOW_SEALING and MFD_HUGETLB.

In addition, we could add MFD_SECRET_NEVER_MAP, which could disallow any kind of
temporary mappings (eor migration). TBC.

---

Some random thoughts regarding files.

What is the page size of secretmem memory? Sometimes we use huge pages,
sometimes we fallback to 4k pages. So I assume huge pages in general?

What are semantics of MADV()/FALLOCATE() etc on such files?
I assume PUNCH_HOLE fails in a nice way? does it work?

Does mremap()/mremap(FIXED) work/is it blocked?

Does mprotect() fail in a nice way?

Is userfaultfd() properly fenced? Or does it even work (doubt)?

How does it behave if I mmap(FIXED) something in between?
In which granularity can I do that (->page-size?)?

What are other granularity restrictions (->page size)?

Don't want to open a big discussion here, just some random thoughts.
Maybe it has all been already figured out and most of the answers
above are "Fails with -EINVAL".
Michal Hocko Feb. 11, 2021, 9:38 a.m. UTC | #9
On Thu 11-02-21 10:01:32, David Hildenbrand wrote:
[...]
> AFAIKS, we would need MFD_SECRET and disallow
> MFD_ALLOW_SEALING and MFD_HUGETLB.

Yes for an initial version. But I do expect a request to support both
features is just a matter of time.

> In addition, we could add MFD_SECRET_NEVER_MAP, which could disallow any kind of
> temporary mappings (eor migration). TBC.

I believe this is the mode Mike wants to have by default. A more relax
one would be an opt-in. MFD_SECRET_RELAXED which would allow temporal
mappings in the kernel for content copying (e.g. for migration).

> ---
> 
> Some random thoughts regarding files.
> 
> What is the page size of secretmem memory? Sometimes we use huge pages,
> sometimes we fallback to 4k pages. So I assume huge pages in general?

Unless there is an explicit request for hugetlb I would say the page
size is not really important like for any other fds. Huge pages can be
used transparently.
 
> What are semantics of MADV()/FALLOCATE() etc on such files?

I would expect the same semantic as regular shmem (memfd_create) except
the memory doesn't have _any_ backing storage which makes it
unevictable. So the reclaim related madv won't work but there shouldn't
be any real reason why e.g. MADV_DONTNEED, WILLNEED, DONT_FORK and
others don't work.

> I assume PUNCH_HOLE fails in a nice way? does it work?
> Does mremap()/mremap(FIXED) work/is it blocked?
> Does mprotect() fail in a nice way?

I do not see a reason why those shouldn't work.
 
> Is userfaultfd() properly fenced? Or does it even work (doubt)?
> 
> How does it behave if I mmap(FIXED) something in between?
> In which granularity can I do that (->page-size?)?

Again, nothing really exceptional here. This is a mapping like any
other from address space manipulation POV.

> What are other granularity restrictions (->page size)?
> 
> Don't want to open a big discussion here, just some random thoughts.
> Maybe it has all been already figured out and most of the answers
> above are "Fails with -EINVAL".

I think that the behavior should be really in sync with shmem semantic
as much as possible. Most operations should simply work with an
aditional direct map manipulation. There is no real reason to be
special. Some functionality might be missing, e.g. hugetlb support but
that has been traditionally added on top of shmem interface so nothing
really new here.
David Hildenbrand Feb. 11, 2021, 9:48 a.m. UTC | #10
>> Some random thoughts regarding files.
>>
>> What is the page size of secretmem memory? Sometimes we use huge pages,
>> sometimes we fallback to 4k pages. So I assume huge pages in general?
> 
> Unless there is an explicit request for hugetlb I would say the page
> size is not really important like for any other fds. Huge pages can be
> used transparently.

If everything is currently allocated/mapped on PTE granularity, then yes 
I agree. I remember previous versions used to "pool 2MB pages", which 
might have been problematic (thus, my concerns regarding mmap() etc.). 
If that part is now gone, good!

>   
>> What are semantics of MADV()/FALLOCATE() etc on such files?
> 
> I would expect the same semantic as regular shmem (memfd_create) except
> the memory doesn't have _any_ backing storage which makes it
> unevictable. So the reclaim related madv won't work but there shouldn't
> be any real reason why e.g. MADV_DONTNEED, WILLNEED, DONT_FORK and
> others don't work.

Agreed if we don't have hugepage semantics.

>> Is userfaultfd() properly fenced? Or does it even work (doubt)?
>>
>> How does it behave if I mmap(FIXED) something in between?
>> In which granularity can I do that (->page-size?)?
> 
> Again, nothing really exceptional here. This is a mapping like any
> other from address space manipulation POV.

Agreed with the PTE mapping approach.

> 
>> What are other granularity restrictions (->page size)?
>>
>> Don't want to open a big discussion here, just some random thoughts.
>> Maybe it has all been already figured out and most of the answers
>> above are "Fails with -EINVAL".
> 
> I think that the behavior should be really in sync with shmem semantic
> as much as possible. Most operations should simply work with an
> aditional direct map manipulation. There is no real reason to be
> special. Some functionality might be missing, e.g. hugetlb support but
> that has been traditionally added on top of shmem interface so nothing
> really new here.

Agreed!
David Hildenbrand Feb. 11, 2021, 10:02 a.m. UTC | #11
On 11.02.21 10:38, Michal Hocko wrote:
> On Thu 11-02-21 10:01:32, David Hildenbrand wrote:
> [...]
>> AFAIKS, we would need MFD_SECRET and disallow
>> MFD_ALLOW_SEALING and MFD_HUGETLB.
> 
> Yes for an initial version. But I do expect a request to support both
> features is just a matter of time.
> 
>> In addition, we could add MFD_SECRET_NEVER_MAP, which could disallow any kind of
>> temporary mappings (eor migration). TBC.
> 
> I believe this is the mode Mike wants to have by default. A more relax
> one would be an opt-in. MFD_SECRET_RELAXED which would allow temporal
> mappings in the kernel for content copying (e.g. for migration).
> 
>> ---
>>
>> Some random thoughts regarding files.
>>
>> What is the page size of secretmem memory? Sometimes we use huge pages,
>> sometimes we fallback to 4k pages. So I assume huge pages in general?
> 
> Unless there is an explicit request for hugetlb I would say the page
> size is not really important like for any other fds. Huge pages can be
> used transparently.
>   
>> What are semantics of MADV()/FALLOCATE() etc on such files?
> 
> I would expect the same semantic as regular shmem (memfd_create) except
> the memory doesn't have _any_ backing storage which makes it
> unevictable. So the reclaim related madv won't work but there shouldn't
> be any real reason why e.g. MADV_DONTNEED, WILLNEED, DONT_FORK and
> others don't work.

Another thought regarding "doesn't have _any_ backing storage"

What are the right semantics when it comes to memory accounting/commit?

As secretmem does not have
a) any backing storage
b) cannot go to swap

The MAP_NORESERVE vs. !MAP_NORESERVE handling gets a little unclear. Why 
"reserve swap space" if the allocations cannot ever go to swap? Sure, we 
want to "reserve physical memory", but in contrast to other users that 
can go to swap.

Of course, this is only relevant for MAP_PRIVATE secretmem mappings. 
Other MAP_SHARED assumes there is no need for reserving swap space as it 
can just go to the backing storage. (yeah, tmpfs/shmem is weird in that 
regard as well, but again, it's a bit different)
Mike Rapoport Feb. 11, 2021, 11:20 a.m. UTC | #12
On Thu, Feb 11, 2021 at 09:39:38AM +0100, Michal Hocko wrote:
> On Thu 11-02-21 09:13:19, Mike Rapoport wrote:
> > On Tue, Feb 09, 2021 at 02:17:11PM +0100, Michal Hocko wrote:
> > > On Tue 09-02-21 11:09:38, Mike Rapoport wrote:
> [...]
> > > > Citing my older email:
> > > > 
> > > >     I've hesitated whether to continue to use new flags to memfd_create() or to
> > > >     add a new system call and I've decided to use a new system call after I've
> > > >     started to look into man pages update. There would have been two completely
> > > >     independent descriptions and I think it would have been very confusing.
> > > 
> > > Could you elaborate? Unmapping from the kernel address space can work
> > > both for sealed or hugetlb memfds, no? Those features are completely
> > > orthogonal AFAICS. With a dedicated syscall you will need to introduce
> > > this functionality on top if that is required. Have you considered that?
> > > I mean hugetlb pages are used to back guest memory very often. Is this
> > > something that will be a secret memory usecase?
> > > 
> > > Please be really specific when giving arguments to back a new syscall
> > > decision.
> > 
> > Isn't "syscalls have completely independent description" specific enough?
> 
> No, it's not as you can see from questions I've had above. More on that
> below.
> 
> > We are talking about API here, not the implementation details whether
> > secretmem supports large pages or not.
> > 
> > The purpose of memfd_create() is to create a file-like access to memory.
> > The purpose of memfd_secret() is to create a way to access memory hidden
> > from the kernel.
> > 
> > I don't think overloading memfd_create() with the secretmem flags because
> > they happen to return a file descriptor will be better for users, but
> > rather will be more confusing.
> 
> This is quite a subjective conclusion. I could very well argue that it
> would be much better to have a single syscall to get a fd backed memory
> with spedific requirements (sealing, unmapping from the kernel address
> space). 

> Neither of us would be clearly right or wrong.

100% agree :)

> A more important point is a future extensibility and usability, though.
> So let's just think of few usecases I have outlined above. Is it
> unrealistic to expect that secret memory should be sealable? What about
> hugetlb? Because if the answer is no then a new API is a clear win as the
> combination of flags would never work and then we would just suffer from
> the syscall multiplexing without much gain. On the other hand if
> combination of the functionality is to be expected then you will have to
> jam it into memfd_create and copy the interface likely causing more
> confusion. See what I mean?

I see your point, but I think that overloading memfd_create definitely gets
us into syscall multiplexing from day one and support for seals and huge
pages in the secretmem will not make it less of a multiplexer.

Sealing is anyway controlled via fcntl() and I don't think
MFD_ALLOW_SEALING makes much sense for the secretmem because it is there to
prevent rogue file sealing in tmpfs/hugetlbfs.

As for the huge pages, I'm not sure at all that supporting huge pages in
secretmem will involve hugetlbfs. And even if yes, adding SECRETMEM_HUGE
flag seems to me less confusing than saying "from kernel x.y you can use
MFD_CREATE | MFD_SECRET | MFD_HUGE" etc for all possible combinations.
 
> I by no means do not insist one way or the other but from what I have
> seen so far I have a feeling that the interface hasn't been thought
> through enough.

It has been, but we have different thoughts about it ;-)
Mike Rapoport Feb. 11, 2021, 11:27 a.m. UTC | #13
On Thu, Feb 11, 2021 at 10:01:32AM +0100, David Hildenbrand wrote:
> On 11.02.21 09:39, Michal Hocko wrote:
> > On Thu 11-02-21 09:13:19, Mike Rapoport wrote:
> > > On Tue, Feb 09, 2021 at 02:17:11PM +0100, Michal Hocko wrote:
> > > > On Tue 09-02-21 11:09:38, Mike Rapoport wrote:
> > [...]
> > > > > Citing my older email:
> > > > > 
> > > > >      I've hesitated whether to continue to use new flags to memfd_create() or to
> > > > >      add a new system call and I've decided to use a new system call after I've
> > > > >      started to look into man pages update. There would have been two completely
> > > > >      independent descriptions and I think it would have been very confusing.
> > > > 
> > > > Could you elaborate? Unmapping from the kernel address space can work
> > > > both for sealed or hugetlb memfds, no? Those features are completely
> > > > orthogonal AFAICS. With a dedicated syscall you will need to introduce
> > > > this functionality on top if that is required. Have you considered that?
> > > > I mean hugetlb pages are used to back guest memory very often. Is this
> > > > something that will be a secret memory usecase?
> > > > 
> > > > Please be really specific when giving arguments to back a new syscall
> > > > decision.
> > > 
> > > Isn't "syscalls have completely independent description" specific enough?
> > 
> > No, it's not as you can see from questions I've had above. More on that
> > below.
> > 
> > > We are talking about API here, not the implementation details whether
> > > secretmem supports large pages or not.
> > > 
> > > The purpose of memfd_create() is to create a file-like access to memory.
> > > The purpose of memfd_secret() is to create a way to access memory hidden
> > > from the kernel.
> > > 
> > > I don't think overloading memfd_create() with the secretmem flags because
> > > they happen to return a file descriptor will be better for users, but
> > > rather will be more confusing.
> > 
> > This is quite a subjective conclusion. I could very well argue that it
> > would be much better to have a single syscall to get a fd backed memory
> > with spedific requirements (sealing, unmapping from the kernel address
> > space). Neither of us would be clearly right or wrong. A more important
> > point is a future extensibility and usability, though. So let's just
> > think of few usecases I have outlined above. Is it unrealistic to expect
> > that secret memory should be sealable? What about hugetlb? Because if
> > the answer is no then a new API is a clear win as the combination of
> > flags would never work and then we would just suffer from the syscall
> > multiplexing without much gain. On the other hand if combination of the
> > functionality is to be expected then you will have to jam it into
> > memfd_create and copy the interface likely causing more confusion. See
> > what I mean?
> > 
> > I by no means do not insist one way or the other but from what I have
> > seen so far I have a feeling that the interface hasn't been thought
> > through enough. Sure you have landed with fd based approach and that
> > seems fair. But how to get that fd seems to still have some gaps IMHO.
> > 
> 
> I agree with Michal. This has been raised by different
> people already, including on LWN (https://lwn.net/Articles/835342/).
> 
> I can follow Mike's reasoning (man page), and I am also fine if there is
> a valid reason. However, IMHO the basic description seems to match quite good:
> 
>        memfd_create() creates an anonymous file and returns a file descriptor that refers to it.  The
>        file behaves like a regular file, and so can be modified, truncated, memory-mapped, and so on.
>        However,  unlike a regular file, it lives in RAM and has a volatile backing storage.  Once all
>        references to the file are dropped, it is automatically released.  Anonymous  memory  is  used
>        for  all  backing pages of the file.  Therefore, files created by memfd_create() have the same
>        semantics as other anonymous memory allocations such as those allocated using mmap(2) with the
>        MAP_ANONYMOUS flag.

Even despite my laziness and huge amount of copy-paste you can spot the
differences (this is a very old version, update is due):

       memfd_secret()  creates an anonymous file and returns a file descriptor
       that refers to it.  The file can only be memory-mapped; the  memory  in
       such  mapping  will  have  stronger protection than usual memory mapped
       files, and so it can be used to store application  secrets.   Unlike  a
       regular file, a file created with memfd_secret() lives in RAM and has a
       volatile backing storage.  Once all references to the file are dropped,
       it  is  automatically released.  The initial size of the file is set to
       0.  Following the call, the file size should be set using ftruncate(2).

       The memory areas obtained with mmap(2) from the file descriptor are ex‐
       clusive to the owning context.  These areas are removed from the kernel
       page tables and only the page table of the process holding the file de‐
       scriptor maps the corresponding physical memory.
 
> AFAIKS, we would need MFD_SECRET and disallow
> MFD_ALLOW_SEALING and MFD_HUGETLB.

So here we start to multiplex.

> In addition, we could add MFD_SECRET_NEVER_MAP, which could disallow any kind of
> temporary mappings (eor migration). TBC.

Never map is the default. When we'll need to map we'll add an explicit flag
for it.
Mike Rapoport Feb. 11, 2021, 11:29 a.m. UTC | #14
On Thu, Feb 11, 2021 at 11:02:07AM +0100, David Hildenbrand wrote:
> 
> Another thought regarding "doesn't have _any_ backing storage"
> 
> What are the right semantics when it comes to memory accounting/commit?
> 
> As secretmem does not have
> a) any backing storage
> b) cannot go to swap
> 
> The MAP_NORESERVE vs. !MAP_NORESERVE handling gets a little unclear. Why
> "reserve swap space" if the allocations cannot ever go to swap? Sure, we
> want to "reserve physical memory", but in contrast to other users that can
> go to swap.
> 
> Of course, this is only relevant for MAP_PRIVATE secretmem mappings. Other
> MAP_SHARED assumes there is no need for reserving swap space as it can just
> go to the backing storage. (yeah, tmpfs/shmem is weird in that regard as
> well, but again, it's a bit different)

In that sense seceremem is as weird as tmpfs and it only allows MAP_SHARED.
David Hildenbrand Feb. 11, 2021, 12:07 p.m. UTC | #15
On 11.02.21 12:27, Mike Rapoport wrote:
> On Thu, Feb 11, 2021 at 10:01:32AM +0100, David Hildenbrand wrote:
>> On 11.02.21 09:39, Michal Hocko wrote:
>>> On Thu 11-02-21 09:13:19, Mike Rapoport wrote:
>>>> On Tue, Feb 09, 2021 at 02:17:11PM +0100, Michal Hocko wrote:
>>>>> On Tue 09-02-21 11:09:38, Mike Rapoport wrote:
>>> [...]
>>>>>> Citing my older email:
>>>>>>
>>>>>>       I've hesitated whether to continue to use new flags to memfd_create() or to
>>>>>>       add a new system call and I've decided to use a new system call after I've
>>>>>>       started to look into man pages update. There would have been two completely
>>>>>>       independent descriptions and I think it would have been very confusing.
>>>>>
>>>>> Could you elaborate? Unmapping from the kernel address space can work
>>>>> both for sealed or hugetlb memfds, no? Those features are completely
>>>>> orthogonal AFAICS. With a dedicated syscall you will need to introduce
>>>>> this functionality on top if that is required. Have you considered that?
>>>>> I mean hugetlb pages are used to back guest memory very often. Is this
>>>>> something that will be a secret memory usecase?
>>>>>
>>>>> Please be really specific when giving arguments to back a new syscall
>>>>> decision.
>>>>
>>>> Isn't "syscalls have completely independent description" specific enough?
>>>
>>> No, it's not as you can see from questions I've had above. More on that
>>> below.
>>>
>>>> We are talking about API here, not the implementation details whether
>>>> secretmem supports large pages or not.
>>>>
>>>> The purpose of memfd_create() is to create a file-like access to memory.
>>>> The purpose of memfd_secret() is to create a way to access memory hidden
>>>> from the kernel.
>>>>
>>>> I don't think overloading memfd_create() with the secretmem flags because
>>>> they happen to return a file descriptor will be better for users, but
>>>> rather will be more confusing.
>>>
>>> This is quite a subjective conclusion. I could very well argue that it
>>> would be much better to have a single syscall to get a fd backed memory
>>> with spedific requirements (sealing, unmapping from the kernel address
>>> space). Neither of us would be clearly right or wrong. A more important
>>> point is a future extensibility and usability, though. So let's just
>>> think of few usecases I have outlined above. Is it unrealistic to expect
>>> that secret memory should be sealable? What about hugetlb? Because if
>>> the answer is no then a new API is a clear win as the combination of
>>> flags would never work and then we would just suffer from the syscall
>>> multiplexing without much gain. On the other hand if combination of the
>>> functionality is to be expected then you will have to jam it into
>>> memfd_create and copy the interface likely causing more confusion. See
>>> what I mean?
>>>
>>> I by no means do not insist one way or the other but from what I have
>>> seen so far I have a feeling that the interface hasn't been thought
>>> through enough. Sure you have landed with fd based approach and that
>>> seems fair. But how to get that fd seems to still have some gaps IMHO.
>>>
>>
>> I agree with Michal. This has been raised by different
>> people already, including on LWN (https://lwn.net/Articles/835342/).
>>
>> I can follow Mike's reasoning (man page), and I am also fine if there is
>> a valid reason. However, IMHO the basic description seems to match quite good:
>>
>>         memfd_create() creates an anonymous file and returns a file descriptor that refers to it.  The
>>         file behaves like a regular file, and so can be modified, truncated, memory-mapped, and so on.
>>         However,  unlike a regular file, it lives in RAM and has a volatile backing storage.  Once all
>>         references to the file are dropped, it is automatically released.  Anonymous  memory  is  used
>>         for  all  backing pages of the file.  Therefore, files created by memfd_create() have the same
>>         semantics as other anonymous memory allocations such as those allocated using mmap(2) with the
>>         MAP_ANONYMOUS flag.
> 
> Even despite my laziness and huge amount of copy-paste you can spot the
> differences (this is a very old version, update is due):
> 
>         memfd_secret()  creates an anonymous file and returns a file descriptor
>         that refers to it.  The file can only be memory-mapped; the  memory  in
>         such  mapping  will  have  stronger protection than usual memory mapped
>         files, and so it can be used to store application  secrets.   Unlike  a
>         regular file, a file created with memfd_secret() lives in RAM and has a
>         volatile backing storage.  Once all references to the file are dropped,
>         it  is  automatically released.  The initial size of the file is set to
>         0.  Following the call, the file size should be set using ftruncate(2).
> 
>         The memory areas obtained with mmap(2) from the file descriptor are ex‐
>         clusive to the owning context.  These areas are removed from the kernel
>         page tables and only the page table of the process holding the file de‐
>         scriptor maps the corresponding physical memory.
>   

So let's talk about the main user-visible differences to other memfd 
files (especially, other purely virtual files like hugetlbfs). With 
secretmem:

- File content can only be read/written via memory mappings.
- File content cannot be swapped out.

I think there are still valid ways to modify file content using 
syscalls: e.g., fallocate(PUNCH_HOLE). Things like truncate also seems 
to work just fine.

What else?


>> AFAIKS, we would need MFD_SECRET and disallow
>> MFD_ALLOW_SEALING and MFD_HUGETLB.
> 
> So here we start to multiplex.

Yes. And as Michal said, maybe we can support combinations in the future.

> 
>> In addition, we could add MFD_SECRET_NEVER_MAP, which could disallow any kind of
>> temporary mappings (eor migration). TBC.
> 
> Never map is the default. When we'll need to map we'll add an explicit flag
> for it.

No strong opinion. (I'd try to hurt the kernel less as default)
Michal Hocko Feb. 11, 2021, 12:30 p.m. UTC | #16
On Thu 11-02-21 13:20:08, Mike Rapoport wrote:
[...]
> Sealing is anyway controlled via fcntl() and I don't think
> MFD_ALLOW_SEALING makes much sense for the secretmem because it is there to
> prevent rogue file sealing in tmpfs/hugetlbfs.

This doesn't really match my understanding. The primary usecase for the
sealing is to safely and predictably coordinate over shared memory. I
absolutely do not see why this would be incompatible with an additional
requirement to unmap the memory from the kernel to prevent additional
interference from the kernel side. Quite contrary it looks like a very
nice extension to this model.
 
> As for the huge pages, I'm not sure at all that supporting huge pages in
> secretmem will involve hugetlbfs.

Have a look how hugetlb proliferates through our MM APIs. I strongly
suspect this is strong signal that this won't be any different.

> And even if yes, adding SECRETMEM_HUGE
> flag seems to me less confusing than saying "from kernel x.y you can use
> MFD_CREATE | MFD_SECRET | MFD_HUGE" etc for all possible combinations.

I really fail to see your point. This is a standard model we have. It is
quite natural that flags are added. Moreover adding a new syscall will
not make it any less of a problem.

> > I by no means do not insist one way or the other but from what I have
> > seen so far I have a feeling that the interface hasn't been thought
> > through enough.
> 
> It has been, but we have different thoughts about it ;-)

Then you must be carrying a lot of implicit knowledge which I want you
to document.
Mike Rapoport Feb. 11, 2021, 10:59 p.m. UTC | #17
On Thu, Feb 11, 2021 at 01:30:42PM +0100, Michal Hocko wrote:
> On Thu 11-02-21 13:20:08, Mike Rapoport wrote:
> [...]
> > Sealing is anyway controlled via fcntl() and I don't think
> > MFD_ALLOW_SEALING makes much sense for the secretmem because it is there to
> > prevent rogue file sealing in tmpfs/hugetlbfs.
> 
> This doesn't really match my understanding. The primary usecase for the
> sealing is to safely and predictably coordinate over shared memory. I
> absolutely do not see why this would be incompatible with an additional
> requirement to unmap the memory from the kernel to prevent additional
> interference from the kernel side. Quite contrary it looks like a very
> nice extension to this model.

I didn't mean that secretmem should not support sealing. I meant that
MFD_ALLOW_SEALING flag does not make sense. Unlike tmpfs, the secretmem fd
does not need protection from somebody unexpectedly sealing it.

> > As for the huge pages, I'm not sure at all that supporting huge pages in
> > secretmem will involve hugetlbfs.
> 
> Have a look how hugetlb proliferates through our MM APIs. I strongly
> suspect this is strong signal that this won't be any different.
> 
> > And even if yes, adding SECRETMEM_HUGE
> > flag seems to me less confusing than saying "from kernel x.y you can use
> > MFD_CREATE | MFD_SECRET | MFD_HUGE" etc for all possible combinations.
> 
> I really fail to see your point. This is a standard model we have. It is
> quite natural that flags are added. Moreover adding a new syscall will
> not make it any less of a problem.

Nowadays adding a new syscall is not as costly as it used to be. And I
think it'll provide better extensibility when new features would be added
to secretmem. 

For instance, for creating a secretmem fd backed with sealing we'd have

	memfd_secretm(SECRETMEM_HUGE);

rather than

	memfd_create(MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_SECRET);


Besides, if we overload memfd_secret we add complexity to flags validation
of allowable flag combinations even with the simplest initial
implementation.
And what it will become when more features are added to secretmem?
 
> > > I by no means do not insist one way or the other but from what I have
> > > seen so far I have a feeling that the interface hasn't been thought
> > > through enough.
> > 
> > It has been, but we have different thoughts about it ;-)
> 
> Then you must be carrying a lot of implicit knowledge which I want you
> to document.

I don't have any implicit knowledge, we just have a different perspective.
Mike Rapoport Feb. 11, 2021, 11:09 p.m. UTC | #18
On Thu, Feb 11, 2021 at 01:07:10PM +0100, David Hildenbrand wrote:
> On 11.02.21 12:27, Mike Rapoport wrote:
> > On Thu, Feb 11, 2021 at 10:01:32AM +0100, David Hildenbrand wrote:
> 
> So let's talk about the main user-visible differences to other memfd files
> (especially, other purely virtual files like hugetlbfs). With secretmem:
> 
> - File content can only be read/written via memory mappings.
> - File content cannot be swapped out.
> 
> I think there are still valid ways to modify file content using syscalls:
> e.g., fallocate(PUNCH_HOLE). Things like truncate also seems to work just
> fine.
 
These work perfectly with any file, so maybe we should have added
memfd_create as a flag to open(2) back then and now the secretmem file
descriptors?
 
> > > AFAIKS, we would need MFD_SECRET and disallow
> > > MFD_ALLOW_SEALING and MFD_HUGETLB.
> > 
> > So here we start to multiplex.
> 
> Yes. And as Michal said, maybe we can support combinations in the future.

Isn't there a general agreement that syscall multiplexing is not a good
thing?
memfd_create already has flags validation that does not look very nice.
Adding there only MFD_SECRET will make it a bit less nice, but when we'll
grow new functionality into secretmem that will become horrible.
Michal Hocko Feb. 12, 2021, 9:02 a.m. UTC | #19
On Fri 12-02-21 00:59:29, Mike Rapoport wrote:
> On Thu, Feb 11, 2021 at 01:30:42PM +0100, Michal Hocko wrote:
[...]
> > Have a look how hugetlb proliferates through our MM APIs. I strongly
> > suspect this is strong signal that this won't be any different.
> > 
> > > And even if yes, adding SECRETMEM_HUGE
> > > flag seems to me less confusing than saying "from kernel x.y you can use
> > > MFD_CREATE | MFD_SECRET | MFD_HUGE" etc for all possible combinations.
> > 
> > I really fail to see your point. This is a standard model we have. It is
> > quite natural that flags are added. Moreover adding a new syscall will
> > not make it any less of a problem.
> 
> Nowadays adding a new syscall is not as costly as it used to be. And I
> think it'll provide better extensibility when new features would be added
> to secretmem. 
> 
> For instance, for creating a secretmem fd backed with sealing we'd have
> 
> 	memfd_secretm(SECRETMEM_HUGE);

You mean SECRETMEM_HUGE_1G_AND_SEALED or SECRET_HUGE_2MB_WITHOUT_SEALED?
This would be rather an antipatern to our flags design, no? Really there
are orthogonal requirements here and there is absolutely zero reason
to smash everything into a single thing. It is just perfectly fine to
combine those functionalities without a pre-described way how to do
that.

> rather than
> 
> 	memfd_create(MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_SECRET);
> 
> 
> Besides, if we overload memfd_secret we add complexity to flags validation
> of allowable flag combinations even with the simplest initial
> implementation.

This is the least of my worry, really. The existing code in
memfd_create, unlike others legacy interfaces, allows extensions just
fine.

> And what it will become when more features are added to secretmem?

Example?

> > > > I by no means do not insist one way or the other but from what I have
> > > > seen so far I have a feeling that the interface hasn't been thought
> > > > through enough.
> > > 
> > > It has been, but we have different thoughts about it ;-)
> > 
> > Then you must be carrying a lot of implicit knowledge which I want you
> > to document.
> 
> I don't have any implicit knowledge, we just have a different perspective.

OK, I will stop discussing now because it doesn't really seem to lead
anywhere.

Just to recap my current understanding. Your main argument so far is
that this is somehow special and you believe it would be confusing
to use an existing interface. I beg to disagree here because memfd
interface is exactly a way to get a file handle to describe a memory
which is what you want. About the only thing that secretmem is special
is that it only operates on mapped areas and read/write interface is
not supported (but I do not see a fundamental reason this couldn't be
added in the future). All the rest is just operating on a memory backed
file. I envison the hugetlb support will follow and sealing sounds like
a useful thing to be requested as well.  All that would have to be
added to a new syscall over time and then we will land at two parallel
interface supporting a largerly overlapping feature set.

To me all the above sounds to be much stronher argument than your worry
this might be confusing.

I will not insist on this but you should have some more thought on those
arguments.
David Hildenbrand Feb. 12, 2021, 9:18 a.m. UTC | #20
On 12.02.21 00:09, Mike Rapoport wrote:
> On Thu, Feb 11, 2021 at 01:07:10PM +0100, David Hildenbrand wrote:
>> On 11.02.21 12:27, Mike Rapoport wrote:
>>> On Thu, Feb 11, 2021 at 10:01:32AM +0100, David Hildenbrand wrote:
>>
>> So let's talk about the main user-visible differences to other memfd files
>> (especially, other purely virtual files like hugetlbfs). With secretmem:
>>
>> - File content can only be read/written via memory mappings.
>> - File content cannot be swapped out.
>>
>> I think there are still valid ways to modify file content using syscalls:
>> e.g., fallocate(PUNCH_HOLE). Things like truncate also seems to work just
>> fine.
>   
> These work perfectly with any file, so maybe we should have added
> memfd_create as a flag to open(2) back then and now the secretmem file
> descriptors?

I think open() vs memfd_create() makes sense: for open, the path 
specifies main properties (tmpfs, hugetlbfs, filesystem). On memfd, 
there is no such path and the "type" has to be specified differently.

Also, open() might open existing files - memfd always creates new files.

>   
>>>> AFAIKS, we would need MFD_SECRET and disallow
>>>> MFD_ALLOW_SEALING and MFD_HUGETLB.
>>>
>>> So here we start to multiplex.
>>
>> Yes. And as Michal said, maybe we can support combinations in the future.
> 
> Isn't there a general agreement that syscall multiplexing is not a good
> thing?

Looking at mmap(), madvise(), fallocate(), I think multiplexing is just 
fine and flags can be mutually exclusive - as long as we're not 
squashing completely unrelated things into a single system call.

As one example: we don't have mmap_private() vs. mmap_shared() vs. 
mmap_shared_validate(). E.g., MAP_SYNC is only available for 
MAP_SHARED_VALIDATE.


> memfd_create already has flags validation that does not look very nice.

I assume you're talking about the hugetlb size specifications, right? 
It's not nice but fairly compact.

> Adding there only MFD_SECRET will make it a bit less nice, but when we'll
> grow new functionality into secretmem that will become horrible.

What do you have in mind? A couple of MFD_SECRET_* flags that only work 
with MFD_SECRET won't hurt IMHO. Just like we allow MFD_HUGE_* only with 
MFD_HUGETLB.

Thanks,

David / dhildenb
Mike Rapoport Feb. 14, 2021, 9:19 a.m. UTC | #21
On Fri, Feb 12, 2021 at 10:18:19AM +0100, David Hildenbrand wrote:
> On 12.02.21 00:09, Mike Rapoport wrote:
> > On Thu, Feb 11, 2021 at 01:07:10PM +0100, David Hildenbrand wrote:
> > > On 11.02.21 12:27, Mike Rapoport wrote:
> > > > On Thu, Feb 11, 2021 at 10:01:32AM +0100, David Hildenbrand wrote:
> > > 
> > > So let's talk about the main user-visible differences to other memfd files
> > > (especially, other purely virtual files like hugetlbfs). With secretmem:
> > > 
> > > - File content can only be read/written via memory mappings.
> > > - File content cannot be swapped out.
> > > 
> > > I think there are still valid ways to modify file content using syscalls:
> > > e.g., fallocate(PUNCH_HOLE). Things like truncate also seems to work just
> > > fine.
> > These work perfectly with any file, so maybe we should have added
> > memfd_create as a flag to open(2) back then and now the secretmem file
> > descriptors?
> 
> I think open() vs memfd_create() makes sense: for open, the path specifies
> main properties (tmpfs, hugetlbfs, filesystem). On memfd, there is no such
> path and the "type" has to be specified differently.
> 
> Also, open() might open existing files - memfd always creates new files.

Yes, but still open() returns a handle to a file and memfd_create() returns
a handle to a file. The differences may be well hidden by e.g. O_MEMORY and
than features unique to memfd files will have their set of O_SOMETHING
flags.

It's the same logic that says "we already have an interface that's close
enough and it's fine to add a bunch of new flags there".
 
And here we come to the question "what are the differences that justify a
new system call?" and the answer to this is very subjective. And as such we
can continue bikeshedding forever.
David Hildenbrand Feb. 14, 2021, 9:58 a.m. UTC | #22
> Am 14.02.2021 um 10:20 schrieb Mike Rapoport <rppt@kernel.org>:
> 
> On Fri, Feb 12, 2021 at 10:18:19AM +0100, David Hildenbrand wrote:
>>> On 12.02.21 00:09, Mike Rapoport wrote:
>>> On Thu, Feb 11, 2021 at 01:07:10PM +0100, David Hildenbrand wrote:
>>>> On 11.02.21 12:27, Mike Rapoport wrote:
>>>>> On Thu, Feb 11, 2021 at 10:01:32AM +0100, David Hildenbrand wrote:
>>>> 
>>>> So let's talk about the main user-visible differences to other memfd files
>>>> (especially, other purely virtual files like hugetlbfs). With secretmem:
>>>> 
>>>> - File content can only be read/written via memory mappings.
>>>> - File content cannot be swapped out.
>>>> 
>>>> I think there are still valid ways to modify file content using syscalls:
>>>> e.g., fallocate(PUNCH_HOLE). Things like truncate also seems to work just
>>>> fine.
>>> These work perfectly with any file, so maybe we should have added
>>> memfd_create as a flag to open(2) back then and now the secretmem file
>>> descriptors?
>> 
>> I think open() vs memfd_create() makes sense: for open, the path specifies
>> main properties (tmpfs, hugetlbfs, filesystem). On memfd, there is no such
>> path and the "type" has to be specified differently.
>> 
>> Also, open() might open existing files - memfd always creates new files.
> 
> Yes, but still open() returns a handle to a file and memfd_create() returns
> a handle to a file. The differences may be well hidden by e.g. O_MEMORY and
> than features unique to memfd files will have their set of O_SOMETHING
> flags.
> 

Let‘s agree to disagree.

> It's the same logic that says "we already have an interface that's close
> enough and it's fine to add a bunch of new flags there".

No, not quite. But let‘s agree to disagree.

> 
> And here we come to the question "what are the differences that justify a
> new system call?" and the answer to this is very subjective. And as such we
> can continue bikeshedding forever.

I think this fits into the existing memfd_create() syscall just fine, and I heard no compelling argument why it shouldn‘t. That‘s all I can say.
James Bottomley Feb. 14, 2021, 7:21 p.m. UTC | #23
On Sun, 2021-02-14 at 10:58 +0100, David Hildenbrand wrote:
[...]
> > And here we come to the question "what are the differences that
> > justify a new system call?" and the answer to this is very
> > subjective. And as such we can continue bikeshedding forever.
> 
> I think this fits into the existing memfd_create() syscall just fine,
> and I heard no compelling argument why it shouldn‘t. That‘s all I can
> say.

OK, so let's review history.  In the first two incarnations of the
patch, it was an extension of memfd_create().  The specific objection
by Kirill Shutemov was that it doesn't share any code in common with
memfd and so should be a separate system call:

https://lore.kernel.org/linux-api/20200713105812.dnwtdhsuyj3xbh4f@box/

The other objection raised offlist is that if we do use memfd_create,
then we have to add all the secret memory flags as an additional ioctl,
whereas they can be specified on open if we do a separate system call. 
The container people violently objected to the ioctl because it can't
be properly analysed by seccomp and much preferred the syscall version.

Since we're dumping the uncached variant, the ioctl problem disappears
but so does the possibility of ever adding it back if we take on the
container peoples' objection.  This argues for a separate syscall
because we can add additional features and extend the API with flags
without causing anti-ioctl riots.

James
Michal Hocko Feb. 15, 2021, 9:13 a.m. UTC | #24
On Sun 14-02-21 11:21:02, James Bottomley wrote:
> On Sun, 2021-02-14 at 10:58 +0100, David Hildenbrand wrote:
> [...]
> > > And here we come to the question "what are the differences that
> > > justify a new system call?" and the answer to this is very
> > > subjective. And as such we can continue bikeshedding forever.
> > 
> > I think this fits into the existing memfd_create() syscall just fine,
> > and I heard no compelling argument why it shouldn‘t. That‘s all I can
> > say.
> 
> OK, so let's review history.  In the first two incarnations of the
> patch, it was an extension of memfd_create().  The specific objection
> by Kirill Shutemov was that it doesn't share any code in common with
> memfd and so should be a separate system call:
> 
> https://lore.kernel.org/linux-api/20200713105812.dnwtdhsuyj3xbh4f@box/

Thanks for the pointer. But this argument hasn't been challenged at all.
It hasn't been brought up that the overlap would be considerable higher
by the hugetlb/sealing support. And so far nobody has claimed those
combinations as unviable.

> The other objection raised offlist is that if we do use memfd_create,
> then we have to add all the secret memory flags as an additional ioctl,
> whereas they can be specified on open if we do a separate system call. 
> The container people violently objected to the ioctl because it can't
> be properly analysed by seccomp and much preferred the syscall version.
> 
> Since we're dumping the uncached variant, the ioctl problem disappears
> but so does the possibility of ever adding it back if we take on the
> container peoples' objection.  This argues for a separate syscall
> because we can add additional features and extend the API with flags
> without causing anti-ioctl riots.

I am sorry but I do not understand this argument. What kind of flags are
we talking about and why would that be a problem with memfd_create
interface? Could you be more specific please?
James Bottomley Feb. 15, 2021, 6:14 p.m. UTC | #25
On Mon, 2021-02-15 at 10:13 +0100, Michal Hocko wrote:
> On Sun 14-02-21 11:21:02, James Bottomley wrote:
> > On Sun, 2021-02-14 at 10:58 +0100, David Hildenbrand wrote:
> > [...]
> > > > And here we come to the question "what are the differences that
> > > > justify a new system call?" and the answer to this is very
> > > > subjective. And as such we can continue bikeshedding forever.
> > > 
> > > I think this fits into the existing memfd_create() syscall just
> > > fine, and I heard no compelling argument why it shouldn‘t. That‘s
> > > all I can say.
> > 
> > OK, so let's review history.  In the first two incarnations of the
> > patch, it was an extension of memfd_create().  The specific
> > objection by Kirill Shutemov was that it doesn't share any code in
> > common with memfd and so should be a separate system call:
> > 
> > https://lore.kernel.org/linux-api/20200713105812.dnwtdhsuyj3xbh4f@box/
> 
> Thanks for the pointer. But this argument hasn't been challenged at
> all. It hasn't been brought up that the overlap would be considerable
> higher by the hugetlb/sealing support. And so far nobody has claimed
> those combinations as unviable.

Kirill is actually interested in the sealing path for his KVM code so
we took a look.  There might be a two line overlap in memfd_create for
the seal case, but there's no real overlap in memfd_add_seals which is
the bulk of the code.  So the best way would seem to lift the inode ...
-> seals helpers to be non-static so they can be reused and roll our
own add_seals.

I can't see a use case at all for hugetlb support, so it seems to be a
bit of an angels on pin head discussion.  However, if one were to come
along handling it in the same way seems reasonable.

> > The other objection raised offlist is that if we do use
> > memfd_create, then we have to add all the secret memory flags as an
> > additional ioctl, whereas they can be specified on open if we do a
> > separate system call.  The container people violently objected to
> > the ioctl because it can't be properly analysed by seccomp and much
> > preferred the syscall version.
> > 
> > Since we're dumping the uncached variant, the ioctl problem
> > disappears but so does the possibility of ever adding it back if we
> > take on the container peoples' objection.  This argues for a
> > separate syscall because we can add additional features and extend
> > the API with flags without causing anti-ioctl riots.
> 
> I am sorry but I do not understand this argument.

You don't understand why container guarding technology doesn't like
ioctls?  The problem is each ioctl is the multiplexor is specific to
the particular fd implementation, so unlike fcntl you don't have global
ioctl numbers (although we do try to separate the space somewhat with
the _IO macros).  This makes analysis and blocking a hard problem for
container seccomp.

>  What kind of flags are we talking about and why would that be a
> problem with memfd_create interface? Could you be more specific
> please?

You mean what were the ioctl flags in the patch series linked above? 
They were SECRETMEM_EXCLUSIVE and SECRETMEM_UNCACHED in patch 3/5. 
They were eventually dropped after v10, because of problems with
architectural semantics, with the idea that it could be added back
again if a compelling need arose:

https://lore.kernel.org/linux-api/20201123095432.5860-1-rppt@kernel.org/

In theory the extra flags could be multiplexed into the memfd_create
flags like hugetlbfs is but with 32 flags and a lot already taken it
gets messy for expansion.  When we run out of flags the first question
people will ask is "why didn't you do separate system calls?".

James
Michal Hocko Feb. 15, 2021, 7:20 p.m. UTC | #26
On Mon 15-02-21 10:14:43, James Bottomley wrote:
> On Mon, 2021-02-15 at 10:13 +0100, Michal Hocko wrote:
> > On Sun 14-02-21 11:21:02, James Bottomley wrote:
> > > On Sun, 2021-02-14 at 10:58 +0100, David Hildenbrand wrote:
> > > [...]
> > > > > And here we come to the question "what are the differences that
> > > > > justify a new system call?" and the answer to this is very
> > > > > subjective. And as such we can continue bikeshedding forever.
> > > > 
> > > > I think this fits into the existing memfd_create() syscall just
> > > > fine, and I heard no compelling argument why it shouldn‘t. That‘s
> > > > all I can say.
> > > 
> > > OK, so let's review history.  In the first two incarnations of the
> > > patch, it was an extension of memfd_create().  The specific
> > > objection by Kirill Shutemov was that it doesn't share any code in
> > > common with memfd and so should be a separate system call:
> > > 
> > > https://lore.kernel.org/linux-api/20200713105812.dnwtdhsuyj3xbh4f@box/
> > 
> > Thanks for the pointer. But this argument hasn't been challenged at
> > all. It hasn't been brought up that the overlap would be considerable
> > higher by the hugetlb/sealing support. And so far nobody has claimed
> > those combinations as unviable.
> 
> Kirill is actually interested in the sealing path for his KVM code so
> we took a look.  There might be a two line overlap in memfd_create for
> the seal case, but there's no real overlap in memfd_add_seals which is
> the bulk of the code.  So the best way would seem to lift the inode ...
> -> seals helpers to be non-static so they can be reused and roll our
> own add_seals.

These are implementation details which are not really relevant to the
API IMHO. 

> I can't see a use case at all for hugetlb support, so it seems to be a
> bit of an angels on pin head discussion.  However, if one were to come
> along handling it in the same way seems reasonable.

Those angels have made their way to mmap, System V shm, memfd_create and
other MM interfaces which have never envisioned when introduced. Hugetlb
pages to back guest memory is quite a common usecase so why do you think
those guests wouldn't like to see their memory be "secret"?

As I've said in my last response (YCZEGuLK94szKZDf@dhcp22.suse.cz), I am
not going to argue all these again. I have made my point and you are
free to take it or leave it.

> > > The other objection raised offlist is that if we do use
> > > memfd_create, then we have to add all the secret memory flags as an
> > > additional ioctl, whereas they can be specified on open if we do a
> > > separate system call.  The container people violently objected to
> > > the ioctl because it can't be properly analysed by seccomp and much
> > > preferred the syscall version.
> > > 
> > > Since we're dumping the uncached variant, the ioctl problem
> > > disappears but so does the possibility of ever adding it back if we
> > > take on the container peoples' objection.  This argues for a
> > > separate syscall because we can add additional features and extend
> > > the API with flags without causing anti-ioctl riots.
> > 
> > I am sorry but I do not understand this argument.
> 
> You don't understand why container guarding technology doesn't like
> ioctls?

No, I did not see where the ioctl argument came from.

[...]

> >  What kind of flags are we talking about and why would that be a
> > problem with memfd_create interface? Could you be more specific
> > please?
> 
> You mean what were the ioctl flags in the patch series linked above? 
> They were SECRETMEM_EXCLUSIVE and SECRETMEM_UNCACHED in patch 3/5. 

OK I see. How many potential modes are we talking about? A few or
potentially many?

> They were eventually dropped after v10, because of problems with
> architectural semantics, with the idea that it could be added back
> again if a compelling need arose:
> 
> https://lore.kernel.org/linux-api/20201123095432.5860-1-rppt@kernel.org/
> 
> In theory the extra flags could be multiplexed into the memfd_create
> flags like hugetlbfs is but with 32 flags and a lot already taken it
> gets messy for expansion.  When we run out of flags the first question
> people will ask is "why didn't you do separate system calls?".

OK, I do not necessarily see a lack of flag space a problem. I can be
wrong here but I do not see how that would be solved by a separate
syscall when it sounds rather forseeable that many modes supported by
memfd_create will eventually find their way to a secret memory as well.
If for no other reason, secret memory is nothing really special. It is
just a memory which is not mapped to the kernel via 1:1 mapping. That's
it. And that can be applied to any memory provided to the userspace.

But I am repeating myself again here so I better stop.
James Bottomley Feb. 16, 2021, 4:25 p.m. UTC | #27
On Mon, 2021-02-15 at 20:20 +0100, Michal Hocko wrote:
[...]
> > >   What kind of flags are we talking about and why would that be a
> > > problem with memfd_create interface? Could you be more specific
> > > please?
> > 
> > You mean what were the ioctl flags in the patch series linked
> > above? They were SECRETMEM_EXCLUSIVE and SECRETMEM_UNCACHED in
> > patch 3/5. 
> 
> OK I see. How many potential modes are we talking about? A few or
> potentially many?
 
Well I initially thought there were two (uncached or not) until you
came up with the migratable or non-migratable, which affects the
security properties.  But now there's also potential for hardware
backing, like mktme,  described by flags as well.  I suppose you could
also use RDT to restrict which cache the data goes into: say L1 but not
L2 on to lessen the impact of fully uncached (although the big thrust
of uncached was to blunt hyperthread side channels).  So there is
potential for quite a large expansion even though I'd be willing to bet
that a lot of the modes people have thought about turn out not to be
very effective in the field.

James
David Hildenbrand Feb. 16, 2021, 4:34 p.m. UTC | #28
On 16.02.21 17:25, James Bottomley wrote:
> On Mon, 2021-02-15 at 20:20 +0100, Michal Hocko wrote:
> [...]
>>>>    What kind of flags are we talking about and why would that be a
>>>> problem with memfd_create interface? Could you be more specific
>>>> please?
>>>
>>> You mean what were the ioctl flags in the patch series linked
>>> above? They were SECRETMEM_EXCLUSIVE and SECRETMEM_UNCACHED in
>>> patch 3/5.
>>
>> OK I see. How many potential modes are we talking about? A few or
>> potentially many?
>   
> Well I initially thought there were two (uncached or not) until you
> came up with the migratable or non-migratable, which affects the
> security properties.  But now there's also potential for hardware
> backing, like mktme,  described by flags as well.  I suppose you could
> also use RDT to restrict which cache the data goes into: say L1 but not
> L2 on to lessen the impact of fully uncached (although the big thrust
> of uncached was to blunt hyperthread side channels).  So there is
> potential for quite a large expansion even though I'd be willing to bet
> that a lot of the modes people have thought about turn out not to be
> very effective in the field.

Thanks for the insight. I remember that even the "uncached" parts was 
effectively nacked by x86 maintainers (I might be wrong). For the other 
parts, the question is what we actually want to let user space configure.

Being able to specify "Very secure" "maximum secure" "average secure" 
all doesn't really make sense to me. The discussion regarding 
migratability only really popped up because this is a user-visible thing 
and not being able to migrate can be a real problem (fragmentation, 
ZONE_MOVABLE, ...).
James Bottomley Feb. 16, 2021, 4:44 p.m. UTC | #29
On Tue, 2021-02-16 at 17:34 +0100, David Hildenbrand wrote:
> On 16.02.21 17:25, James Bottomley wrote:
> > On Mon, 2021-02-15 at 20:20 +0100, Michal Hocko wrote:
> > [...]
> > > > >    What kind of flags are we talking about and why would that
> > > > > be a problem with memfd_create interface? Could you be more
> > > > > specific please?
> > > > 
> > > > You mean what were the ioctl flags in the patch series linked
> > > > above? They were SECRETMEM_EXCLUSIVE and SECRETMEM_UNCACHED in
> > > > patch 3/5.
> > > 
> > > OK I see. How many potential modes are we talking about? A few or
> > > potentially many?
> >   
> > Well I initially thought there were two (uncached or not) until you
> > came up with the migratable or non-migratable, which affects the
> > security properties.  But now there's also potential for hardware
> > backing, like mktme,  described by flags as well.  I suppose you
> > could also use RDT to restrict which cache the data goes into: say
> > L1 but not L2 on to lessen the impact of fully uncached (although
> > the big thrust of uncached was to blunt hyperthread side
> > channels).  So there is potential for quite a large expansion even
> > though I'd be willing to bet that a lot of the modes people have
> > thought about turn out not to be very effective in the field.
> 
> Thanks for the insight. I remember that even the "uncached" parts
> was effectively nacked by x86 maintainers (I might be wrong).

It wasn't liked by x86 maintainers, no.  Plus there's no
architecturally standard mechanism for making a page uncached and, as
the arm people pointed out, sometimes no way of ensuring it's never
cached.

>  For the other parts, the question is what we actually want to let
> user space configure.
> 
> Being able to specify "Very secure" "maximum secure" "average
> secure"  all doesn't really make sense to me.

Well, it doesn't to me either unless the user feels a cost/benefit, so
if max cost $100 per invocation and average cost nothing, most people
would chose average unless they had a very good reason not to.  In your
migratable model, if we had separate limits for non-migratable and
migratable, with non-migratable being set low to prevent exhaustion,
max secure becomes a highly scarce resource, whereas average secure is
abundant then having the choice might make sense.

>  The discussion regarding migratability only really popped up because
> this is a user-visible thing and not being able to migrate can be a
> real problem (fragmentation, ZONE_MOVABLE, ...).

I think the biggest use will potentially come from hardware
acceleration.  If it becomes simple to add say encryption to a secret
page with no cost, then no flag needed.  However, if we only have a
limited number of keys so once we run out no more encrypted memory then
it becomes a costly resource and users might want a choice of being
backed by encryption or not.

James
Michal Hocko Feb. 16, 2021, 4:51 p.m. UTC | #30
On Tue 16-02-21 08:25:39, James Bottomley wrote:
> On Mon, 2021-02-15 at 20:20 +0100, Michal Hocko wrote:
> [...]
> > > >   What kind of flags are we talking about and why would that be a
> > > > problem with memfd_create interface? Could you be more specific
> > > > please?
> > > 
> > > You mean what were the ioctl flags in the patch series linked
> > > above? They were SECRETMEM_EXCLUSIVE and SECRETMEM_UNCACHED in
> > > patch 3/5. 
> > 
> > OK I see. How many potential modes are we talking about? A few or
> > potentially many?
>  
> Well I initially thought there were two (uncached or not) until you
> came up with the migratable or non-migratable, which affects the
> security properties.  But now there's also potential for hardware
> backing, like mktme,  described by flags as well.

I do not remember details about mktme but from what I still recall it
had keys associated with direct maps. Is the key management something
that fits into flags management?

> I suppose you could
> also use RDT to restrict which cache the data goes into: say L1 but not
> L2 on to lessen the impact of fully uncached (although the big thrust
> of uncached was to blunt hyperthread side channels).  So there is
> potential for quite a large expansion even though I'd be willing to bet
> that a lot of the modes people have thought about turn out not to be
> very effective in the field.

Are those very HW specific features really viable through a generic
syscall? Don't get me wrong but I find it much more likely somebody will
want a hugetlb (pretty HW independent) without a direct map than a very
close to the HW caching mode soon.

But thanks for the clarification anyway.
David Hildenbrand Feb. 16, 2021, 5:16 p.m. UTC | #31
>>   For the other parts, the question is what we actually want to let
>> user space configure.
>>
>> Being able to specify "Very secure" "maximum secure" "average
>> secure"  all doesn't really make sense to me.
> 
> Well, it doesn't to me either unless the user feels a cost/benefit, so
> if max cost $100 per invocation and average cost nothing, most people
> would chose average unless they had a very good reason not to.  In your
> migratable model, if we had separate limits for non-migratable and
> migratable, with non-migratable being set low to prevent exhaustion,
> max secure becomes a highly scarce resource, whereas average secure is
> abundant then having the choice might make sense.

I hope that we can find a way to handle the migration part internally. 
Especially, because Mike wants the default to be "as secure as 
possible", so if there is a flag, it would have to be an opt-out flag.

I guess as long as we don't temporarily map it into the "owned" location 
in the direct map shared by all VCPUs we are in a good positon. But this 
needs more thought, of course.

> 
>>   The discussion regarding migratability only really popped up because
>> this is a user-visible thing and not being able to migrate can be a
>> real problem (fragmentation, ZONE_MOVABLE, ...).
> 
> I think the biggest use will potentially come from hardware
> acceleration.  If it becomes simple to add say encryption to a secret
> page with no cost, then no flag needed.  However, if we only have a
> limited number of keys so once we run out no more encrypted memory then
> it becomes a costly resource and users might want a choice of being
> backed by encryption or not.

Right. But wouldn't HW support with configurable keys etc. need more 
syscall parameters (meaning, even memefd_secret() as it is would not be 
sufficient?). I suspect the simplistic flag approach might not be 
sufficient. I might be wrong because I have no clue about MKTME and friends.

Anyhow, I still think extending memfd_create() might just be good enough 
- at least for now. Things like HW support might have requirements we 
don't even know yet and that we cannot even model in memfd_secret() 
right now.
James Bottomley Feb. 17, 2021, 4:19 p.m. UTC | #32
On Tue, 2021-02-16 at 18:16 +0100, David Hildenbrand wrote:
[...]
> > >   The discussion regarding migratability only really popped up
> > > because this is a user-visible thing and not being able to
> > > migrate can be a real problem (fragmentation, ZONE_MOVABLE, ...).
> > 
> > I think the biggest use will potentially come from hardware
> > acceleration.  If it becomes simple to add say encryption to a
> > secret page with no cost, then no flag needed.  However, if we only
> > have a limited number of keys so once we run out no more encrypted
> > memory then it becomes a costly resource and users might want a
> > choice of being backed by encryption or not.
> 
> Right. But wouldn't HW support with configurable keys etc. need more 
> syscall parameters (meaning, even memefd_secret() as it is would not
> be sufficient?). I suspect the simplistic flag approach might not
> be sufficient. I might be wrong because I have no clue about MKTME
> and friends.

The theory I was operating under is key management is automatic and
hidden, but key scarcity can't be, so if you flag requesting hardware
backing then you either get success (the kernel found a key) or failure
(the kernel is out of keys).  If we actually want to specify the key
then we need an extra argument and we *must* have a new system call.

> Anyhow, I still think extending memfd_create() might just be good
> enough - at least for now.

I really think this is the wrong approach for a user space ABI.  If we
think we'll ever need to move to a separate syscall, we should begin
with one.  The pain of trying to shift userspace from memfd_create to a
new syscall would be enormous.  It's not impossible (see clone3) but
it's a pain we should avoid if we know it's coming.

>  Things like HW support might have requirements we don't even know
> yet and that we cannot even model in memfd_secret() right now.

This is the annoying problem with our Linux unbreakable ABI policy: we
get to plan when the ABI is introduced for stuff we don't yet even know
about.

James
David Hildenbrand Feb. 22, 2021, 9:38 a.m. UTC | #33
On 17.02.21 17:19, James Bottomley wrote:
> On Tue, 2021-02-16 at 18:16 +0100, David Hildenbrand wrote:
> [...]
>>>>    The discussion regarding migratability only really popped up
>>>> because this is a user-visible thing and not being able to
>>>> migrate can be a real problem (fragmentation, ZONE_MOVABLE, ...).
>>>
>>> I think the biggest use will potentially come from hardware
>>> acceleration.  If it becomes simple to add say encryption to a
>>> secret page with no cost, then no flag needed.  However, if we only
>>> have a limited number of keys so once we run out no more encrypted
>>> memory then it becomes a costly resource and users might want a
>>> choice of being backed by encryption or not.
>>
>> Right. But wouldn't HW support with configurable keys etc. need more
>> syscall parameters (meaning, even memefd_secret() as it is would not
>> be sufficient?). I suspect the simplistic flag approach might not
>> be sufficient. I might be wrong because I have no clue about MKTME
>> and friends.
> 
> The theory I was operating under is key management is automatic and
> hidden, but key scarcity can't be, so if you flag requesting hardware
> backing then you either get success (the kernel found a key) or failure
> (the kernel is out of keys).  If we actually want to specify the key
> then we need an extra argument and we *must* have a new system call.
> 
>> Anyhow, I still think extending memfd_create() might just be good
>> enough - at least for now.
> 
> I really think this is the wrong approach for a user space ABI.  If we
> think we'll ever need to move to a separate syscall, we should begin
> with one.  The pain of trying to shift userspace from memfd_create to a
> new syscall would be enormous.  It's not impossible (see clone3) but
> it's a pain we should avoid if we know it's coming.

Sorry for the late reply, there is just too much going on :)

*If* we ever realize we need to pass more parameters we can easily have 
a new syscall for that purpose. *Then*, we know how that syscall will 
look like. Right now, it's just pure speculation.

Until then, going with memfd_create() works just fine IMHO.

The worst think that could happen is that we might not be able to create 
all fancy sectremem flavors in the future via memfd_create() but only 
via different, highly specialized syscall. I don't see a real problem 
with that.
David Hildenbrand Feb. 22, 2021, 10:50 a.m. UTC | #34
On 22.02.21 10:38, David Hildenbrand wrote:
> On 17.02.21 17:19, James Bottomley wrote:
>> On Tue, 2021-02-16 at 18:16 +0100, David Hildenbrand wrote:
>> [...]
>>>>>     The discussion regarding migratability only really popped up
>>>>> because this is a user-visible thing and not being able to
>>>>> migrate can be a real problem (fragmentation, ZONE_MOVABLE, ...).
>>>>
>>>> I think the biggest use will potentially come from hardware
>>>> acceleration.  If it becomes simple to add say encryption to a
>>>> secret page with no cost, then no flag needed.  However, if we only
>>>> have a limited number of keys so once we run out no more encrypted
>>>> memory then it becomes a costly resource and users might want a
>>>> choice of being backed by encryption or not.
>>>
>>> Right. But wouldn't HW support with configurable keys etc. need more
>>> syscall parameters (meaning, even memefd_secret() as it is would not
>>> be sufficient?). I suspect the simplistic flag approach might not
>>> be sufficient. I might be wrong because I have no clue about MKTME
>>> and friends.
>>
>> The theory I was operating under is key management is automatic and
>> hidden, but key scarcity can't be, so if you flag requesting hardware
>> backing then you either get success (the kernel found a key) or failure
>> (the kernel is out of keys).  If we actually want to specify the key
>> then we need an extra argument and we *must* have a new system call.
>>
>>> Anyhow, I still think extending memfd_create() might just be good
>>> enough - at least for now.
>>
>> I really think this is the wrong approach for a user space ABI.  If we
>> think we'll ever need to move to a separate syscall, we should begin
>> with one.  The pain of trying to shift userspace from memfd_create to a
>> new syscall would be enormous.  It's not impossible (see clone3) but
>> it's a pain we should avoid if we know it's coming.
> 
> Sorry for the late reply, there is just too much going on :)
> 
> *If* we ever realize we need to pass more parameters we can easily have
> a new syscall for that purpose. *Then*, we know how that syscall will
> look like. Right now, it's just pure speculation.
> 
> Until then, going with memfd_create() works just fine IMHO.
> 
> The worst think that could happen is that we might not be able to create
> all fancy sectremem flavors in the future via memfd_create() but only
> via different, highly specialized syscall. I don't see a real problem
> with that.
> 

Adding to that, I'll give up arguing now as I have more important things 
to do. It has been questioned by various people why we need a dedicate 
syscall and at least for me, without a satisfying answer.

Worst thing is that we end up with a syscall that could have been 
avoided, for example, because
1. We add existing/future memfd_create() flags to memfd_secret() as well 
when we need them (sealing, hugetlb., ..).
2. We decide in the future to still add MFD_SECRET support to 
memfd_secret().

So be it.
diff mbox series

Patch

diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
new file mode 100644
index 000000000000..70e7db9f94fe
--- /dev/null
+++ b/include/linux/secretmem.h
@@ -0,0 +1,24 @@ 
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _LINUX_SECRETMEM_H
+#define _LINUX_SECRETMEM_H
+
+#ifdef CONFIG_SECRETMEM
+
+bool vma_is_secretmem(struct vm_area_struct *vma);
+bool page_is_secretmem(struct page *page);
+
+#else
+
+static inline bool vma_is_secretmem(struct vm_area_struct *vma)
+{
+	return false;
+}
+
+static inline bool page_is_secretmem(struct page *page)
+{
+	return false;
+}
+
+#endif /* CONFIG_SECRETMEM */
+
+#endif /* _LINUX_SECRETMEM_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index f3956fc11de6..35687dcb1a42 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -97,5 +97,6 @@ 
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define Z3FOLD_MAGIC		0x33
 #define PPC_CMM_MAGIC		0xc7571590
+#define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 19aa806890d5..e9a2011ee4a2 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -352,6 +352,8 @@  COND_SYSCALL(pkey_mprotect);
 COND_SYSCALL(pkey_alloc);
 COND_SYSCALL(pkey_free);
 
+/* memfd_secret */
+COND_SYSCALL(memfd_secret);
 
 /*
  * Architecture specific weak syscall entries.
diff --git a/mm/Kconfig b/mm/Kconfig
index 24c045b24b95..5f8243442f66 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -872,4 +872,7 @@  config MAPPING_DIRTY_HELPERS
 config KMAP_LOCAL
 	bool
 
+config SECRETMEM
+	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
+
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 72227b24a616..b2a564eec27f 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -120,3 +120,4 @@  obj-$(CONFIG_MEMFD_CREATE) += memfd.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
+obj-$(CONFIG_SECRETMEM) += secretmem.o
diff --git a/mm/gup.c b/mm/gup.c
index e4c224cd9661..3e086b073624 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -10,6 +10,7 @@ 
 #include <linux/rmap.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/secretmem.h>
 
 #include <linux/sched/signal.h>
 #include <linux/rwsem.h>
@@ -759,6 +760,9 @@  struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 	struct follow_page_context ctx = { NULL };
 	struct page *page;
 
+	if (vma_is_secretmem(vma))
+		return NULL;
+
 	page = follow_page_mask(vma, address, foll_flags, &ctx);
 	if (ctx.pgmap)
 		put_dev_pagemap(ctx.pgmap);
@@ -892,6 +896,9 @@  static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
 		return -EOPNOTSUPP;
 
+	if (vma_is_secretmem(vma))
+		return -EFAULT;
+
 	if (write) {
 		if (!(vm_flags & VM_WRITE)) {
 			if (!(gup_flags & FOLL_FORCE))
@@ -2031,6 +2038,9 @@  static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
 
+		if (page_is_secretmem(page))
+			goto pte_unmap;
+
 		head = try_grab_compound_head(page, 1, flags);
 		if (!head)
 			goto pte_unmap;
diff --git a/mm/mlock.c b/mm/mlock.c
index 73960bb3464d..127e72dcac3d 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -23,6 +23,7 @@ 
 #include <linux/hugetlb.h>
 #include <linux/memcontrol.h>
 #include <linux/mm_inline.h>
+#include <linux/secretmem.h>
 
 #include "internal.h"
 
@@ -503,7 +504,7 @@  static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 
 	if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
 	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) ||
-	    vma_is_dax(vma))
+	    vma_is_dax(vma) || vma_is_secretmem(vma))
 		/* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */
 		goto out;
 
diff --git a/mm/secretmem.c b/mm/secretmem.c
new file mode 100644
index 000000000000..fa6738e860c2
--- /dev/null
+++ b/mm/secretmem.c
@@ -0,0 +1,246 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright IBM Corporation, 2021
+ *
+ * Author: Mike Rapoport <rppt@linux.ibm.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/swap.h>
+#include <linux/mount.h>
+#include <linux/memfd.h>
+#include <linux/bitops.h>
+#include <linux/printk.h>
+#include <linux/pagemap.h>
+#include <linux/syscalls.h>
+#include <linux/pseudo_fs.h>
+#include <linux/secretmem.h>
+#include <linux/set_memory.h>
+#include <linux/sched/signal.h>
+
+#include <uapi/linux/magic.h>
+
+#include <asm/tlbflush.h>
+
+#include "internal.h"
+
+#undef pr_fmt
+#define pr_fmt(fmt) "secretmem: " fmt
+
+/*
+ * Define mode and flag masks to allow validation of the system call
+ * parameters.
+ */
+#define SECRETMEM_MODE_MASK	(0x0)
+#define SECRETMEM_FLAGS_MASK	SECRETMEM_MODE_MASK
+
+static bool secretmem_enable __ro_after_init;
+module_param_named(enable, secretmem_enable, bool, 0400);
+MODULE_PARM_DESC(secretmem_enable,
+		 "Enable secretmem and memfd_secret(2) system call");
+
+static vm_fault_t secretmem_fault(struct vm_fault *vmf)
+{
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	pgoff_t offset = vmf->pgoff;
+	gfp_t gfp = vmf->gfp_mask;
+	unsigned long addr;
+	struct page *page;
+	int err;
+
+	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
+		return vmf_error(-EINVAL);
+
+retry:
+	page = find_lock_page(mapping, offset);
+	if (!page) {
+		page = alloc_page(gfp | __GFP_ZERO);
+		if (!page)
+			return VM_FAULT_OOM;
+
+		err = set_direct_map_invalid_noflush(page, 1);
+		if (err) {
+			put_page(page);
+			return vmf_error(err);
+		}
+
+		__SetPageUptodate(page);
+		err = add_to_page_cache_lru(page, mapping, offset, gfp);
+		if (unlikely(err)) {
+			put_page(page);
+			/*
+			 * If a split of large page was required, it
+			 * already happened when we marked the page invalid
+			 * which guarantees that this call won't fail
+			 */
+			set_direct_map_default_noflush(page, 1);
+			if (err == -EEXIST)
+				goto retry;
+
+			return vmf_error(err);
+		}
+
+		addr = (unsigned long)page_address(page);
+		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+	}
+
+	vmf->page = page;
+	return VM_FAULT_LOCKED;
+}
+
+static const struct vm_operations_struct secretmem_vm_ops = {
+	.fault = secretmem_fault,
+};
+
+static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	unsigned long len = vma->vm_end - vma->vm_start;
+
+	if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
+		return -EINVAL;
+
+	if (mlock_future_check(vma->vm_mm, vma->vm_flags | VM_LOCKED, len))
+		return -EAGAIN;
+
+	vma->vm_flags |= VM_LOCKED | VM_DONTDUMP;
+	vma->vm_ops = &secretmem_vm_ops;
+
+	return 0;
+}
+
+bool vma_is_secretmem(struct vm_area_struct *vma)
+{
+	return vma->vm_ops == &secretmem_vm_ops;
+}
+
+static const struct file_operations secretmem_fops = {
+	.mmap		= secretmem_mmap,
+};
+
+static bool secretmem_isolate_page(struct page *page, isolate_mode_t mode)
+{
+	return false;
+}
+
+static int secretmem_migratepage(struct address_space *mapping,
+				 struct page *newpage, struct page *page,
+				 enum migrate_mode mode)
+{
+	return -EBUSY;
+}
+
+static void secretmem_freepage(struct page *page)
+{
+	set_direct_map_default_noflush(page, 1);
+	clear_highpage(page);
+}
+
+static const struct address_space_operations secretmem_aops = {
+	.freepage	= secretmem_freepage,
+	.migratepage	= secretmem_migratepage,
+	.isolate_page	= secretmem_isolate_page,
+};
+
+bool page_is_secretmem(struct page *page)
+{
+	struct address_space *mapping = page_mapping(page);
+
+	if (!mapping)
+		return false;
+
+	return mapping->a_ops == &secretmem_aops;
+}
+
+static struct vfsmount *secretmem_mnt;
+
+static struct file *secretmem_file_create(unsigned long flags)
+{
+	struct file *file = ERR_PTR(-ENOMEM);
+	struct inode *inode;
+
+	inode = alloc_anon_inode(secretmem_mnt->mnt_sb);
+	if (IS_ERR(inode))
+		return ERR_CAST(inode);
+
+	file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
+				 O_RDWR, &secretmem_fops);
+	if (IS_ERR(file))
+		goto err_free_inode;
+
+	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+	mapping_set_unevictable(inode->i_mapping);
+
+	inode->i_mapping->a_ops = &secretmem_aops;
+
+	/* pretend we are a normal file with zero size */
+	inode->i_mode |= S_IFREG;
+	inode->i_size = 0;
+
+	return file;
+
+err_free_inode:
+	iput(inode);
+	return file;
+}
+
+SYSCALL_DEFINE1(memfd_secret, unsigned long, flags)
+{
+	struct file *file;
+	int fd, err;
+
+	/* make sure local flags do not confict with global fcntl.h */
+	BUILD_BUG_ON(SECRETMEM_FLAGS_MASK & O_CLOEXEC);
+
+	if (!secretmem_enable)
+		return -ENOSYS;
+
+	if (flags & ~(SECRETMEM_FLAGS_MASK | O_CLOEXEC))
+		return -EINVAL;
+
+	fd = get_unused_fd_flags(flags & O_CLOEXEC);
+	if (fd < 0)
+		return fd;
+
+	file = secretmem_file_create(flags);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		goto err_put_fd;
+	}
+
+	file->f_flags |= O_LARGEFILE;
+
+	fd_install(fd, file);
+	return fd;
+
+err_put_fd:
+	put_unused_fd(fd);
+	return err;
+}
+
+static int secretmem_init_fs_context(struct fs_context *fc)
+{
+	return init_pseudo(fc, SECRETMEM_MAGIC) ? 0 : -ENOMEM;
+}
+
+static struct file_system_type secretmem_fs = {
+	.name		= "secretmem",
+	.init_fs_context = secretmem_init_fs_context,
+	.kill_sb	= kill_anon_super,
+};
+
+static int secretmem_init(void)
+{
+	int ret = 0;
+
+	if (!secretmem_enable)
+		return ret;
+
+	secretmem_mnt = kern_mount(&secretmem_fs);
+	if (IS_ERR(secretmem_mnt))
+		ret = PTR_ERR(secretmem_mnt);
+
+	return ret;
+}
+fs_initcall(secretmem_init);