diff mbox series

[v6,5/6] mm: secretmem: use PMD-size pages to amortize direct map fragmentation

Message ID 20200924132904.1391-6-rppt@kernel.org (mailing list archive)
State New, archived
Headers show
Series mm: introduce memfd_secret system call to create "secret" memory areas | expand

Commit Message

Mike Rapoport Sept. 24, 2020, 1:29 p.m. UTC
From: Mike Rapoport <rppt@linux.ibm.com>

Removing a PAGE_SIZE page from the direct map every time such page is
allocated for a secret memory mapping will cause severe fragmentation of
the direct map. This fragmentation can be reduced by using PMD-size pages
as a pool for small pages for secret memory mappings.

Add a gen_pool per secretmem inode and lazily populate this pool with
PMD-size pages.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
---
 mm/secretmem.c | 107 ++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 88 insertions(+), 19 deletions(-)

Comments

Peter Zijlstra Sept. 25, 2020, 7:41 a.m. UTC | #1
On Thu, Sep 24, 2020 at 04:29:03PM +0300, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> Removing a PAGE_SIZE page from the direct map every time such page is
> allocated for a secret memory mapping will cause severe fragmentation of
> the direct map. This fragmentation can be reduced by using PMD-size pages
> as a pool for small pages for secret memory mappings.
> 
> Add a gen_pool per secretmem inode and lazily populate this pool with
> PMD-size pages.

What's the actual efficacy of this? Since the pmd is per inode, all I
need is a lot of inodes and we're in business to destroy the directmap,
no?

Afaict there's no privs needed to use this, all a process needs is to
stay below the mlock limit, so a 'fork-bomb' that maps a single secret
page will utterly destroy the direct map.

I really don't like this, at all.

IIRC Kirill looked at merging the directmap. I think he ran into
performance issues there, but we really need something like that before
something like this lands.
David Hildenbrand Sept. 25, 2020, 9 a.m. UTC | #2
On 25.09.20 09:41, Peter Zijlstra wrote:
> On Thu, Sep 24, 2020 at 04:29:03PM +0300, Mike Rapoport wrote:
>> From: Mike Rapoport <rppt@linux.ibm.com>
>>
>> Removing a PAGE_SIZE page from the direct map every time such page is
>> allocated for a secret memory mapping will cause severe fragmentation of
>> the direct map. This fragmentation can be reduced by using PMD-size pages
>> as a pool for small pages for secret memory mappings.
>>
>> Add a gen_pool per secretmem inode and lazily populate this pool with
>> PMD-size pages.
> 
> What's the actual efficacy of this? Since the pmd is per inode, all I
> need is a lot of inodes and we're in business to destroy the directmap,
> no?
> 
> Afaict there's no privs needed to use this, all a process needs is to
> stay below the mlock limit, so a 'fork-bomb' that maps a single secret
> page will utterly destroy the direct map.
> 
> I really don't like this, at all.

As I expressed earlier, I would prefer allowing allocation of secretmem
only from a previously defined CMA area. This would physically locally
limit the pain.

But my suggestion was not well received :)
Peter Zijlstra Sept. 25, 2020, 9:50 a.m. UTC | #3
On Fri, Sep 25, 2020 at 11:00:30AM +0200, David Hildenbrand wrote:
> On 25.09.20 09:41, Peter Zijlstra wrote:
> > On Thu, Sep 24, 2020 at 04:29:03PM +0300, Mike Rapoport wrote:
> >> From: Mike Rapoport <rppt@linux.ibm.com>
> >>
> >> Removing a PAGE_SIZE page from the direct map every time such page is
> >> allocated for a secret memory mapping will cause severe fragmentation of
> >> the direct map. This fragmentation can be reduced by using PMD-size pages
> >> as a pool for small pages for secret memory mappings.
> >>
> >> Add a gen_pool per secretmem inode and lazily populate this pool with
> >> PMD-size pages.
> > 
> > What's the actual efficacy of this? Since the pmd is per inode, all I
> > need is a lot of inodes and we're in business to destroy the directmap,
> > no?
> > 
> > Afaict there's no privs needed to use this, all a process needs is to
> > stay below the mlock limit, so a 'fork-bomb' that maps a single secret
> > page will utterly destroy the direct map.
> > 
> > I really don't like this, at all.
> 
> As I expressed earlier, I would prefer allowing allocation of secretmem
> only from a previously defined CMA area. This would physically locally
> limit the pain.

Given that this thing doesn't have a migrate hook, that seems like an
eminently reasonable contraint. Because not only will it mess up the
directmap, it will also destroy the ability of the page-allocator /
compaction to re-form high order blocks by sprinkling holes throughout.

Also, this is all very close to XPFO, yet I don't see that mentioned
anywhere.

Further still, it has this HAVE_SECRETMEM_UNCACHED nonsense which is
completely unused. I'm not at all sure exposing UNCACHED to random
userspace is a sane idea.
Mark Rutland Sept. 25, 2020, 10:31 a.m. UTC | #4
Hi,

Sorry to come to this so late; I've been meaning to provide feedback on
this for a while but have been indisposed for a bit due to an injury.

On Fri, Sep 25, 2020 at 11:50:29AM +0200, Peter Zijlstra wrote:
> On Fri, Sep 25, 2020 at 11:00:30AM +0200, David Hildenbrand wrote:
> > On 25.09.20 09:41, Peter Zijlstra wrote:
> > > On Thu, Sep 24, 2020 at 04:29:03PM +0300, Mike Rapoport wrote:
> > >> From: Mike Rapoport <rppt@linux.ibm.com>
> > >>
> > >> Removing a PAGE_SIZE page from the direct map every time such page is
> > >> allocated for a secret memory mapping will cause severe fragmentation of
> > >> the direct map. This fragmentation can be reduced by using PMD-size pages
> > >> as a pool for small pages for secret memory mappings.
> > >>
> > >> Add a gen_pool per secretmem inode and lazily populate this pool with
> > >> PMD-size pages.
> > > 
> > > What's the actual efficacy of this? Since the pmd is per inode, all I
> > > need is a lot of inodes and we're in business to destroy the directmap,
> > > no?
> > > 
> > > Afaict there's no privs needed to use this, all a process needs is to
> > > stay below the mlock limit, so a 'fork-bomb' that maps a single secret
> > > page will utterly destroy the direct map.
> > > 
> > > I really don't like this, at all.
> > 
> > As I expressed earlier, I would prefer allowing allocation of secretmem
> > only from a previously defined CMA area. This would physically locally
> > limit the pain.
> 
> Given that this thing doesn't have a migrate hook, that seems like an
> eminently reasonable contraint. Because not only will it mess up the
> directmap, it will also destroy the ability of the page-allocator /
> compaction to re-form high order blocks by sprinkling holes throughout.
> 
> Also, this is all very close to XPFO, yet I don't see that mentioned
> anywhere.

Agreed. I think if we really need something like this, something between
XPFO and DEBUG_PAGEALLOC would be generally better, since:

* Secretmem puts userspace in charge of kernel internals (AFAICT without
  any ulimits?), so that seems like an avenue for malicious or buggy
  userspace to exploit and trigger DoS, etc. The other approaches leave
  the kernel in charge at all times, and it's a system-level choice
  which is easier to reason about and test.

* Secretmem interaction with existing ABIs is unclear. Should uaccess
  primitives work for secretmem? If so, this means that it's not valid
  to transform direct uaccesses in syscalls etc into accesses via the
  linear/direct map. If not, how do we prevent syscalls? The other
  approaches are clear that this should always work, but the kernel
  should avoid mappings wherever possible.

* The uncached option doesn't work in a number of situations, such as
  systems which are purely cache coherent at all times, or where the
  hypervisor has overridden attributes. The kernel cannot even know that
  whther this works as intended. On its own this doens't solve a
  particular problem, and I think this is a solution looking for a
  problem.

... and fundamentally, this seems like a "more security, please" option
that is going to be abused, since everyone wants security, regardless of
how we say it *should* be used. The few use-cases that may make sense
(e.g. protection of ketys and/or crypto secrrets), aren't going to be
able to rely on this (since e.g. other uses may depelete memory pools),
so this is going to be best-effort. With all that in mind, I struggle to
beleive that this is going to be worth the maintenance cost (e.g. with
any issues arising from uaccess, IO, etc).

Overall, I would prefer to not see this syscall in the kernel.

> Further still, it has this HAVE_SECRETMEM_UNCACHED nonsense which is
> completely unused. I'm not at all sure exposing UNCACHED to random
> userspace is a sane idea.

I agree the uncached stuff should be removed. It is at best misleading
since the kernel can't guarantee it does what it says, I think it's
liable to lead to issues in future (e.g. since it can cause memory
operations to raise different exceptions relative to what they can
today), and as above it seems like a solution looking for a problem.

Thanks,
Mark.
Tycho Andersen Sept. 25, 2020, 2:57 p.m. UTC | #5
On Fri, Sep 25, 2020 at 11:31:14AM +0100, Mark Rutland wrote:
> Hi,
> 
> Sorry to come to this so late; I've been meaning to provide feedback on
> this for a while but have been indisposed for a bit due to an injury.
> 
> On Fri, Sep 25, 2020 at 11:50:29AM +0200, Peter Zijlstra wrote:
> > On Fri, Sep 25, 2020 at 11:00:30AM +0200, David Hildenbrand wrote:
> > > On 25.09.20 09:41, Peter Zijlstra wrote:
> > > > On Thu, Sep 24, 2020 at 04:29:03PM +0300, Mike Rapoport wrote:
> > > >> From: Mike Rapoport <rppt@linux.ibm.com>
> > > >>
> > > >> Removing a PAGE_SIZE page from the direct map every time such page is
> > > >> allocated for a secret memory mapping will cause severe fragmentation of
> > > >> the direct map. This fragmentation can be reduced by using PMD-size pages
> > > >> as a pool for small pages for secret memory mappings.
> > > >>
> > > >> Add a gen_pool per secretmem inode and lazily populate this pool with
> > > >> PMD-size pages.
> > > > 
> > > > What's the actual efficacy of this? Since the pmd is per inode, all I
> > > > need is a lot of inodes and we're in business to destroy the directmap,
> > > > no?
> > > > 
> > > > Afaict there's no privs needed to use this, all a process needs is to
> > > > stay below the mlock limit, so a 'fork-bomb' that maps a single secret
> > > > page will utterly destroy the direct map.
> > > > 
> > > > I really don't like this, at all.
> > > 
> > > As I expressed earlier, I would prefer allowing allocation of secretmem
> > > only from a previously defined CMA area. This would physically locally
> > > limit the pain.
> > 
> > Given that this thing doesn't have a migrate hook, that seems like an
> > eminently reasonable contraint. Because not only will it mess up the
> > directmap, it will also destroy the ability of the page-allocator /
> > compaction to re-form high order blocks by sprinkling holes throughout.
> > 
> > Also, this is all very close to XPFO, yet I don't see that mentioned
> > anywhere.
> 
> Agreed. I think if we really need something like this, something between
> XPFO and DEBUG_PAGEALLOC would be generally better, since:

Perhaps we can brainstorm on this? XPFO has mostly been abandoned
because there's no good/safe way to make it faster. There was work on
eliminating TLB flushes, but that waters down the protection. When I
was last thinking about it in anger, it just seemed like it was
destined to be slow, especially on $large_num_cores machines, since
you have to flush everyone else's map too.

I think the idea of "opt in to XPFO" is mostly attractive because then
people only have to pay the slowness cost for memory they really care
about. But if there's some way to make XPFO, or some alternative
design, that may be better.

Tycho
Mike Rapoport Sept. 29, 2020, 1:05 p.m. UTC | #6
On Fri, Sep 25, 2020 at 09:41:25AM +0200, Peter Zijlstra wrote:
> On Thu, Sep 24, 2020 at 04:29:03PM +0300, Mike Rapoport wrote:
> > From: Mike Rapoport <rppt@linux.ibm.com>
> > 
> > Removing a PAGE_SIZE page from the direct map every time such page is
> > allocated for a secret memory mapping will cause severe fragmentation of
> > the direct map. This fragmentation can be reduced by using PMD-size pages
> > as a pool for small pages for secret memory mappings.
> > 
> > Add a gen_pool per secretmem inode and lazily populate this pool with
> > PMD-size pages.
> 
> What's the actual efficacy of this? Since the pmd is per inode, all I
> need is a lot of inodes and we're in business to destroy the directmap,
> no?
> 
> Afaict there's no privs needed to use this, all a process needs is to
> stay below the mlock limit, so a 'fork-bomb' that maps a single secret
> page will utterly destroy the direct map.

This indeed will cause 1G pages in the direct map to be split into 2M
chunks, but I disagree with 'destroy' term here. Citing the cover letter
of an earlier version of this series:

  I've tried to find some numbers that show the benefit of using larger
  pages in the direct map, but I couldn't find anything so I've run a
  couple of benchmarks from phoronix-test-suite on my laptop (i7-8650U
  with 32G RAM).
  
  I've tested three variants: the default with 28G of the physical
  memory covered with 1G pages, then I disabled 1G pages using
  "nogbpages" in the kernel command line and at last I've forced the
  entire direct map to use 4K pages using a simple patch to
  arch/x86/mm/init.c.  I've made runs of the benchmarks with SSD and
  tmpfs.
  
  Surprisingly, the results does not show huge advantage for large
  pages. For instance, here the results for kernel build with
  'make -j8', in seconds:
  
                        |  1G    |  2M    |  4K
  ----------------------+--------+--------+---------
  ssd, mitigations=on	| 308.75 | 317.37 | 314.9
  ssd, mitigations=off	| 305.25 | 295.32 | 304.92
  ram, mitigations=on	| 301.58 | 322.49 | 306.54
  ram, mitigations=off	| 299.32 | 288.44 | 310.65
  
  All the results I have are available here:
 
  https://docs.google.com/spreadsheets/d/1tdD-cu8e93vnfGsTFxZ5YdaEfs2E1GELlvWNOGkJV2U/edit?usp=sharing

The numbers suggest that using smaller pages in the direct map does not
necessarily leads to performance degradation and some runs produced
better results with smaller pages in the direct map.

> I really don't like this, at all.
> 
> IIRC Kirill looked at merging the directmap. I think he ran into
> performance issues there, but we really need something like that before
> something like this lands.
Mike Rapoport Sept. 29, 2020, 1:06 p.m. UTC | #7
On Fri, Sep 25, 2020 at 11:00:30AM +0200, David Hildenbrand wrote:
> On 25.09.20 09:41, Peter Zijlstra wrote:
> > On Thu, Sep 24, 2020 at 04:29:03PM +0300, Mike Rapoport wrote:
> >> From: Mike Rapoport <rppt@linux.ibm.com>
> >>
> >> Removing a PAGE_SIZE page from the direct map every time such page is
> >> allocated for a secret memory mapping will cause severe fragmentation of
> >> the direct map. This fragmentation can be reduced by using PMD-size pages
> >> as a pool for small pages for secret memory mappings.
> >>
> >> Add a gen_pool per secretmem inode and lazily populate this pool with
> >> PMD-size pages.
> > 
> > What's the actual efficacy of this? Since the pmd is per inode, all I
> > need is a lot of inodes and we're in business to destroy the directmap,
> > no?
> > 
> > Afaict there's no privs needed to use this, all a process needs is to
> > stay below the mlock limit, so a 'fork-bomb' that maps a single secret
> > page will utterly destroy the direct map.
> > 
> > I really don't like this, at all.
> 
> As I expressed earlier, I would prefer allowing allocation of secretmem
> only from a previously defined CMA area. This would physically locally
> limit the pain.

The prevois version contained a patch that allowed reserving a memory
pool for the secretmem at boot time to avpoid splitting pages from the
direct map

> But my suggestion was not well received :)

The disagreemet was only whether to use CMA or simple boot time
reservation :-P

> -- 
> Thanks,
> 
> David / dhildenb
>
Mike Rapoport Sept. 29, 2020, 1:07 p.m. UTC | #8
On Fri, Sep 25, 2020 at 11:50:29AM +0200, Peter Zijlstra wrote:
> On Fri, Sep 25, 2020 at 11:00:30AM +0200, David Hildenbrand wrote:
> > On 25.09.20 09:41, Peter Zijlstra wrote:
> > > On Thu, Sep 24, 2020 at 04:29:03PM +0300, Mike Rapoport wrote:
> > >> From: Mike Rapoport <rppt@linux.ibm.com>
> > >>
> > >> Removing a PAGE_SIZE page from the direct map every time such page is
> > >> allocated for a secret memory mapping will cause severe fragmentation of
> > >> the direct map. This fragmentation can be reduced by using PMD-size pages
> > >> as a pool for small pages for secret memory mappings.
> > >>
> > >> Add a gen_pool per secretmem inode and lazily populate this pool with
> > >> PMD-size pages.
> > > 
> > > What's the actual efficacy of this? Since the pmd is per inode, all I
> > > need is a lot of inodes and we're in business to destroy the directmap,
> > > no?
> > > 
> > > Afaict there's no privs needed to use this, all a process needs is to
> > > stay below the mlock limit, so a 'fork-bomb' that maps a single secret
> > > page will utterly destroy the direct map.
> > > 
> > > I really don't like this, at all.
> > 
> > As I expressed earlier, I would prefer allowing allocation of secretmem
> > only from a previously defined CMA area. This would physically locally
> > limit the pain.
> 
> Given that this thing doesn't have a migrate hook, that seems like an
> eminently reasonable contraint. Because not only will it mess up the
> directmap, it will also destroy the ability of the page-allocator /
> compaction to re-form high order blocks by sprinkling holes throughout.
> 
> Also, this is all very close to XPFO, yet I don't see that mentioned
> anywhere.

It's close to XPFO in the sense it removes pages from the kernel page
table. But unlike XPFO memfd_secret() does not mean allowing access to
these pages in the kernel until they are freed by the user. And, unlike
XPFO, it does not require TLB flushing all over the place.

> Further still, it has this HAVE_SECRETMEM_UNCACHED nonsense which is
> completely unused. I'm not at all sure exposing UNCACHED to random
> userspace is a sane idea.

The uncached mappings were originally proposed as a mean "... to prevent
or considerably restrict speculation on such pages" [1] as a comment to
my initial proposal to use mmap(MAP_EXCLUSIVE).

I've added the ability to create uncached mappings into the fd-based
implementation of the exclusive mappings as it is indeed can reduce
availability of side channels and the implementation was quite straight
forward.

[1] https://lore.kernel.org/linux-mm/2236FBA76BA1254E88B949DDB74E612BA4EEC0CE@IRSMSX102.ger.corp.intel.com/
Mike Rapoport Sept. 29, 2020, 2:04 p.m. UTC | #9
On Fri, Sep 25, 2020 at 11:31:14AM +0100, Mark Rutland wrote:
> Hi,
> 
> Agreed. I think if we really need something like this, something between
> XPFO and DEBUG_PAGEALLOC would be generally better, since:
> 
> * Secretmem puts userspace in charge of kernel internals (AFAICT without
>   any ulimits?), so that seems like an avenue for malicious or buggy
>   userspace to exploit and trigger DoS, etc. The other approaches leave
>   the kernel in charge at all times, and it's a system-level choice
>   which is easier to reason about and test.

Secretmem obeys RLIMIT_MLOCK.
I don't see why it "puts userpspace in charge of kernel internals" more
than other system calls. The fact that memory is dropped from
linear/direct mapping does not make userspace in charge of the kernel
internals. The fact that this is not system-level actually makes it more
controllable and tunable, IMHO.

> * Secretmem interaction with existing ABIs is unclear. Should uaccess
>   primitives work for secretmem? If so, this means that it's not valid
>   to transform direct uaccesses in syscalls etc into accesses via the
>   linear/direct map. If not, how do we prevent syscalls? The other
>   approaches are clear that this should always work, but the kernel
>   should avoid mappings wherever possible.

Our idea was that direct uaccess in the context of the process that owns
the secretmem should work and that transforming the direct uaccesses
into accesses via the linear map would be valid only when allowed
explicitly. E.g with addition of FOLL_SOMETHING to gup.
Yet, this would be required for any implementation of memory areas that
excludes pages from the linear mapping.

> * The uncached option doesn't work in a number of situations, such as
>   systems which are purely cache coherent at all times, or where the
>   hypervisor has overridden attributes. The kernel cannot even know that
>   whther this works as intended. On its own this doens't solve a
>   particular problem, and I think this is a solution looking for a
>   problem.

As we discussed at one of the previous iterations, the uncached makes
sense for x86 to reduce availability of side channels and I've only
enabled uncached mappings on x86.

> ... and fundamentally, this seems like a "more security, please" option
> that is going to be abused, since everyone wants security, regardless of
> how we say it *should* be used. The few use-cases that may make sense
> (e.g. protection of ketys and/or crypto secrrets), aren't going to be
> able to rely on this (since e.g. other uses may depelete memory pools),
> so this is going to be best-effort. With all that in mind, I struggle to
> beleive that this is going to be worth the maintenance cost (e.g. with
> any issues arising from uaccess, IO, etc).

I think that making secretmem a file descriptor that only allows mmap()
already makes it quite self contained and simple. There could be several
cases that will need special treatment, but I don't think it will have
large maintenance cost.
I've run syzkaller for some time with memfd_secret() enabled and I never
hit a crash because of it.

> Thanks,
> Mark.
Peter Zijlstra Sept. 29, 2020, 2:12 p.m. UTC | #10
On Tue, Sep 29, 2020 at 04:05:29PM +0300, Mike Rapoport wrote:
> On Fri, Sep 25, 2020 at 09:41:25AM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 24, 2020 at 04:29:03PM +0300, Mike Rapoport wrote:
> > > From: Mike Rapoport <rppt@linux.ibm.com>
> > > 
> > > Removing a PAGE_SIZE page from the direct map every time such page is
> > > allocated for a secret memory mapping will cause severe fragmentation of
> > > the direct map. This fragmentation can be reduced by using PMD-size pages
> > > as a pool for small pages for secret memory mappings.
> > > 
> > > Add a gen_pool per secretmem inode and lazily populate this pool with
> > > PMD-size pages.
> > 
> > What's the actual efficacy of this? Since the pmd is per inode, all I
> > need is a lot of inodes and we're in business to destroy the directmap,
> > no?
> > 
> > Afaict there's no privs needed to use this, all a process needs is to
> > stay below the mlock limit, so a 'fork-bomb' that maps a single secret
> > page will utterly destroy the direct map.
> 
> This indeed will cause 1G pages in the direct map to be split into 2M
> chunks, but I disagree with 'destroy' term here. Citing the cover letter
> of an earlier version of this series:

It will drop them down to 4k pages. Given enough inodes, and allocating
only a single sekrit page per pmd, we'll shatter the directmap into 4k.

>   I've tried to find some numbers that show the benefit of using larger
>   pages in the direct map, but I couldn't find anything so I've run a
>   couple of benchmarks from phoronix-test-suite on my laptop (i7-8650U
>   with 32G RAM).

Existing benchmarks suck at this, but FB had a load that had a
deterministic enough performance regression to bisect to a directmap
issue, fixed by:

  7af0145067bc ("x86/mm/cpa: Prevent large page split when ftrace flips RW on kernel text")

>   I've tested three variants: the default with 28G of the physical
>   memory covered with 1G pages, then I disabled 1G pages using
>   "nogbpages" in the kernel command line and at last I've forced the
>   entire direct map to use 4K pages using a simple patch to
>   arch/x86/mm/init.c.  I've made runs of the benchmarks with SSD and
>   tmpfs.
>   
>   Surprisingly, the results does not show huge advantage for large
>   pages. For instance, here the results for kernel build with
>   'make -j8', in seconds:

Your benchmark should stress the TLB of your uarch, such that additional
pressure added by the shattered directmap shows up.

And no, I don't have one either.

>                         |  1G    |  2M    |  4K
>   ----------------------+--------+--------+---------
>   ssd, mitigations=on	| 308.75 | 317.37 | 314.9
>   ssd, mitigations=off	| 305.25 | 295.32 | 304.92
>   ram, mitigations=on	| 301.58 | 322.49 | 306.54
>   ram, mitigations=off	| 299.32 | 288.44 | 310.65

These results lack error data, but assuming the reults are significant,
then this very much makes a case for 1G mappings. 5s on a kernel builds
is pretty good.
Dave Hansen Sept. 29, 2020, 2:31 p.m. UTC | #11
On 9/29/20 7:12 AM, Peter Zijlstra wrote:
>>                              |  1G    |  2M    |  4K
>>        ----------------------+--------+--------+---------
>>   ssd, mitigations=on	| 308.75 | 317.37 | 314.9
>>   ssd, mitigations=off	| 305.25 | 295.32 | 304.92
>>   ram, mitigations=on	| 301.58 | 322.49 | 306.54
>>   ram, mitigations=off	| 299.32 | 288.44 | 310.65
> These results lack error data, but assuming the reults are significant,
> then this very much makes a case for 1G mappings. 5s on a kernel builds
> is pretty good.

Is something like secretmem all or nothing?

This seems like a similar situation to the side-channel mitigations.  We
know what the most "secure" thing to do is.  But, folks also disagree
about how much pain that security is worth.

That seems to indicate we're never going to come up with a
one-size-fits-all solution to this.  Apps are going to have to live
without secretmem being around if they want to run on old kernels
anyway, so it seems like something we should be able to enable or
disable without ABI concerns.

Do we just include it, but disable it by default so it doesn't eat
performance?  But, allow it to be reenabled by the folks who generally
prioritize hardening over performance, like Chromebooks for instance.
Mike Rapoport Sept. 29, 2020, 2:58 p.m. UTC | #12
On Tue, Sep 29, 2020 at 04:12:16PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 29, 2020 at 04:05:29PM +0300, Mike Rapoport wrote:
> > On Fri, Sep 25, 2020 at 09:41:25AM +0200, Peter Zijlstra wrote:
> > > On Thu, Sep 24, 2020 at 04:29:03PM +0300, Mike Rapoport wrote:
> > > > From: Mike Rapoport <rppt@linux.ibm.com>
> > > > 
> > > > Removing a PAGE_SIZE page from the direct map every time such page is
> > > > allocated for a secret memory mapping will cause severe fragmentation of
> > > > the direct map. This fragmentation can be reduced by using PMD-size pages
> > > > as a pool for small pages for secret memory mappings.
> > > > 
> > > > Add a gen_pool per secretmem inode and lazily populate this pool with
> > > > PMD-size pages.
> > > 
> > > What's the actual efficacy of this? Since the pmd is per inode, all I
> > > need is a lot of inodes and we're in business to destroy the directmap,
> > > no?
> > > 
> > > Afaict there's no privs needed to use this, all a process needs is to
> > > stay below the mlock limit, so a 'fork-bomb' that maps a single secret
> > > page will utterly destroy the direct map.
> > 
> > This indeed will cause 1G pages in the direct map to be split into 2M
> > chunks, but I disagree with 'destroy' term here. Citing the cover letter
> > of an earlier version of this series:
> 
> It will drop them down to 4k pages. Given enough inodes, and allocating
> only a single sekrit page per pmd, we'll shatter the directmap into 4k.

Why? Secretmem allocates PMD-size page per inode and uses it as a pool
of 4K pages for that inode. This way it ensures that
__kernel_map_pages() is always called on PMD boundaries.
James Bottomley Sept. 29, 2020, 3:03 p.m. UTC | #13
On Tue, 2020-09-29 at 16:12 +0200, Peter Zijlstra wrote:
> On Tue, Sep 29, 2020 at 04:05:29PM +0300, Mike Rapoport wrote:
> > On Fri, Sep 25, 2020 at 09:41:25AM +0200, Peter Zijlstra wrote:
> > > On Thu, Sep 24, 2020 at 04:29:03PM +0300, Mike Rapoport wrote:
> > > > From: Mike Rapoport <rppt@linux.ibm.com>
> > > > 
> > > > Removing a PAGE_SIZE page from the direct map every time such
> > > > page is allocated for a secret memory mapping will cause severe
> > > > fragmentation of the direct map. This fragmentation can be
> > > > reduced by using PMD-size pages as a pool for small pages for
> > > > secret memory mappings.
> > > > 
> > > > Add a gen_pool per secretmem inode and lazily populate this
> > > > pool with PMD-size pages.
> > > 
> > > What's the actual efficacy of this? Since the pmd is per inode,
> > > all I need is a lot of inodes and we're in business to destroy
> > > the directmap, no?
> > > 
> > > Afaict there's no privs needed to use this, all a process needs
> > > is to stay below the mlock limit, so a 'fork-bomb' that maps a
> > > single secret page will utterly destroy the direct map.
> > 
> > This indeed will cause 1G pages in the direct map to be split into
> > 2M chunks, but I disagree with 'destroy' term here. Citing the
> > cover letter of an earlier version of this series:
> 
> It will drop them down to 4k pages. Given enough inodes, and
> allocating only a single sekrit page per pmd, we'll shatter the
> directmap into 4k.

Since the only requirement is 2M, even if this happens, which I'm not
sure it does, it's fixable to only fragment down to 2M, right?

We could also enforce a global limit in the secretmem syscall, so the
fork bomb problem can be made to go away.

Lastly, we could go back to boot time allocation as the previous patch
did, so this isn't even a fundamental problem with the patch set.

That said, I think investigation of the importance of direct map tiling
is useful, since it does fragment for other reasons, and fixing or
proving that the fragmentation doesn't matter is also something we'll
keep on investigating.  But it would be useful in the meantime to
explore things which may be more fundamental issues with the approach.

Regards,

James
Peter Zijlstra Sept. 29, 2020, 3:15 p.m. UTC | #14
On Tue, Sep 29, 2020 at 05:58:13PM +0300, Mike Rapoport wrote:
> On Tue, Sep 29, 2020 at 04:12:16PM +0200, Peter Zijlstra wrote:

> > It will drop them down to 4k pages. Given enough inodes, and allocating
> > only a single sekrit page per pmd, we'll shatter the directmap into 4k.
> 
> Why? Secretmem allocates PMD-size page per inode and uses it as a pool
> of 4K pages for that inode. This way it ensures that
> __kernel_map_pages() is always called on PMD boundaries.

Oh, you unmap the 2m page upfront? I read it like you did the unmap at
the sekrit page alloc, not the pool alloc side of things.

Then yes, but then you're wasting gobs of memory. Basically you can pin
2M per inode while only accounting a single page.
Mike Rapoport Sept. 30, 2020, 10:20 a.m. UTC | #15
On Tue, Sep 29, 2020 at 04:12:16PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 29, 2020 at 04:05:29PM +0300, Mike Rapoport wrote:
> > On Fri, Sep 25, 2020 at 09:41:25AM +0200, Peter Zijlstra wrote:
> > > On Thu, Sep 24, 2020 at 04:29:03PM +0300, Mike Rapoport wrote:
> > > > From: Mike Rapoport <rppt@linux.ibm.com>
> > > > 
> > > > Removing a PAGE_SIZE page from the direct map every time such page is
> > > > allocated for a secret memory mapping will cause severe fragmentation of
> > > > the direct map. This fragmentation can be reduced by using PMD-size pages
> > > > as a pool for small pages for secret memory mappings.
> > > > 
> > > > Add a gen_pool per secretmem inode and lazily populate this pool with
> > > > PMD-size pages.
> > > 
> > > What's the actual efficacy of this? Since the pmd is per inode, all I
> > > need is a lot of inodes and we're in business to destroy the directmap,
> > > no?
> > > 
> > > Afaict there's no privs needed to use this, all a process needs is to
> > > stay below the mlock limit, so a 'fork-bomb' that maps a single secret
> > > page will utterly destroy the direct map.
> > 
> > This indeed will cause 1G pages in the direct map to be split into 2M
> > chunks, but I disagree with 'destroy' term here. Citing the cover letter
> > of an earlier version of this series:
> 
> It will drop them down to 4k pages. Given enough inodes, and allocating
> only a single sekrit page per pmd, we'll shatter the directmap into 4k.
> 
> >   I've tried to find some numbers that show the benefit of using larger
> >   pages in the direct map, but I couldn't find anything so I've run a
> >   couple of benchmarks from phoronix-test-suite on my laptop (i7-8650U
> >   with 32G RAM).
> 
> Existing benchmarks suck at this, but FB had a load that had a

I tried to dig the regression report in the mailing list, and the best I
could find is

https://lore.kernel.org/lkml/20190823052335.572133-1-songliubraving@fb.com/

which does not mention the actual performance regression but it only
complaints about kernel text mapping being split into 4K pages.

Any chance you have the regression report handy? 

> deterministic enough performance regression to bisect to a directmap
> issue, fixed by:
> 
>   7af0145067bc ("x86/mm/cpa: Prevent large page split when ftrace flips RW on kernel text")

This commit talks about large page split for the text and mentions iTLB
performance.
Could it be that for data the behavoiur is different?

> >   I've tested three variants: the default with 28G of the physical
> >   memory covered with 1G pages, then I disabled 1G pages using
> >   "nogbpages" in the kernel command line and at last I've forced the
> >   entire direct map to use 4K pages using a simple patch to
> >   arch/x86/mm/init.c.  I've made runs of the benchmarks with SSD and
> >   tmpfs.
> >   
> >   Surprisingly, the results does not show huge advantage for large
> >   pages. For instance, here the results for kernel build with
> >   'make -j8', in seconds:
> 
> Your benchmark should stress the TLB of your uarch, such that additional
> pressure added by the shattered directmap shows up.

I understand that the benchmark should stress the TLB, but it's not that
we can add something like random access to a large working set as a
kernel module and insmod it. The userspace should do something that will
cause the stress to the TLB so that entries corresponding to the direct
map will be evicted frequently. And, frankly, 

> And no, I don't have one either.
> 
> >                         |  1G    |  2M    |  4K
> >   ----------------------+--------+--------+---------
> >   ssd, mitigations=on	| 308.75 | 317.37 | 314.9
> >   ssd, mitigations=off	| 305.25 | 295.32 | 304.92
> >   ram, mitigations=on	| 301.58 | 322.49 | 306.54
> >   ram, mitigations=off	| 299.32 | 288.44 | 310.65
> 
> These results lack error data, but assuming the reults are significant,
> then this very much makes a case for 1G mappings. 5s on a kernel builds
> is pretty good.

The standard error for those are between 2.5 and 4.5 out of 3 runs for
each variant. 

For kernel build 1G mappings perform better, but here 5s is only 1.6% of
300s and the direct map fragmentation was taken to the extreme here.
I'm not saying that the direct map fragmentation comes with no cost, but
the cost is not so big to dismiss features that cause the fragmentation
out of hand.

There were also benchmarks that actually performed better with 2M pages
in the direct map, so I'm still not convinced that 1G pages in the
direct map are the clear cut winner.
Mike Rapoport Sept. 30, 2020, 10:27 a.m. UTC | #16
On Tue, Sep 29, 2020 at 05:15:52PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 29, 2020 at 05:58:13PM +0300, Mike Rapoport wrote:
> > On Tue, Sep 29, 2020 at 04:12:16PM +0200, Peter Zijlstra wrote:
> 
> > > It will drop them down to 4k pages. Given enough inodes, and allocating
> > > only a single sekrit page per pmd, we'll shatter the directmap into 4k.
> > 
> > Why? Secretmem allocates PMD-size page per inode and uses it as a pool
> > of 4K pages for that inode. This way it ensures that
> > __kernel_map_pages() is always called on PMD boundaries.
> 
> Oh, you unmap the 2m page upfront? I read it like you did the unmap at
> the sekrit page alloc, not the pool alloc side of things.
> 
> Then yes, but then you're wasting gobs of memory. Basically you can pin
> 2M per inode while only accounting a single page.

Right, quite like THP :)

I considered using a global pool of 2M pages for secretmem and handing
4K pages to each inode from that global pool. But I've decided to waste
memory in favor of simplicity.

The prevoius version of this set included additional patch that allowed
reserving chunk of the physical memory for a global secretmem pool at
boot time. We didn't reach an agreement with David H. about whether this
pool should be allocated directly from memblock or from CMA and I've
dropped the boot time reservation patch because it can always be added on
top.
Peter Zijlstra Sept. 30, 2020, 10:43 a.m. UTC | #17
On Wed, Sep 30, 2020 at 01:20:31PM +0300, Mike Rapoport wrote:

> I tried to dig the regression report in the mailing list, and the best I
> could find is
> 
> https://lore.kernel.org/lkml/20190823052335.572133-1-songliubraving@fb.com/
> 
> which does not mention the actual performance regression but it only
> complaints about kernel text mapping being split into 4K pages.
> 
> Any chance you have the regression report handy? 

I think the saga started here:

 20190820075128.2912224-1-songliubraving@fb.com
 20190820202314.1083149-1-songliubraving@fb.com
 20190823052335.572133-1-songliubraving@fb.com

After that Thomas did the patch I referred to earlier and I endeavoured
to rewrite x86-ftrace.

I added Song to CC, maybe he can remember more.
James Bottomley Sept. 30, 2020, 2:39 p.m. UTC | #18
On Wed, 2020-09-30 at 13:27 +0300, Mike Rapoport wrote:
> On Tue, Sep 29, 2020 at 05:15:52PM +0200, Peter Zijlstra wrote:
> > On Tue, Sep 29, 2020 at 05:58:13PM +0300, Mike Rapoport wrote:
> > > On Tue, Sep 29, 2020 at 04:12:16PM +0200, Peter Zijlstra wrote:
> > > > It will drop them down to 4k pages. Given enough inodes, and
> > > > allocating only a single sekrit page per pmd, we'll shatter the
> > > > directmap into 4k.
> > > 
> > > Why? Secretmem allocates PMD-size page per inode and uses it as a
> > > pool of 4K pages for that inode. This way it ensures that
> > > __kernel_map_pages() is always called on PMD boundaries.
> > 
> > Oh, you unmap the 2m page upfront? I read it like you did the unmap
> > at the sekrit page alloc, not the pool alloc side of things.
> > 
> > Then yes, but then you're wasting gobs of memory. Basically you can
> > pin 2M per inode while only accounting a single page.
> 
> Right, quite like THP :)
> 
> I considered using a global pool of 2M pages for secretmem and
> handing 4K pages to each inode from that global pool. But I've
> decided to waste memory in favor of simplicity.

I can also add that the user space consumer of this we wrote does its
user pool allocation at a 2M granularity, so nothing is actually
wasted.

https://git.kernel.org/pub/scm/linux/kernel/git/jejb/secret-memory-preloader.git/

James
David Hildenbrand Sept. 30, 2020, 2:45 p.m. UTC | #19
On 30.09.20 16:39, James Bottomley wrote:
> On Wed, 2020-09-30 at 13:27 +0300, Mike Rapoport wrote:
>> On Tue, Sep 29, 2020 at 05:15:52PM +0200, Peter Zijlstra wrote:
>>> On Tue, Sep 29, 2020 at 05:58:13PM +0300, Mike Rapoport wrote:
>>>> On Tue, Sep 29, 2020 at 04:12:16PM +0200, Peter Zijlstra wrote:
>>>>> It will drop them down to 4k pages. Given enough inodes, and
>>>>> allocating only a single sekrit page per pmd, we'll shatter the
>>>>> directmap into 4k.
>>>>
>>>> Why? Secretmem allocates PMD-size page per inode and uses it as a
>>>> pool of 4K pages for that inode. This way it ensures that
>>>> __kernel_map_pages() is always called on PMD boundaries.
>>>
>>> Oh, you unmap the 2m page upfront? I read it like you did the unmap
>>> at the sekrit page alloc, not the pool alloc side of things.
>>>
>>> Then yes, but then you're wasting gobs of memory. Basically you can
>>> pin 2M per inode while only accounting a single page.
>>
>> Right, quite like THP :)
>>
>> I considered using a global pool of 2M pages for secretmem and
>> handing 4K pages to each inode from that global pool. But I've
>> decided to waste memory in favor of simplicity.
> 
> I can also add that the user space consumer of this we wrote does its
> user pool allocation at a 2M granularity, so nothing is actually
> wasted.

... for that specific user space consumer. (or am I missing something?)
Matthew Wilcox Sept. 30, 2020, 3:09 p.m. UTC | #20
On Wed, Sep 30, 2020 at 01:27:45PM +0300, Mike Rapoport wrote:
> On Tue, Sep 29, 2020 at 05:15:52PM +0200, Peter Zijlstra wrote:
> > On Tue, Sep 29, 2020 at 05:58:13PM +0300, Mike Rapoport wrote:
> > > On Tue, Sep 29, 2020 at 04:12:16PM +0200, Peter Zijlstra wrote:
> > 
> > > > It will drop them down to 4k pages. Given enough inodes, and allocating
> > > > only a single sekrit page per pmd, we'll shatter the directmap into 4k.
> > > 
> > > Why? Secretmem allocates PMD-size page per inode and uses it as a pool
> > > of 4K pages for that inode. This way it ensures that
> > > __kernel_map_pages() is always called on PMD boundaries.
> > 
> > Oh, you unmap the 2m page upfront? I read it like you did the unmap at
> > the sekrit page alloc, not the pool alloc side of things.
> > 
> > Then yes, but then you're wasting gobs of memory. Basically you can pin
> > 2M per inode while only accounting a single page.
> 
> Right, quite like THP :)

Huh?  THP accounts every page it allocates.  If you allocate 2MB,
it accounts 512 pages.  And THP are reclaimable by vmscan, this is
obviously not.
James Bottomley Sept. 30, 2020, 3:17 p.m. UTC | #21
On Wed, 2020-09-30 at 16:45 +0200, David Hildenbrand wrote:
> On 30.09.20 16:39, James Bottomley wrote:
> > On Wed, 2020-09-30 at 13:27 +0300, Mike Rapoport wrote:
> > > On Tue, Sep 29, 2020 at 05:15:52PM +0200, Peter Zijlstra wrote:
> > > > On Tue, Sep 29, 2020 at 05:58:13PM +0300, Mike Rapoport wrote:
> > > > > On Tue, Sep 29, 2020 at 04:12:16PM +0200, Peter Zijlstra
> > > > > wrote:
> > > > > > It will drop them down to 4k pages. Given enough inodes,
> > > > > > and allocating only a single sekrit page per pmd, we'll
> > > > > > shatter the directmap into 4k.
> > > > > 
> > > > > Why? Secretmem allocates PMD-size page per inode and uses it
> > > > > as a pool of 4K pages for that inode. This way it ensures
> > > > > that __kernel_map_pages() is always called on PMD boundaries.
> > > > 
> > > > Oh, you unmap the 2m page upfront? I read it like you did the
> > > > unmap at the sekrit page alloc, not the pool alloc side of
> > > > things.
> > > > 
> > > > Then yes, but then you're wasting gobs of memory. Basically you
> > > > can pin 2M per inode while only accounting a single page.
> > > 
> > > Right, quite like THP :)
> > > 
> > > I considered using a global pool of 2M pages for secretmem and
> > > handing 4K pages to each inode from that global pool. But I've
> > > decided to waste memory in favor of simplicity.
> > 
> > I can also add that the user space consumer of this we wrote does
> > its user pool allocation at a 2M granularity, so nothing is
> > actually wasted.
> 
> ... for that specific user space consumer. (or am I missing
> something?)

I'm not sure I understand what you mean?  It's designed to be either
the standard wrapper or an example of how to do the standard wrapper
for the syscall.  It uses the same allocator system glibc uses for
malloc/free ... which pretty much everyone uses instead of calling
sys_brk directly.  If you look at the granularity glibc uses for
sys_brk, it's not 4k either.

James
David Hildenbrand Sept. 30, 2020, 3:25 p.m. UTC | #22
On 30.09.20 17:17, James Bottomley wrote:
> On Wed, 2020-09-30 at 16:45 +0200, David Hildenbrand wrote:
>> On 30.09.20 16:39, James Bottomley wrote:
>>> On Wed, 2020-09-30 at 13:27 +0300, Mike Rapoport wrote:
>>>> On Tue, Sep 29, 2020 at 05:15:52PM +0200, Peter Zijlstra wrote:
>>>>> On Tue, Sep 29, 2020 at 05:58:13PM +0300, Mike Rapoport wrote:
>>>>>> On Tue, Sep 29, 2020 at 04:12:16PM +0200, Peter Zijlstra
>>>>>> wrote:
>>>>>>> It will drop them down to 4k pages. Given enough inodes,
>>>>>>> and allocating only a single sekrit page per pmd, we'll
>>>>>>> shatter the directmap into 4k.
>>>>>>
>>>>>> Why? Secretmem allocates PMD-size page per inode and uses it
>>>>>> as a pool of 4K pages for that inode. This way it ensures
>>>>>> that __kernel_map_pages() is always called on PMD boundaries.
>>>>>
>>>>> Oh, you unmap the 2m page upfront? I read it like you did the
>>>>> unmap at the sekrit page alloc, not the pool alloc side of
>>>>> things.
>>>>>
>>>>> Then yes, but then you're wasting gobs of memory. Basically you
>>>>> can pin 2M per inode while only accounting a single page.
>>>>
>>>> Right, quite like THP :)
>>>>
>>>> I considered using a global pool of 2M pages for secretmem and
>>>> handing 4K pages to each inode from that global pool. But I've
>>>> decided to waste memory in favor of simplicity.
>>>
>>> I can also add that the user space consumer of this we wrote does
>>> its user pool allocation at a 2M granularity, so nothing is
>>> actually wasted.
>>
>> ... for that specific user space consumer. (or am I missing
>> something?)
> 
> I'm not sure I understand what you mean?  It's designed to be either
> the standard wrapper or an example of how to do the standard wrapper
> for the syscall.  It uses the same allocator system glibc uses for
> malloc/free ... which pretty much everyone uses instead of calling
> sys_brk directly.  If you look at the granularity glibc uses for
> sys_brk, it's not 4k either.

Okay thanks, "the user space consumer of this we wrote" didn't sound as
generic to me as "the standard wrapper".
Mike Rapoport Oct. 1, 2020, 8:14 a.m. UTC | #23
On Wed, Sep 30, 2020 at 04:09:28PM +0100, Matthew Wilcox wrote:
> On Wed, Sep 30, 2020 at 01:27:45PM +0300, Mike Rapoport wrote:
> > On Tue, Sep 29, 2020 at 05:15:52PM +0200, Peter Zijlstra wrote:
> > > On Tue, Sep 29, 2020 at 05:58:13PM +0300, Mike Rapoport wrote:
> > > > On Tue, Sep 29, 2020 at 04:12:16PM +0200, Peter Zijlstra wrote:
> > > 
> > > > > It will drop them down to 4k pages. Given enough inodes, and allocating
> > > > > only a single sekrit page per pmd, we'll shatter the directmap into 4k.
> > > > 
> > > > Why? Secretmem allocates PMD-size page per inode and uses it as a pool
> > > > of 4K pages for that inode. This way it ensures that
> > > > __kernel_map_pages() is always called on PMD boundaries.
> > > 
> > > Oh, you unmap the 2m page upfront? I read it like you did the unmap at
> > > the sekrit page alloc, not the pool alloc side of things.
> > > 
> > > Then yes, but then you're wasting gobs of memory. Basically you can pin
> > > 2M per inode while only accounting a single page.
> > 
> > Right, quite like THP :)
> 
> Huh?  THP accounts every page it allocates.  If you allocate 2MB,
> it accounts 512 pages.

I meant that secremem allocates 2M in advance like THP and not that it
similar because only page is accounted.
Anyway, the intention was to account the entrire 2M chunk (512 pages),
so I'll recheck the accounting and I'll fix it if I missed something.

> And THP are reclaimable by vmscan, this is obviously not.

True, this is more like mlock in that sense.
diff mbox series

Patch

diff --git a/mm/secretmem.c b/mm/secretmem.c
index 3293f761076e..333eb18fb483 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -12,6 +12,7 @@ 
 #include <linux/bitops.h>
 #include <linux/printk.h>
 #include <linux/pagemap.h>
+#include <linux/genalloc.h>
 #include <linux/syscalls.h>
 #include <linux/pseudo_fs.h>
 #include <linux/set_memory.h>
@@ -40,24 +41,66 @@ 
 #define SECRETMEM_FLAGS_MASK	SECRETMEM_MODE_MASK
 
 struct secretmem_ctx {
+	struct gen_pool *pool;
 	unsigned int mode;
 };
 
-static struct page *secretmem_alloc_page(gfp_t gfp)
+static int secretmem_pool_increase(struct secretmem_ctx *ctx, gfp_t gfp)
 {
-	/*
-	 * FIXME: use a cache of large pages to reduce the direct map
-	 * fragmentation
-	 */
-	return alloc_page(gfp);
+	unsigned long nr_pages = (1 << PMD_PAGE_ORDER);
+	struct gen_pool *pool = ctx->pool;
+	unsigned long addr;
+	struct page *page;
+	int err;
+
+	page = alloc_pages(gfp, PMD_PAGE_ORDER);
+	if (!page)
+		return -ENOMEM;
+
+	addr = (unsigned long)page_address(page);
+	split_page(page, PMD_PAGE_ORDER);
+
+	err = gen_pool_add(pool, addr, PMD_SIZE, NUMA_NO_NODE);
+	if (err) {
+		__free_pages(page, PMD_PAGE_ORDER);
+		return err;
+	}
+
+	__kernel_map_pages(page, nr_pages, 0);
+
+	return 0;
+}
+
+static struct page *secretmem_alloc_page(struct secretmem_ctx *ctx,
+					 gfp_t gfp)
+{
+	struct gen_pool *pool = ctx->pool;
+	unsigned long addr;
+	struct page *page;
+	int err;
+
+	if (gen_pool_avail(pool) < PAGE_SIZE) {
+		err = secretmem_pool_increase(ctx, gfp);
+		if (err)
+			return NULL;
+	}
+
+	addr = gen_pool_alloc(pool, PAGE_SIZE);
+	if (!addr)
+		return NULL;
+
+	page = virt_to_page(addr);
+	get_page(page);
+
+	return page;
 }
 
 static vm_fault_t secretmem_fault(struct vm_fault *vmf)
 {
+	struct secretmem_ctx *ctx = vmf->vma->vm_file->private_data;
 	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
 	struct inode *inode = file_inode(vmf->vma->vm_file);
 	pgoff_t offset = vmf->pgoff;
-	unsigned long addr;
 	struct page *page;
 	int ret = 0;
 
@@ -66,7 +109,7 @@  static vm_fault_t secretmem_fault(struct vm_fault *vmf)
 
 	page = find_get_entry(mapping, offset);
 	if (!page) {
-		page = secretmem_alloc_page(vmf->gfp_mask);
+		page = secretmem_alloc_page(ctx, vmf->gfp_mask);
 		if (!page)
 			return vmf_error(-ENOMEM);
 
@@ -74,14 +117,8 @@  static vm_fault_t secretmem_fault(struct vm_fault *vmf)
 		if (unlikely(ret))
 			goto err_put_page;
 
-		ret = set_direct_map_invalid_noflush(page);
-		if (ret)
-			goto err_del_page_cache;
-
-		addr = (unsigned long)page_address(page);
-		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
-
 		__SetPageUptodate(page);
+		set_page_private(page, (unsigned long)ctx);
 
 		ret = VM_FAULT_LOCKED;
 	}
@@ -89,8 +126,6 @@  static vm_fault_t secretmem_fault(struct vm_fault *vmf)
 	vmf->page = page;
 	return ret;
 
-err_del_page_cache:
-	delete_from_page_cache(page);
 err_put_page:
 	put_page(page);
 	return vmf_error(ret);
@@ -138,7 +173,11 @@  static int secretmem_migratepage(struct address_space *mapping,
 
 static void secretmem_freepage(struct page *page)
 {
-	set_direct_map_default_noflush(page);
+	unsigned long addr = (unsigned long)page_address(page);
+	struct secretmem_ctx *ctx = (struct secretmem_ctx *)page_private(page);
+	struct gen_pool *pool = ctx->pool;
+
+	gen_pool_free(pool, addr, PAGE_SIZE);
 }
 
 static const struct address_space_operations secretmem_aops = {
@@ -163,13 +202,18 @@  static struct file *secretmem_file_create(unsigned long flags)
 	if (!ctx)
 		goto err_free_inode;
 
+	ctx->pool = gen_pool_create(PAGE_SHIFT, NUMA_NO_NODE);
+	if (!ctx->pool)
+		goto err_free_ctx;
+
 	file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
 				 O_RDWR, &secretmem_fops);
 	if (IS_ERR(file))
-		goto err_free_ctx;
+		goto err_free_pool;
 
 	mapping_set_unevictable(inode->i_mapping);
 
+	inode->i_private = ctx;
 	inode->i_mapping->private_data = ctx;
 	inode->i_mapping->a_ops = &secretmem_aops;
 
@@ -183,6 +227,8 @@  static struct file *secretmem_file_create(unsigned long flags)
 
 	return file;
 
+err_free_pool:
+	gen_pool_destroy(ctx->pool);
 err_free_ctx:
 	kfree(ctx);
 err_free_inode:
@@ -221,11 +267,34 @@  SYSCALL_DEFINE1(memfd_secret, unsigned long, flags)
 	return err;
 }
 
+static void secretmem_cleanup_chunk(struct gen_pool *pool,
+				    struct gen_pool_chunk *chunk, void *data)
+{
+	unsigned long start = chunk->start_addr;
+	unsigned long end = chunk->end_addr;
+	unsigned long nr_pages, addr;
+
+	nr_pages = (end - start + 1) / PAGE_SIZE;
+	__kernel_map_pages(virt_to_page(start), nr_pages, 1);
+
+	for (addr = start; addr < end; addr += PAGE_SIZE)
+		put_page(virt_to_page(addr));
+}
+
+static void secretmem_cleanup_pool(struct secretmem_ctx *ctx)
+{
+	struct gen_pool *pool = ctx->pool;
+
+	gen_pool_for_each_chunk(pool, secretmem_cleanup_chunk, ctx);
+	gen_pool_destroy(pool);
+}
+
 static void secretmem_evict_inode(struct inode *inode)
 {
 	struct secretmem_ctx *ctx = inode->i_private;
 
 	truncate_inode_pages_final(&inode->i_data);
+	secretmem_cleanup_pool(ctx);
 	clear_inode(inode);
 	kfree(ctx);
 }