[v17,00/10] mm: introduce memfd_secret system call to create "secret" memory areas

Message ID 20210208084920.2884-1-rppt@kernel.org (mailing list archive)

Message

Mike Rapoport Feb. 8, 2021, 8:49 a.m. UTC
From: Mike Rapoport <rppt@linux.ibm.com>

Hi,

@Andrew, this is based on v5.11-rc5-mmotm-2021-01-27-23-30, with secretmem
and related patches dropped from there, I can rebase whatever way you
prefer.

This is an implementation of "secret" mappings backed by a file descriptor.

The file descriptor backing secret memory mappings is created using a
dedicated memfd_secret system call. The desired protection mode for the
memory is configured using the flags parameter of the system call. The
mmap() of the file descriptor created with memfd_secret() will create a
"secret" memory mapping. The pages in that mapping will be marked as not
present in the direct map and will be present only in the page table of
the owning mm.

Although normally Linux userspace mappings are protected from other users,
such secret mappings are useful for environments where a hostile tenant is
trying to trick the kernel into giving them access to other tenants'
mappings.

Additionally, in the future the secret mappings may be used as a means to
protect guest memory in a virtual machine host.

For demonstration of secret memory usage we've created a userspace library

https://git.kernel.org/pub/scm/linux/kernel/git/jejb/secret-memory-preloader.git

that does two things: first, it acts as a preloader for openssl,
redirecting all the OPENSSL_malloc calls to secret memory so that any
secret keys are automatically protected this way; second, it exposes the
API to the user who needs it. We anticipate that a lot of the use cases
would be like the openssl one: many toolkits that deal with secret keys
already have special handling for the memory to try to give them greater
protection, so this would simply be pluggable into the toolkits without
any need for user application modification.

Hiding secret memory mappings behind an anonymous file allows usage of
the page cache for tracking pages allocated for the "secret" mappings as
well as using address_space_operations for e.g. page migration callbacks.

The anonymous file may also be used implicitly, like hugetlb files, to
implement mmap(MAP_SECRET) and use the secret memory areas with "native" mm
ABIs in the future.

Removing pages from the direct map may cause its fragmentation on
architectures that use large pages to map the physical memory, which
affects the system performance. However, the original Kconfig text for
CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can
improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
("x86: add gbpages switches")) and the recent report [1] showed that "...
although 1G mappings are a good default choice, there is no compelling
evidence that it must be the only choice". Hence, it is sufficient to have
secretmem disabled by default with the ability of a system administrator to
enable it at boot time.

In addition, there is also a long term goal to improve management of the
direct map.

[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

v17:
* Remove pool of large pages backing secretmem allocations, per Michal Hocko
* Add secretmem pages to unevictable LRU, per Michal Hocko
* Use GFP_HIGHUSER as secretmem mapping mask, per Michal Hocko
* Make secretmem an opt-in feature that is disabled by default
 
v16:
* Fix memory leak introduced in v15
* Clean the data left from previous page user before handing the page to
  the userspace

v15: https://lore.kernel.org/lkml/20210120180612.1058-1-rppt@kernel.org
* Add riscv/Kconfig update to disable set_memory operations for nommu
  builds (patch 3)
* Update the code around add_to_page_cache() per Matthew's comments
  (patches 6,7)
* Add fixups for build/checkpatch errors discovered by CI systems

v14: https://lore.kernel.org/lkml/20201203062949.5484-1-rppt@kernel.org
* Finally s/mod_node_page_state/mod_lruvec_page_state/

v13: https://lore.kernel.org/lkml/20201201074559.27742-1-rppt@kernel.org
* Added Reviewed-by, thanks Catalin and David
* s/mod_node_page_state/mod_lruvec_page_state/ as Shakeel suggested

Older history:
v12: https://lore.kernel.org/lkml/20201125092208.12544-1-rppt@kernel.org
v11: https://lore.kernel.org/lkml/20201124092556.12009-1-rppt@kernel.org
v10: https://lore.kernel.org/lkml/20201123095432.5860-1-rppt@kernel.org
v9: https://lore.kernel.org/lkml/20201117162932.13649-1-rppt@kernel.org
v8: https://lore.kernel.org/lkml/20201110151444.20662-1-rppt@kernel.org
v7: https://lore.kernel.org/lkml/20201026083752.13267-1-rppt@kernel.org
v6: https://lore.kernel.org/lkml/20200924132904.1391-1-rppt@kernel.org
v5: https://lore.kernel.org/lkml/20200916073539.3552-1-rppt@kernel.org
v4: https://lore.kernel.org/lkml/20200818141554.13945-1-rppt@kernel.org
v3: https://lore.kernel.org/lkml/20200804095035.18778-1-rppt@kernel.org
v2: https://lore.kernel.org/lkml/20200727162935.31714-1-rppt@kernel.org
v1: https://lore.kernel.org/lkml/20200720092435.17469-1-rppt@kernel.org
rfc-v2: https://lore.kernel.org/lkml/20200706172051.19465-1-rppt@kernel.org/
rfc-v1: https://lore.kernel.org/lkml/20200130162340.GA14232@rapoport-lnx/
rfc-v0: https://lore.kernel.org/lkml/1572171452-7958-1-git-send-email-rppt@kernel.org/

Arnd Bergmann (1):
  arm64: kfence: fix header inclusion

Mike Rapoport (9):
  mm: add definition of PMD_PAGE_ORDER
  mmap: make mlock_future_check() global
  riscv/Kconfig: make direct map manipulation options depend on MMU
  set_memory: allow set_direct_map_*_noflush() for multiple pages
  set_memory: allow querying whether set_direct_map_*() is actually enabled
  mm: introduce memfd_secret system call to create "secret" memory areas
  PM: hibernate: disable when there are active secretmem users
  arch, mm: wire up memfd_secret system call where relevant
  secretmem: test: add basic selftest for memfd_secret(2)

 arch/arm64/include/asm/Kbuild             |   1 -
 arch/arm64/include/asm/cacheflush.h       |   6 -
 arch/arm64/include/asm/kfence.h           |   2 +-
 arch/arm64/include/asm/set_memory.h       |  17 ++
 arch/arm64/include/uapi/asm/unistd.h      |   1 +
 arch/arm64/kernel/machine_kexec.c         |   1 +
 arch/arm64/mm/mmu.c                       |   6 +-
 arch/arm64/mm/pageattr.c                  |  23 +-
 arch/riscv/Kconfig                        |   4 +-
 arch/riscv/include/asm/set_memory.h       |   4 +-
 arch/riscv/include/asm/unistd.h           |   1 +
 arch/riscv/mm/pageattr.c                  |   8 +-
 arch/x86/entry/syscalls/syscall_32.tbl    |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl    |   1 +
 arch/x86/include/asm/set_memory.h         |   4 +-
 arch/x86/mm/pat/set_memory.c              |   8 +-
 fs/dax.c                                  |  11 +-
 include/linux/pgtable.h                   |   3 +
 include/linux/secretmem.h                 |  30 +++
 include/linux/set_memory.h                |  16 +-
 include/linux/syscalls.h                  |   1 +
 include/uapi/asm-generic/unistd.h         |   6 +-
 include/uapi/linux/magic.h                |   1 +
 kernel/power/hibernate.c                  |   5 +-
 kernel/power/snapshot.c                   |   4 +-
 kernel/sys_ni.c                           |   2 +
 mm/Kconfig                                |   3 +
 mm/Makefile                               |   1 +
 mm/gup.c                                  |  10 +
 mm/internal.h                             |   3 +
 mm/mlock.c                                |   3 +-
 mm/mmap.c                                 |   5 +-
 mm/secretmem.c                            | 261 +++++++++++++++++++
 mm/vmalloc.c                              |   5 +-
 scripts/checksyscalls.sh                  |   4 +
 tools/testing/selftests/vm/.gitignore     |   1 +
 tools/testing/selftests/vm/Makefile       |   3 +-
 tools/testing/selftests/vm/memfd_secret.c | 296 ++++++++++++++++++++++
 tools/testing/selftests/vm/run_vmtests    |  17 ++
 39 files changed, 726 insertions(+), 53 deletions(-)
 create mode 100644 arch/arm64/include/asm/set_memory.h
 create mode 100644 include/linux/secretmem.h
 create mode 100644 mm/secretmem.c
 create mode 100644 tools/testing/selftests/vm/memfd_secret.c

Comments

David Hildenbrand Feb. 8, 2021, 9:27 a.m. UTC | #1
On 08.02.21 09:49, Mike Rapoport wrote:

Some questions (and a request to document the answers), as we now allow
unmovable allocations all over the place and I don't see a single
comment regarding that in the cover letter:

1. How will the issue of plenty of unmovable allocations for user space
be tackled in the future?

2. How has this issue been documented? E.g., interaction with
ZONE_MOVABLE and CMA, alloc_contig_range()/alloc_contig_pages()?

3. What are the plans to support migration in the future and which
interface changes will be required? (Michal mentioned some good points
to make this configurable via the interface, we should plan ahead and
document)

Thanks!
Mike Rapoport Feb. 8, 2021, 9:13 p.m. UTC | #2
On Mon, Feb 08, 2021 at 10:27:18AM +0100, David Hildenbrand wrote:
> On 08.02.21 09:49, Mike Rapoport wrote:
> 
> Some questions (and request to document the answers) as we now allow to have
> unmovable allocations all over the place and I don't see a single comment
> regarding that in the cover letter:
> 
> 1. How will the issue of plenty of unmovable allocations for user space be
> tackled in the future?
> 
> 2. How has this issue been documented? E.g., interaction with ZONE_MOVABLE
> and CMA, alloc_conig_range()/alloc_contig_pages?.

Secretmem sets the mapping's gfp mask to GFP_HIGHUSER, so it does not
allocate movable pages in the first place.
 
> 3. How are the plans to support migration in the future and which interface
> changes will be required? (Michal mentioned some good points to make this
> configurable via the interface, we should plan ahead and document)

The only interface change required is the addition of a bit value for the
syscall flags. I really think it can be documented together with the
addition of migration, or any other feature for that matter.
David Hildenbrand Feb. 8, 2021, 9:38 p.m. UTC | #3
> Am 08.02.2021 um 22:13 schrieb Mike Rapoport <rppt@kernel.org>:
> 
> On Mon, Feb 08, 2021 at 10:27:18AM +0100, David Hildenbrand wrote:
>> On 08.02.21 09:49, Mike Rapoport wrote:
>> 
>> Some questions (and request to document the answers) as we now allow to have
>> unmovable allocations all over the place and I don't see a single comment
>> regarding that in the cover letter:
>> 
>> 1. How will the issue of plenty of unmovable allocations for user space be
>> tackled in the future?
>> 
>> 2. How has this issue been documented? E.g., interaction with ZONE_MOVABLE
>> and CMA, alloc_conig_range()/alloc_contig_pages?.
> 
> Secretmem sets the mappings gfp mask to GFP_HIGHUSER, so it does not
> allocate movable pages at the first place.

That is not the point. Secretmem cannot go on CMA / ZONE_MOVABLE memory
and behaves like long-term pinnings in that sense. This is a real issue
when using a lot of secretmem.

Please have a look at what Pavel documents regarding long term pinnings
and ZONE_MOVABLE in his patches currently on the list.
Michal Hocko Feb. 9, 2021, 8:59 a.m. UTC | #4
On Mon 08-02-21 22:38:03, David Hildenbrand wrote:
> 
> > Am 08.02.2021 um 22:13 schrieb Mike Rapoport <rppt@kernel.org>:
> > 
> > On Mon, Feb 08, 2021 at 10:27:18AM +0100, David Hildenbrand wrote:
> >> On 08.02.21 09:49, Mike Rapoport wrote:
> >> 
> >> Some questions (and request to document the answers) as we now allow to have
> >> unmovable allocations all over the place and I don't see a single comment
> >> regarding that in the cover letter:
> >> 
> >> 1. How will the issue of plenty of unmovable allocations for user space be
> >> tackled in the future?
> >> 
> >> 2. How has this issue been documented? E.g., interaction with ZONE_MOVABLE
> >> and CMA, alloc_conig_range()/alloc_contig_pages?.
> > 
> > Secretmem sets the mappings gfp mask to GFP_HIGHUSER, so it does not
> > allocate movable pages at the first place.
> 
> That is not the point. Secretmem cannot go on CMA / ZONE_MOVABLE
> memory and behaves like long-term pinnings in that sense. This is a
> real issue when using a lot of sectremem.

A lot of unevictable memory is a concern regardless of CMA/ZONE_MOVABLE.
As I've said it is quite easy to land in a similar situation even with
tmpfs/MAP_ANON|MAP_SHARED on a swapless system. Neither of the two is
really uncommon. It would be even worse if those were allowed to
consume both CMA/ZONE_MOVABLE.

One has to be very careful when relying on CMA or movable zones. This is
definitely worth a comment in the kernel command line parameter
documentation. But this is not a new problem.
David Hildenbrand Feb. 9, 2021, 9:15 a.m. UTC | #5
On 09.02.21 09:59, Michal Hocko wrote:
> On Mon 08-02-21 22:38:03, David Hildenbrand wrote:
>>
>>> Am 08.02.2021 um 22:13 schrieb Mike Rapoport <rppt@kernel.org>:
>>>
>>> On Mon, Feb 08, 2021 at 10:27:18AM +0100, David Hildenbrand wrote:
>>>> On 08.02.21 09:49, Mike Rapoport wrote:
>>>>
>>>> Some questions (and request to document the answers) as we now allow to have
>>>> unmovable allocations all over the place and I don't see a single comment
>>>> regarding that in the cover letter:
>>>>
>>>> 1. How will the issue of plenty of unmovable allocations for user space be
>>>> tackled in the future?
>>>>
>>>> 2. How has this issue been documented? E.g., interaction with ZONE_MOVABLE
>>>> and CMA, alloc_conig_range()/alloc_contig_pages?.
>>>
>>> Secretmem sets the mappings gfp mask to GFP_HIGHUSER, so it does not
>>> allocate movable pages at the first place.
>>
>> That is not the point. Secretmem cannot go on CMA / ZONE_MOVABLE
>> memory and behaves like long-term pinnings in that sense. This is a
>> real issue when using a lot of sectremem.
> 
> A lot of unevictable memory is a concern regardless of CMA/ZONE_MOVABLE.
> As I've said it is quite easy to land at the similar situation even with
> tmpfs/MAP_ANON|MAP_SHARED on swapless system. Neither of the two is
> really uncommon. It would be even worse that those would be allowed to
> consume both CMA/ZONE_MOVABLE.

IIRC, tmpfs/MAP_ANON|MAP_SHARED memory
a) Is movable, can land in ZONE_MOVABLE/CMA
b) Can be limited by sizing tmpfs appropriately

AFAIK, what you describe is a problem with memory overcommit, not with 
zone imbalances (below). Or what am I missing?

> 
> One has to be very careful when relying on CMA or movable zones. This is
> definitely worth a comment in the kernel command line parameter
> documentation. But this is not a new problem.

I see the following thing worth documenting:

Assume you have a system with 2GB of ZONE_NORMAL/ZONE_DMA and 4GB of 
ZONE_MOVABLE/CMA.

Assume you make use of 1.5GB of secretmem. Your system might run into 
OOM any time although you still have plenty of memory on ZONE_MOVABLE 
(and even swap!), simply because you are making excessive use of 
unmovable allocations (for user space!) in an environment where you 
should not make excessive use of unmovable allocations (e.g., where 
should page tables go?).

The existing controls (mlock limit) don't really match the current 
semantics of that memory. I repeat it once again: secretmem *currently* 
resembles long-term pinned memory, not mlocked memory. Things will 
change when implementing migration support for secretmem pages. Until 
then, the semantics are different and this should be spelled out.

For long-term pinnings this is kind of obvious, still we're now 
documenting it because it's dangerous to not be aware of. Secretmem 
behaves exactly the same and I think this is worth spelling out: 
secretmem has the potential of being used much more often than fairly 
special vfio/rdma/ ...

Looking at a cover letter that doesn't even mention the issue of 
unmovable allocations makes me think that we are either trying to ignore 
the problem or are not aware of the problem.
Michal Hocko Feb. 9, 2021, 9:53 a.m. UTC | #6
On Tue 09-02-21 10:15:17, David Hildenbrand wrote:
> On 09.02.21 09:59, Michal Hocko wrote:
> > On Mon 08-02-21 22:38:03, David Hildenbrand wrote:
> > > 
> > > > Am 08.02.2021 um 22:13 schrieb Mike Rapoport <rppt@kernel.org>:
> > > > 
> > > > On Mon, Feb 08, 2021 at 10:27:18AM +0100, David Hildenbrand wrote:
> > > > > On 08.02.21 09:49, Mike Rapoport wrote:
> > > > > 
> > > > > Some questions (and request to document the answers) as we now allow to have
> > > > > unmovable allocations all over the place and I don't see a single comment
> > > > > regarding that in the cover letter:
> > > > > 
> > > > > 1. How will the issue of plenty of unmovable allocations for user space be
> > > > > tackled in the future?
> > > > > 
> > > > > 2. How has this issue been documented? E.g., interaction with ZONE_MOVABLE
> > > > > and CMA, alloc_conig_range()/alloc_contig_pages?.
> > > > 
> > > > Secretmem sets the mappings gfp mask to GFP_HIGHUSER, so it does not
> > > > allocate movable pages at the first place.
> > > 
> > > That is not the point. Secretmem cannot go on CMA / ZONE_MOVABLE
> > > memory and behaves like long-term pinnings in that sense. This is a
> > > real issue when using a lot of sectremem.
> > 
> > A lot of unevictable memory is a concern regardless of CMA/ZONE_MOVABLE.
> > As I've said it is quite easy to land at the similar situation even with
> > tmpfs/MAP_ANON|MAP_SHARED on swapless system. Neither of the two is
> > really uncommon. It would be even worse that those would be allowed to
> > consume both CMA/ZONE_MOVABLE.
> 
> IIRC, tmpfs/MAP_ANON|MAP_SHARED memory
> a) Is movable, can land in ZONE_MOVABLE/CMA
> b) Can be limited by sizing tmpfs appropriately
> 
> AFAIK, what you describe is a problem with memory overcommit, not with zone
> imbalances (below). Or what am I missing?

It can be a problem for both. If you have just too much of shm (do not
forget about MAP_SHARED|MAP_ANON which is much harder to size from an
admin POV) then migratability doesn't really help because you need
free memory to migrate to. Without reclaimability this can easily become
a problem. That is why I am saying this is not really a new problem.
Swapless systems are not all that uncommon.
 
> > One has to be very careful when relying on CMA or movable zones. This is
> > definitely worth a comment in the kernel command line parameter
> > documentation. But this is not a new problem.
> 
> I see the following thing worth documenting:
> 
> Assume you have a system with 2GB of ZONE_NORMAL/ZONE_DMA and 4GB of
> ZONE_MOVABLE/CMA.
> 
> Assume you make use of 1.5GB of secretmem. Your system might run into OOM
> any time although you still have plenty of memory on ZONE_MOVAVLE (and even
> swap!), simply because you are making excessive use of unmovable allocations
> (for user space!) in an environment where you should not make excessive use
> of unmovable allocations (e.g., where should page tables go?).

yes, you are right of course and I am not really disputing this. But I
would argue that 2:1 Movable/Normal is a setup where problems are to be
expected already. "Lowmem" allocations can easily trigger OOM even without
secret mem in the picture. All it takes is allocating a lot of GFP_KERNEL
or even GFP_{HIGH}USER. Really, it is CMA/MOVABLE that are the elephant in
the room and one has to be really careful when relying on them.
 
> The existing controls (mlock limit) don't really match the current semantics
> of that memory. I repeat it once again: secretmem *currently* resembles
> long-term pinned memory, not mlocked memory.

Well, if we had a proper user space pinning accounting then I would
agree that there is a better model to use. But we don't. And previous
attempts to achieve that have failed. So I am afraid that we do not have
much choice left than using mlock as a model.

> Things will change when
> implementing migration support for secretmem pages. Until then, the
> semantics are different and this should be spelled out.
> 
> For long-term pinnings this is kind of obvious, still we're now documenting
> it because it's dangerous to not be aware of. Secretmem behaves exactly the
> same and I think this is worth spelling out: secretmem has the potential of
> being used much more often than fairly special vfio/rdma/ ...

yeah I do agree that pinning is a problem for movable/CMA but most
people simply do not care about those. Movable is the thing for hotplug
and a really weird fragmentation avoidance IIRC and CMA is mostly to
handle crap HW. If those are to be used along with secret mem or
longterm GUP then they will constantly bump into corner cases. Do not
get me wrong, we should be looking at those problems, we should even
document them but I do not see this as anything new. We should probably
have a central place in Documentation explaining all those problems. I
would be even happy to see an explicit note in the tunables - e.g.
configuring movable/normal in 2:1 will get you back to 32b times wrt.
low mem problems.
David Hildenbrand Feb. 9, 2021, 10:23 a.m. UTC | #7
>>> A lot of unevictable memory is a concern regardless of CMA/ZONE_MOVABLE.
>>> As I've said it is quite easy to land at the similar situation even with
>>> tmpfs/MAP_ANON|MAP_SHARED on swapless system. Neither of the two is
>>> really uncommon. It would be even worse that those would be allowed to
>>> consume both CMA/ZONE_MOVABLE.
>>
>> IIRC, tmpfs/MAP_ANON|MAP_SHARED memory
>> a) Is movable, can land in ZONE_MOVABLE/CMA
>> b) Can be limited by sizing tmpfs appropriately
>>
>> AFAIK, what you describe is a problem with memory overcommit, not with zone
>> imbalances (below). Or what am I missing?
> 
> It can be problem for both. If you have just too much of shm (do not
> forget about MAP_SHARED|MAP_ANON which is much harder to size from an
> admin POV) then migrateability doesn't really help because you need a
> free memory to migrate. Without reclaimability this can easily become a
> problem. That is why I am saying this is not really a new problem.
> Swapless systems are not all that uncommon.

I get your point, it's similar but still different. "no memory in the 
system" vs. "plenty of unusable free memory available in the system".

In many setups, memory for user space applications can go to 
ZONE_MOVABLE just fine. ZONE_NORMAL etc. can be used for supporting user 
space memory (e.g., page tables) and other kernel stuff.

Like, have 4GB of ZONE_MOVABLE with 2GB of ZONE_NORMAL. Have an 
application (database) that allocates 4GB of memory. Works just fine. 
The zone ratio ends up being a problem for example with many processes 
(-> many page tables).

Not being able to put user space memory into the movable zone is a 
special case. And we are introducing yet another special case here 
(besides vfio, rdma, unmigratable huge pages like gigantic pages).

With plenty of secretmem, looking at /proc/meminfo Total vs. Free can be 
a big lie of how your system behaves.

>   
>>> One has to be very careful when relying on CMA or movable zones. This is
>>> definitely worth a comment in the kernel command line parameter
>>> documentation. But this is not a new problem.
>>
>> I see the following thing worth documenting:
>>
>> Assume you have a system with 2GB of ZONE_NORMAL/ZONE_DMA and 4GB of
>> ZONE_MOVABLE/CMA.
>>
>> Assume you make use of 1.5GB of secretmem. Your system might run into OOM
>> any time although you still have plenty of memory on ZONE_MOVAVLE (and even
>> swap!), simply because you are making excessive use of unmovable allocations
>> (for user space!) in an environment where you should not make excessive use
>> of unmovable allocations (e.g., where should page tables go?).
> 
> yes, you are right of course and I am not really disputing this. But I
> would argue that 2:1 Movable/Normal is something to expect problems
> already. "Lowmem" allocations can easily trigger OOM even without secret
> mem in the picture. It all just takes to allocate a lot of GFP_KERNEL or
> even GFP_{HIGH}USER. Really, it is CMA/MOVABLE that are elephant in the
> room and one has to be really careful when relying on them.

Right, it's all about what the setup actually needs. Sure, there are 
cases where you need significantly more GFP_KERNEL/GFP_{HIGH}USER such 
that a 2:1 ratio is not feasible. But I claim that these are corner cases.

Secretmem gives user space the option to allocate a lot of 
GFP_{HIGH}USER memory. If I am not wrong, "ulimit -a" tells me that each 
application on F33 can allocate 16 GiB (!) of secretmem.

Which other ways do you know where random user space can do something 
similar? I'd be curious what other scenarios there are where user space 
can easily allocate a lot of unmovable memory.

>   
>> The existing controls (mlock limit) don't really match the current semantics
>> of that memory. I repeat it once again: secretmem *currently* resembles
>> long-term pinned memory, not mlocked memory.
> 
> Well, if we had a proper user space pinning accounting then I would
> agree that there is a better model to use. But we don't. And previous
> attempts to achieve that have failed. So I am afraid that we do not have
> much choice left than using mlock as a model.

Yes, I agree.

> 
>> Things will change when
>> implementing migration support for secretmem pages. Until then, the
>> semantics are different and this should be spelled out.
>>
>> For long-term pinnings this is kind of obvious, still we're now documenting
>> it because it's dangerous to not be aware of. Secretmem behaves exactly the
>> same and I think this is worth spelling out: secretmem has the potential of
>> being used much more often than fairly special vfio/rdma/ ...
> 
> yeah I do agree that pinning is a problem for movable/CMA but most
> people simply do not care about those. Movable is the thing for hoptlug
> and a really weird fragmentation avoidance IIRC and CMA is mostly to

+ handling gigantic pages dynamically

> handle crap HW. If those are to be used along with secret mem or
> longterm GUP then they will constantly bump into corner cases. Do not
> take me wrong, we should be looking at those problems, we should even
> document them but I do not see this as anything new. We should probably
> have a central place in Documentation explaining all those problems. I

Exactly.

> would be even happy to see an explicit note in the tunables - e.g.
> configuring movable/normal in 2:1 will get you back to 32b times wrt.
> low mem problems.

In most setups, ratios of 1:1 up to 4:1 work reasonably well. Of course, 
it's not applicable to all setups (obviously).

For example, oVirt has been using ratios of 3:1 for a long time. (online 
all memory to ZONE_MOVABLE in the guest, never hotplug more than 3x boot 
memory size). Most distros end up onlining all hotplugged memory on bare 
metal to ZONE_MOVABLE, and I've seen basically no bug reports related to 
that.

Highmem was a little different, yet similar. RHEL provided the bigmem 
kernel with ratios of 60:4, which worked in many setups. The main 
difference to highmem was that e.g., pagetables could be placed onto it. 
So ratios like 18:1 are completely insane with ZONE_MOVABLE.

I am constantly trying to fight for making more stuff MOVABLE instead of 
going into the other direction (e.g., because it's easier to implement, 
which feels like the wrong direction).

Maybe I am the only person that really cares about ZONE_MOVABLE these 
days :) I can't stop such new stuff from popping up, so at least I want 
it to be documented.
David Hildenbrand Feb. 9, 2021, 10:30 a.m. UTC | #8
On 09.02.21 11:23, David Hildenbrand wrote:
>>>> A lot of unevictable memory is a concern regardless of CMA/ZONE_MOVABLE.
>>>> As I've said it is quite easy to land at the similar situation even with
>>>> tmpfs/MAP_ANON|MAP_SHARED on swapless system. Neither of the two is
>>>> really uncommon. It would be even worse that those would be allowed to
>>>> consume both CMA/ZONE_MOVABLE.
>>>
>>> IIRC, tmpfs/MAP_ANON|MAP_SHARED memory
>>> a) Is movable, can land in ZONE_MOVABLE/CMA
>>> b) Can be limited by sizing tmpfs appropriately
>>>
>>> AFAIK, what you describe is a problem with memory overcommit, not with zone
>>> imbalances (below). Or what am I missing?
>>
>> It can be problem for both. If you have just too much of shm (do not
>> forget about MAP_SHARED|MAP_ANON which is much harder to size from an
>> admin POV) then migrateability doesn't really help because you need a
>> free memory to migrate. Without reclaimability this can easily become a
>> problem. That is why I am saying this is not really a new problem.
>> Swapless systems are not all that uncommon.
> 
> I get your point, it's similar but still different. "no memory in the
> system" vs. "plenty of unusable free memory available in the system".
> 
> In many setups, memory for user space applications can go to
> ZONE_MOVABLE just fine. ZONE_NORMAL etc. can be used for supporting user
> space memory (e.g., page tables) and other kernel stuff.
> 
> Like, have 4GB of ZONE_MOVABLE with 2GB of ZONE_NORMAL. Have an
> application (database) that allocates 4GB of memory. Works just fine.
> The zone ratio ends up being a problem for example with many processes
> (-> many page tables).
> 
> Not being able to put user space memory into the movable zone is a
> special case. And we are introducing yet another special case here
> (besides vfio, rdma, unmigratable huge pages like gigantic pages).
> 
> With plenty of secretmem, looking at /proc/meminfo Total vs. Free can be
> a big lie of how your system behaves.
> 
>>    
>>>> One has to be very careful when relying on CMA or movable zones. This is
>>>> definitely worth a comment in the kernel command line parameter
>>>> documentation. But this is not a new problem.
>>>
>>> I see the following thing worth documenting:
>>>
>>> Assume you have a system with 2GB of ZONE_NORMAL/ZONE_DMA and 4GB of
>>> ZONE_MOVABLE/CMA.
>>>
>>> Assume you make use of 1.5GB of secretmem. Your system might run into OOM
>>> any time although you still have plenty of memory on ZONE_MOVABLE (and even
>>> swap!), simply because you are making excessive use of unmovable allocations
>>> (for user space!) in an environment where you should not make excessive use
>>> of unmovable allocations (e.g., where should page tables go?).
>>
>> Yes, you are right of course and I am not really disputing this. But I
>> would argue that with 2:1 Movable/Normal, problems are to be expected
>> already. "Lowmem" allocations can easily trigger OOM even without secret
>> mem in the picture. All it takes is allocating a lot of GFP_KERNEL or
>> even GFP_{HIGH}USER. Really, it is CMA/MOVABLE that are the elephant in the
>> room and one has to be really careful when relying on them.
> 
> Right, it's all about what the setup actually needs. Sure, there are
> cases where you need significantly more GFP_KERNEL/GFP_{HIGH}USER such
> that a 2:1 ratio is not feasible. But I claim that these are corner cases.
> 
> Secretmem gives user space the option to allocate a lot of
> GFP_{HIGH}USER memory. If I am not wrong, "ulimit -a" tells me that each
> application on F33 can allocate 16 GiB (!) of secretmem.

Got to learn to do my math. It's 16 MiB - so as a default it's less 
dangerous than I thought!
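
The unit slip is easy to make because bash prints the locked-memory limit in
KiB. Below is a minimal sketch of the conversion, assuming the limit in
question is RLIMIT_MEMLOCK (the mail above only says "ulimit -a"). The helper
names are illustrative, and 16384 is simply the KiB value that corresponds to
16 MiB, not a number quoted in this thread.

```python
import resource

KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def kib_to_bytes(kib):
    """Convert a KiB count (the unit bash's ulimit prints for -l) to bytes."""
    return kib * KIB


def format_size(num_bytes):
    """Render a byte count in the largest binary unit that fits."""
    if num_bytes >= GIB:
        return f"{num_bytes / GIB:.1f} GiB"
    if num_bytes >= MIB:
        return f"{num_bytes / MIB:.1f} MiB"
    return f"{num_bytes / KIB:.1f} KiB"


def memlock_limit_text():
    """Return the current soft RLIMIT_MEMLOCK as human-readable text."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    if soft == resource.RLIM_INFINITY:
        return "unlimited"
    return format_size(soft)


if __name__ == "__main__":
    # Assumed shell value: 16384 KiB is 16 MiB, not 16 GiB.
    print(format_size(kib_to_bytes(16384)))
    print("soft RLIMIT_MEMLOCK:", memlock_limit_text())
```

The actual default differs between distributions and service managers, so the
second line of output is only meaningful for the system it is run on.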
Michal Hocko Feb. 9, 2021, 1:25 p.m. UTC | #9
On Tue 09-02-21 11:23:35, David Hildenbrand wrote:
[...]
> I am constantly trying to fight for making more stuff MOVABLE instead of
> going into the other direction (e.g., because it's easier to implement,
> which feels like the wrong direction).
> 
> Maybe I am the only person that really cares about ZONE_MOVABLE these days
> :) I can't stop such new stuff from popping up, so at least I want it to be
> documented.

MOVABLE zone is certainly an important thing to keep working. And there
is still quite a lot of work on the way. But as I've said this is more
of an outlier than the norm. On the other hand, the movable zone is kind of a
hard requirement for a lot of applications and it is to be expected that
many features will be less than 100% compatible. Some use cases are even
impossible. That's why I am arguing that we should have a central
document where the movable zone is documented with all the potential
problems we have encountered over time and explicitly state which
features are fully/partially incompatible.
David Hildenbrand Feb. 9, 2021, 4:17 p.m. UTC | #10
On 09.02.21 14:25, Michal Hocko wrote:
> On Tue 09-02-21 11:23:35, David Hildenbrand wrote:
> [...]
>> I am constantly trying to fight for making more stuff MOVABLE instead of
>> going into the other direction (e.g., because it's easier to implement,
>> which feels like the wrong direction).
>>
>> Maybe I am the only person that really cares about ZONE_MOVABLE these days
>> :) I can't stop such new stuff from popping up, so at least I want it to be
>> documented.
> 
> MOVABLE zone is certainly an important thing to keep working. And there
> is still quite a lot of work on the way. But as I've said this is more
> of an outlier than the norm. On the other hand, the movable zone is kind of a
> hard requirement for a lot of applications and it is to be expected that
> many features will be less than 100% compatible. Some use cases are even
> impossible. That's why I am arguing that we should have a central
> document where the movable zone is documented with all the potential
> problems we have encountered over time and explicitly state which
> features are fully/partially incompatible.
> 

I'll send a mail in the coming weeks to gather current restrictions to
document them (and include my brain dump). We might see more excessive
use of ZONE_MOVABLE in the future and, as history has told us, of CMA as
well. We really should start documenting/caring.

@Mike, it would be sufficient for me if one of your patches at least
mentions the situation in the description like

"Please note that secretmem currently behaves much more like long-term
GUP than like mlocked memory; secretmem is unmovable memory directly
consumed/controlled by user space. secretmem cannot be placed onto 
ZONE_MOVABLE/CMA.

As long as there is no excessive use of secretmem (e.g., maximum of 16 
MiB for selected processes) in combination with ZONE_MOVABLE/CMA, this 
is barely a real issue. However, it is something to keep in mind when a 
significant amount of system RAM might be used for secretmem. In the 
future, we might support migration of secretmem and make it look much 
more like mlocked memory instead."

Just a suggestion.
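
For a rough feel of the numbers, here is a back-of-the-envelope sketch of the
scenario discussed earlier in the thread (2 GB of ZONE_NORMAL/ZONE_DMA next to
4 GB of ZONE_MOVABLE/CMA). It assumes, as stated above, that secretmem can only
be served from the non-movable zones. The function names and the process count
are illustrative and not taken from the patches; sizes are treated as binary
units for simplicity.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def unmovable_headroom(normal_bytes, secretmem_bytes):
    """Bytes of non-movable zone memory left for page tables and kernel use."""
    return max(normal_bytes - secretmem_bytes, 0)


def secretmem_share(normal_bytes, per_process_bytes, processes):
    """Fraction of non-movable memory used if every process fills its limit."""
    return per_process_bytes * processes / normal_bytes


if __name__ == "__main__":
    normal = 2 * GIB

    # Excessive use: 1.5 GiB of secretmem leaves only 512 MiB for everything
    # else that must live outside ZONE_MOVABLE/CMA.
    left = unmovable_headroom(normal, 3 * GIB // 2)
    print(f"headroom with 1.5 GiB secretmem: {left // MIB} MiB")

    # Default-sized use: 8 (hypothetical) processes at 16 MiB each.
    share = secretmem_share(normal, 16 * MIB, 8)
    print(f"share with 8 x 16 MiB: {share:.2%}")
```

The sketch ignores every other unmovable consumer, so the real headroom is
smaller than what it prints.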
Michal Hocko Feb. 9, 2021, 8:08 p.m. UTC | #11
On Tue 09-02-21 17:17:22, David Hildenbrand wrote:
> On 09.02.21 14:25, Michal Hocko wrote:
> > On Tue 09-02-21 11:23:35, David Hildenbrand wrote:
> > [...]
> > > I am constantly trying to fight for making more stuff MOVABLE instead of
> > > going into the other direction (e.g., because it's easier to implement,
> > > which feels like the wrong direction).
> > > 
> > > Maybe I am the only person that really cares about ZONE_MOVABLE these days
> > > :) I can't stop such new stuff from popping up, so at least I want it to be
> > > documented.
> > 
> > MOVABLE zone is certainly an important thing to keep working. And there
> > is still quite a lot of work on the way. But as I've said this is more
> > of an outlier than the norm. On the other hand, the movable zone is kind of
> > a hard requirement for a lot of applications and it is to be expected that
> > many features will be less than 100% compatible. Some use cases are even
> > impossible. That's why I am arguing that we should have a central
> > document where the movable zone is documented with all the potential
> > problems we have encountered over time and explicitly state which
> > features are fully/partially incompatible.
> > 
> 
> I'll send a mail in the coming weeks to gather current restrictions to
> document them (and include my brain dump). We might see more excessive use
> of ZONE_MOVABLE in the future and, as history has told us, of CMA as well. We
> really should start documenting/caring.

Excellent! Thanks a lot. I will do my best to help review that.

> @Mike, it would be sufficient for me if one of your patches at least mentions
> the situation in the description like
> 
> "Please note that secretmem currently behaves much more like long-term GUP
> than like mlocked memory; secretmem is unmovable memory directly
> consumed/controlled by user space. secretmem cannot be placed onto
> ZONE_MOVABLE/CMA.

Sounds good to me.