mbox series

[00/16] mm: Introduce MAP_BELOW_HINT

Message ID 20240827-patches-below_hint_mmap-v1-0-46ff2eb9022d@rivosinc.com (mailing list archive)
Headers show
Series mm: Introduce MAP_BELOW_HINT | expand

Message

Charlie Jenkins Aug. 28, 2024, 5:49 a.m. UTC
Some applications rely on placing data in free bits addresses allocated
by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
address returned by mmap to be less than the maximum address space,
unless the hint address is greater than this value.

On arm64 this barrier is at 52 bits and on x86 it is at 56 bits. This
flag allows applications a way to specify exactly how many bits they
want to be left unused by mmap. This eliminates the need for
applications to know the page table hierarchy of the system to be able
to reason which addresses mmap will be allowed to return.

---
riscv made this feature of mmap returning addresses less than the hint
address the default behavior. This was in contrast to the implementation
of x86/arm64 that have a single boundary at the 5-level page table
region. However this restriction proved too great -- the reduced
address space when using a hint address was too small.

A patch for riscv [1] reverts the behavior that broke userspace. This
series serves to make this feature available to all architectures.

I have only tested on riscv and x86. There is a tremendous amount of
duplicated code in mmap so the implementations across architectures I
believe should be mostly consistent. I added this feature to all
architectures that implement either
arch_get_mmap_end()/arch_get_mmap_base() or
arch_get_unmapped_area_topdown()/arch_get_unmapped_area(). I also added
it to the default behavior for arch_get_mmap_end()/arch_get_mmap_base().

Link: https://lore.kernel.org/lkml/20240826-riscv_mmap-v1-2-cd8962afe47f@rivosinc.com/T/ [1]

To: Arnd Bergmann <arnd@arndb.de>
To: Paul Walmsley <paul.walmsley@sifive.com>
To: Palmer Dabbelt <palmer@dabbelt.com>
To: Albert Ou <aou@eecs.berkeley.edu>
To: Catalin Marinas <catalin.marinas@arm.com>
To: Will Deacon <will@kernel.org>
To: Michael Ellerman <mpe@ellerman.id.au>
To: Nicholas Piggin <npiggin@gmail.com>
To: Christophe Leroy <christophe.leroy@csgroup.eu>
To: Naveen N Rao <naveen@kernel.org>
To: Muchun Song <muchun.song@linux.dev>
To: Andrew Morton <akpm@linux-foundation.org>
To: Liam R. Howlett <Liam.Howlett@oracle.com>
To: Vlastimil Babka <vbabka@suse.cz>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: Thomas Gleixner <tglx@linutronix.de>
To: Ingo Molnar <mingo@redhat.com>
To: Borislav Petkov <bp@alien8.de>
To: Dave Hansen <dave.hansen@linux.intel.com>
To: x86@kernel.org
To: H. Peter Anvin <hpa@zytor.com>
To: Huacai Chen <chenhuacai@kernel.org>
To: WANG Xuerui <kernel@xen0n.name>
To: Russell King <linux@armlinux.org.uk>
To: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
To: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
To: Helge Deller <deller@gmx.de>
To: Alexander Gordeev <agordeev@linux.ibm.com>
To: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
To: Heiko Carstens <hca@linux.ibm.com>
To: Vasily Gorbik <gor@linux.ibm.com>
To: Christian Borntraeger <borntraeger@linux.ibm.com>
To: Sven Schnelle <svens@linux.ibm.com>
To: Yoshinori Sato <ysato@users.sourceforge.jp>
To: Rich Felker <dalias@libc.org>
To: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
To: David S. Miller <davem@davemloft.net>
To: Andreas Larsson <andreas@gaisler.com>
To: Shuah Khan <shuah@kernel.org>
To: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Palmer Dabbelt <palmer@rivosinc.com>
Cc: linux-riscv@lists.infradead.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mm@kvack.org
Cc: loongarch@lists.linux.dev
Cc: linux-mips@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>

---
Charlie Jenkins (16):
      mm: Add MAP_BELOW_HINT
      riscv: mm: Do not restrict mmap address based on hint
      mm: Add flag and len param to arch_get_mmap_base()
      mm: Add generic MAP_BELOW_HINT
      riscv: mm: Support MAP_BELOW_HINT
      arm64: mm: Support MAP_BELOW_HINT
      powerpc: mm: Support MAP_BELOW_HINT
      x86: mm: Support MAP_BELOW_HINT
      loongarch: mm: Support MAP_BELOW_HINT
      arm: mm: Support MAP_BELOW_HINT
      mips: mm: Support MAP_BELOW_HINT
      parisc: mm: Support MAP_BELOW_HINT
      s390: mm: Support MAP_BELOW_HINT
      sh: mm: Support MAP_BELOW_HINT
      sparc: mm: Support MAP_BELOW_HINT
      selftests/mm: Create MAP_BELOW_HINT test

 arch/arm/mm/mmap.c                           | 10 ++++++++
 arch/arm64/include/asm/processor.h           | 34 ++++++++++++++++++++++----
 arch/loongarch/mm/mmap.c                     | 11 +++++++++
 arch/mips/mm/mmap.c                          |  9 +++++++
 arch/parisc/include/uapi/asm/mman.h          |  1 +
 arch/parisc/kernel/sys_parisc.c              |  9 +++++++
 arch/powerpc/include/asm/task_size_64.h      | 36 +++++++++++++++++++++++-----
 arch/riscv/include/asm/processor.h           | 32 -------------------------
 arch/s390/mm/mmap.c                          | 10 ++++++++
 arch/sh/mm/mmap.c                            | 10 ++++++++
 arch/sparc/kernel/sys_sparc_64.c             |  8 +++++++
 arch/x86/kernel/sys_x86_64.c                 | 25 ++++++++++++++++---
 fs/hugetlbfs/inode.c                         |  2 +-
 include/linux/sched/mm.h                     | 34 ++++++++++++++++++++++++--
 include/uapi/asm-generic/mman-common.h       |  1 +
 mm/mmap.c                                    |  2 +-
 tools/arch/parisc/include/uapi/asm/mman.h    |  1 +
 tools/include/uapi/asm-generic/mman-common.h |  1 +
 tools/testing/selftests/mm/Makefile          |  1 +
 tools/testing/selftests/mm/map_below_hint.c  | 29 ++++++++++++++++++++++
 20 files changed, 216 insertions(+), 50 deletions(-)
---
base-commit: 5be63fc19fcaa4c236b307420483578a56986a37
change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55

Comments

Lorenzo Stoakes Aug. 28, 2024, 6:19 p.m. UTC | #1
On Tue, Aug 27, 2024 at 10:49:06PM GMT, Charlie Jenkins wrote:
> Some applications rely on placing data in free bits addresses allocated
> by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> address returned by mmap to be less than the maximum address space,
> unless the hint address is greater than this value.
>
> On arm64 this barrier is at 52 bits and on x86 it is at 56 bits. This
> flag allows applications a way to specify exactly how many bits they
> want to be left unused by mmap. This eliminates the need for
> applications to know the page table hierarchy of the system to be able
> to reason which addresses mmap will be allowed to return.
>
> ---
> riscv made this feature of mmap returning addresses less than the hint
> address the default behavior. This was in contrast to the implementation
> of x86/arm64 that have a single boundary at the 5-level page table
> region. However this restriction proved too great -- the reduced
> address space when using a hint address was too small.
>
> A patch for riscv [1] reverts the behavior that broke userspace. This
> series serves to make this feature available to all architectures.

I'm a little confused as to the justification for this - you broke RISC V by
doing this, and have now reverted it, but now offer the same behaviour that
broke RISC V to all other architectures?

I mean this is how this reads, so I might be being ungenerous here :) but would
be good to clarify what the value-add is here.

I also wonder at use of a new MAP_ flag, they're a limited resource and we
should only ever add them if we _really_ need to. This seems a bit niche and
specific to be making such a big change for including touching a bunch of pretty
sensitive arch-specific code.

We have the ability to change how mmap() functions through 'personalities'
though of course this would impact every mmap() call in the process.

Overall I'm really not hugely convinced by this, it feels like userland
could find better ways of doing this (mostly you'd do a PROT_NONE mmap() to
reserve a domain and mprotect() it on allocation or mmap() over it).

So I just struggle to see the purpose myself. BUT absolutely I may be
missing context/others may have a view on the value of this. So happy to
stand corrected.

>
> I have only tested on riscv and x86. There is a tremendous amount of

Yeah, OK this is crazy, you can't really submit something as non-RFC that
touches every single arch and not test it.

I also feel like we need more justification than 'this is a neat thing that
we use in RISC V sometimes' conceptually for such a big change.

Also your test program is currently completely broken afaict (have
commented on it directly). I also feel like your test program is a little
rudimentary, and should test some edge cases close to the limit etc.

So I think this is a NACK until there is testing across the board and a little
more justification.

Feel free to respin, but I think any future revisions should be RFC until
we're absolutely sure on testing/justification.

I appreciate your efforts here so sorry to be negative, but just obviously
want to make sure this is functional and trades off added complexity for
value for the kernel and userland :)

Thanks!

> duplicated code in mmap so the implementations across architectures I
> believe should be mostly consistent. I added this feature to all
> architectures that implement either
> arch_get_mmap_end()/arch_get_mmap_base() or
> arch_get_unmapped_area_topdown()/arch_get_unmapped_area(). I also added
> it to the default behavior for arch_get_mmap_end()/arch_get_mmap_base().
>
> Link: https://lore.kernel.org/lkml/20240826-riscv_mmap-v1-2-cd8962afe47f@rivosinc.com/T/ [1]
>
> To: Arnd Bergmann <arnd@arndb.de>
> To: Paul Walmsley <paul.walmsley@sifive.com>
> To: Palmer Dabbelt <palmer@dabbelt.com>
> To: Albert Ou <aou@eecs.berkeley.edu>
> To: Catalin Marinas <catalin.marinas@arm.com>
> To: Will Deacon <will@kernel.org>
> To: Michael Ellerman <mpe@ellerman.id.au>
> To: Nicholas Piggin <npiggin@gmail.com>
> To: Christophe Leroy <christophe.leroy@csgroup.eu>
> To: Naveen N Rao <naveen@kernel.org>
> To: Muchun Song <muchun.song@linux.dev>
> To: Andrew Morton <akpm@linux-foundation.org>
> To: Liam R. Howlett <Liam.Howlett@oracle.com>
> To: Vlastimil Babka <vbabka@suse.cz>
> To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> To: Thomas Gleixner <tglx@linutronix.de>
> To: Ingo Molnar <mingo@redhat.com>
> To: Borislav Petkov <bp@alien8.de>
> To: Dave Hansen <dave.hansen@linux.intel.com>
> To: x86@kernel.org
> To: H. Peter Anvin <hpa@zytor.com>
> To: Huacai Chen <chenhuacai@kernel.org>
> To: WANG Xuerui <kernel@xen0n.name>
> To: Russell King <linux@armlinux.org.uk>
> To: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
> To: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
> To: Helge Deller <deller@gmx.de>
> To: Alexander Gordeev <agordeev@linux.ibm.com>
> To: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> To: Heiko Carstens <hca@linux.ibm.com>
> To: Vasily Gorbik <gor@linux.ibm.com>
> To: Christian Borntraeger <borntraeger@linux.ibm.com>
> To: Sven Schnelle <svens@linux.ibm.com>
> To: Yoshinori Sato <ysato@users.sourceforge.jp>
> To: Rich Felker <dalias@libc.org>
> To: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
> To: David S. Miller <davem@davemloft.net>
> To: Andreas Larsson <andreas@gaisler.com>
> To: Shuah Khan <shuah@kernel.org>
> To: Alexandre Ghiti <alexghiti@rivosinc.com>
> Cc: linux-arch@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Palmer Dabbelt <palmer@rivosinc.com>
> Cc: linux-riscv@lists.infradead.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-mm@kvack.org
> Cc: loongarch@lists.linux.dev
> Cc: linux-mips@vger.kernel.org
> Cc: linux-parisc@vger.kernel.org
> Cc: linux-s390@vger.kernel.org
> Cc: linux-sh@vger.kernel.org
> Cc: sparclinux@vger.kernel.org
> Cc: linux-kselftest@vger.kernel.org
> Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
>
> ---
> Charlie Jenkins (16):
>       mm: Add MAP_BELOW_HINT
>       riscv: mm: Do not restrict mmap address based on hint
>       mm: Add flag and len param to arch_get_mmap_base()
>       mm: Add generic MAP_BELOW_HINT
>       riscv: mm: Support MAP_BELOW_HINT
>       arm64: mm: Support MAP_BELOW_HINT
>       powerpc: mm: Support MAP_BELOW_HINT
>       x86: mm: Support MAP_BELOW_HINT
>       loongarch: mm: Support MAP_BELOW_HINT
>       arm: mm: Support MAP_BELOW_HINT
>       mips: mm: Support MAP_BELOW_HINT
>       parisc: mm: Support MAP_BELOW_HINT
>       s390: mm: Support MAP_BELOW_HINT
>       sh: mm: Support MAP_BELOW_HINT
>       sparc: mm: Support MAP_BELOW_HINT
>       selftests/mm: Create MAP_BELOW_HINT test
>
>  arch/arm/mm/mmap.c                           | 10 ++++++++
>  arch/arm64/include/asm/processor.h           | 34 ++++++++++++++++++++++----
>  arch/loongarch/mm/mmap.c                     | 11 +++++++++
>  arch/mips/mm/mmap.c                          |  9 +++++++
>  arch/parisc/include/uapi/asm/mman.h          |  1 +
>  arch/parisc/kernel/sys_parisc.c              |  9 +++++++
>  arch/powerpc/include/asm/task_size_64.h      | 36 +++++++++++++++++++++++-----
>  arch/riscv/include/asm/processor.h           | 32 -------------------------
>  arch/s390/mm/mmap.c                          | 10 ++++++++
>  arch/sh/mm/mmap.c                            | 10 ++++++++
>  arch/sparc/kernel/sys_sparc_64.c             |  8 +++++++
>  arch/x86/kernel/sys_x86_64.c                 | 25 ++++++++++++++++---
>  fs/hugetlbfs/inode.c                         |  2 +-
>  include/linux/sched/mm.h                     | 34 ++++++++++++++++++++++++--
>  include/uapi/asm-generic/mman-common.h       |  1 +
>  mm/mmap.c                                    |  2 +-
>  tools/arch/parisc/include/uapi/asm/mman.h    |  1 +
>  tools/include/uapi/asm-generic/mman-common.h |  1 +
>  tools/testing/selftests/mm/Makefile          |  1 +
>  tools/testing/selftests/mm/map_below_hint.c  | 29 ++++++++++++++++++++++
>  20 files changed, 216 insertions(+), 50 deletions(-)
> ---
> base-commit: 5be63fc19fcaa4c236b307420483578a56986a37
> change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55
> --
> - Charlie
>
>
> _______________________________________________
> linux-riscv mailing list
> linux-riscv@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv
Dave Hansen Aug. 28, 2024, 6:29 p.m. UTC | #2
On 8/27/24 22:49, Charlie Jenkins wrote:
> Some applications rely on placing data in free bits addresses allocated
> by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> address returned by mmap to be less than the maximum address space,
> unless the hint address is greater than this value.

Which applications are these, btw?

Is this the same crowd as the folks who are using the address tagging
features like X86_FEATURE_LAM?

Even if they are different, I also wonder if a per-mmap() thing
MAP_BELOW_HINT is really what we want.  Or should the applications
you're trying to service here use a similar mechanism to how LAM affects
the *whole* address space as opposed to an individual mmap().
Liam R. Howlett Aug. 28, 2024, 6:31 p.m. UTC | #3
* Charlie Jenkins <charlie@rivosinc.com> [240828 01:49]:
> Some applications rely on placing data in free bits addresses allocated
> by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> address returned by mmap to be less than the maximum address space,
> unless the hint address is greater than this value.

Wait, what arch(s) allows for greater than the max?  The passed hint
should be where we start searching, but we go to the lower limit then
start at the hint and search up (or vice-versa on the directions).

I don't understand how unmapping works on a higher address; we would
fail to free it on termination of the application.

Also, there are archs that map outside of the VMAs, which are freed by
freeing from the prev->vm_end to next->vm_start, so I don't understand
what that looks like in this reality as well.

> 
> On arm64 this barrier is at 52 bits and on x86 it is at 56 bits. This
> flag allows applications a way to specify exactly how many bits they
> want to be left unused by mmap. This eliminates the need for
> applications to know the page table hierarchy of the system to be able
> to reason which addresses mmap will be allowed to return.

But, why do they need to know today?  We have a limit for this don't we?

Also, these upper limits are how some archs use the upper bits that you
are trying to use.

> 
> ---
> riscv made this feature of mmap returning addresses less than the hint
> address the default behavior. This was in contrast to the implementation
> of x86/arm64 that have a single boundary at the 5-level page table
> region. However this restriction proved too great -- the reduced
> address space when using a hint address was too small.

Yes, the hint is used to group things close together so it would
literally be random chance on if you have enough room or not (aslr and
all).

> 
> A patch for riscv [1] reverts the behavior that broke userspace. This
> series serves to make this feature available to all architectures.

I don't fully understand this statement, you say it broke userspace so
now you are porting it to everyone?  This reads as if you are braking
the userspace on all architectures :)

If you fail to find room below, then your application fails as there is
no way to get the upper bits you need.  It would be better to fix this
in userspace - if your application is returned too high an address, then
free it and exit because it's going to fail anyways.

> 
> I have only tested on riscv and x86.

This should be an RFC then.

> There is a tremendous amount of
> duplicated code in mmap so the implementations across architectures I
> believe should be mostly consistent. I added this feature to all
> architectures that implement either
> arch_get_mmap_end()/arch_get_mmap_base() or
> arch_get_unmapped_area_topdown()/arch_get_unmapped_area(). I also added
> it to the default behavior for arch_get_mmap_end()/arch_get_mmap_base().

Way too much duplicate code.  We should be figuring out how to make this
all work with the same code.

This is going to make the cloned code problem worse.

> 
> Link: https://lore.kernel.org/lkml/20240826-riscv_mmap-v1-2-cd8962afe47f@rivosinc.com/T/ [1]
> 
> To: Arnd Bergmann <arnd@arndb.de>
> To: Paul Walmsley <paul.walmsley@sifive.com>
> To: Palmer Dabbelt <palmer@dabbelt.com>
> To: Albert Ou <aou@eecs.berkeley.edu>
> To: Catalin Marinas <catalin.marinas@arm.com>
> To: Will Deacon <will@kernel.org>
> To: Michael Ellerman <mpe@ellerman.id.au>
> To: Nicholas Piggin <npiggin@gmail.com>
> To: Christophe Leroy <christophe.leroy@csgroup.eu>
> To: Naveen N Rao <naveen@kernel.org>
> To: Muchun Song <muchun.song@linux.dev>
> To: Andrew Morton <akpm@linux-foundation.org>
> To: Liam R. Howlett <Liam.Howlett@oracle.com>
> To: Vlastimil Babka <vbabka@suse.cz>
> To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> To: Thomas Gleixner <tglx@linutronix.de>
> To: Ingo Molnar <mingo@redhat.com>
> To: Borislav Petkov <bp@alien8.de>
> To: Dave Hansen <dave.hansen@linux.intel.com>
> To: x86@kernel.org
> To: H. Peter Anvin <hpa@zytor.com>
> To: Huacai Chen <chenhuacai@kernel.org>
> To: WANG Xuerui <kernel@xen0n.name>
> To: Russell King <linux@armlinux.org.uk>
> To: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
> To: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
> To: Helge Deller <deller@gmx.de>
> To: Alexander Gordeev <agordeev@linux.ibm.com>
> To: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> To: Heiko Carstens <hca@linux.ibm.com>
> To: Vasily Gorbik <gor@linux.ibm.com>
> To: Christian Borntraeger <borntraeger@linux.ibm.com>
> To: Sven Schnelle <svens@linux.ibm.com>
> To: Yoshinori Sato <ysato@users.sourceforge.jp>
> To: Rich Felker <dalias@libc.org>
> To: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
> To: David S. Miller <davem@davemloft.net>
> To: Andreas Larsson <andreas@gaisler.com>
> To: Shuah Khan <shuah@kernel.org>
> To: Alexandre Ghiti <alexghiti@rivosinc.com>
> Cc: linux-arch@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Palmer Dabbelt <palmer@rivosinc.com>
> Cc: linux-riscv@lists.infradead.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-mm@kvack.org
> Cc: loongarch@lists.linux.dev
> Cc: linux-mips@vger.kernel.org
> Cc: linux-parisc@vger.kernel.org
> Cc: linux-s390@vger.kernel.org
> Cc: linux-sh@vger.kernel.org
> Cc: sparclinux@vger.kernel.org
> Cc: linux-kselftest@vger.kernel.org
> Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
> 
> ---
> Charlie Jenkins (16):
>       mm: Add MAP_BELOW_HINT
>       riscv: mm: Do not restrict mmap address based on hint
>       mm: Add flag and len param to arch_get_mmap_base()
>       mm: Add generic MAP_BELOW_HINT
>       riscv: mm: Support MAP_BELOW_HINT
>       arm64: mm: Support MAP_BELOW_HINT
>       powerpc: mm: Support MAP_BELOW_HINT
>       x86: mm: Support MAP_BELOW_HINT
>       loongarch: mm: Support MAP_BELOW_HINT
>       arm: mm: Support MAP_BELOW_HINT
>       mips: mm: Support MAP_BELOW_HINT
>       parisc: mm: Support MAP_BELOW_HINT
>       s390: mm: Support MAP_BELOW_HINT
>       sh: mm: Support MAP_BELOW_HINT
>       sparc: mm: Support MAP_BELOW_HINT
>       selftests/mm: Create MAP_BELOW_HINT test
> 
>  arch/arm/mm/mmap.c                           | 10 ++++++++
>  arch/arm64/include/asm/processor.h           | 34 ++++++++++++++++++++++----
>  arch/loongarch/mm/mmap.c                     | 11 +++++++++
>  arch/mips/mm/mmap.c                          |  9 +++++++
>  arch/parisc/include/uapi/asm/mman.h          |  1 +
>  arch/parisc/kernel/sys_parisc.c              |  9 +++++++
>  arch/powerpc/include/asm/task_size_64.h      | 36 +++++++++++++++++++++++-----
>  arch/riscv/include/asm/processor.h           | 32 -------------------------
>  arch/s390/mm/mmap.c                          | 10 ++++++++
>  arch/sh/mm/mmap.c                            | 10 ++++++++
>  arch/sparc/kernel/sys_sparc_64.c             |  8 +++++++
>  arch/x86/kernel/sys_x86_64.c                 | 25 ++++++++++++++++---
>  fs/hugetlbfs/inode.c                         |  2 +-
>  include/linux/sched/mm.h                     | 34 ++++++++++++++++++++++++--
>  include/uapi/asm-generic/mman-common.h       |  1 +
>  mm/mmap.c                                    |  2 +-
>  tools/arch/parisc/include/uapi/asm/mman.h    |  1 +
>  tools/include/uapi/asm-generic/mman-common.h |  1 +
>  tools/testing/selftests/mm/Makefile          |  1 +
>  tools/testing/selftests/mm/map_below_hint.c  | 29 ++++++++++++++++++++++
>  20 files changed, 216 insertions(+), 50 deletions(-)
> ---
> base-commit: 5be63fc19fcaa4c236b307420483578a56986a37
> change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55
> -- 
> - Charlie
>
Charlie Jenkins Aug. 28, 2024, 8:15 p.m. UTC | #4
On Wed, Aug 28, 2024 at 11:29:56AM -0700, Dave Hansen wrote:
> On 8/27/24 22:49, Charlie Jenkins wrote:
> > Some applications rely on placing data in free bits addresses allocated
> > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> > address returned by mmap to be less than the maximum address space,
> > unless the hint address is greater than this value.
> 
> Which applications are these, btw?

Java and Go require this feature. These applications store flags that
represent the type of data a pointer holds in the upper bits of the
pointer itself.

> 
> Is this the same crowd as the folks who are using the address tagging
> features like X86_FEATURE_LAM?

Yes it is. LAM helps to mask the bits out on x86, and this feature could
be used to ensure that mmap() doesn't return an address with bits that
would be masked out. I chose not to tie this feature to x86 LAM which
only has masking boundaries at 57 and 48 bits to allow it to be
independent of architecture specific address masking.

> 
> Even if they are different, I also wonder if a per-mmap() thing
> MAP_BELOW_HINT is really what we want.  Or should the applications
> you're trying to service here use a similar mechanism to how LAM affects
> the *whole* address space as opposed to an individual mmap().

LAM is required to be enabled for entire address spaces because the
hardware needs to be configured to mask out the bits. It is not possible
to influence the granularity of LAM in the current implementation.
However mmap() does not require any of this hardware configuration so it
is possible to have finer granularity.

A way to restrict mmap() to return LAM compliant addresses in an entire
address space also doesn't have to be mutually exclusive with this flag.
This flag allows for the greatest degree of control from applications.
I don't believe there is additionally performance saving that could be
achieved by having this be on a per address space basis.

Link: https://cdrdv2.intel.com/v1/dl/getContent/671368 [1]

- Charlie
Charlie Jenkins Aug. 28, 2024, 8:59 p.m. UTC | #5
On Wed, Aug 28, 2024 at 02:31:42PM -0400, Liam R. Howlett wrote:
> * Charlie Jenkins <charlie@rivosinc.com> [240828 01:49]:
> > Some applications rely on placing data in free bits addresses allocated
> > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> > address returned by mmap to be less than the maximum address space,
> > unless the hint address is greater than this value.
> 
> Wait, what arch(s) allows for greater than the max?  The passed hint
> should be where we start searching, but we go to the lower limit then
> start at the hint and search up (or vice-versa on the directions).
> 

I worded this awkwardly. On arm64 there is a page-table boundary at 48
bits and at 52 bits. On x86 the boundaries are at 48 bits and 57 bits.
The max value mmap is able to return on arm64 is 48 bits if the hint
address uses 48 bits or less, even if the architecture supports 5-level
paging and thus addresses can be 52 bits. Applications can opt-in to
using up to 52-bits in an address by using a hint address greater than
48 bits. x86 has the same behavior but with 57 bits instead of 52.

This reason this exists is because some applications arbitrarily replace
bits in virtual addresses with data with an assumption that the address
will not be using any of the bits above bit 48 in the virtual address.
As hardware with larger address spaces was released, x86 decided to
build safety guards into the kernel to allow the applications that made
these assumptions to continue to work on this different hardware.

This causes all application that use a hint address to silently be
restricted to 48-bit addresses. The goal of this flag is to have a way
for applications to explicitly request how many bits they want mmap to
use.

> I don't understand how unmapping works on a higher address; we would
> fail to free it on termination of the application.
> 
> Also, there are archs that map outside of the VMAs, which are freed by
> freeing from the prev->vm_end to next->vm_start, so I don't understand
> what that looks like in this reality as well.
> 
> > 
> > On arm64 this barrier is at 52 bits and on x86 it is at 56 bits. This
> > flag allows applications a way to specify exactly how many bits they
> > want to be left unused by mmap. This eliminates the need for
> > applications to know the page table hierarchy of the system to be able
> > to reason which addresses mmap will be allowed to return.
> 
> But, why do they need to know today?  We have a limit for this don't we?

The limit is different for different architectures. On x86 the limit is
57 bits, and on arm64 it is 52 bits. So in the theoretical case that an
application requires 10 bits free in a virtual address, the application
would always work on arm64 regardless of the hint address, but on x86 if
the hint address is greater than 48 bits then the application will not
work.

The goal of this flag is to have consistent and tunable behavior of
mmap() when it is desired to ensure that mmap() only returns addresses
that use some number of bits.

> 
> Also, these upper limits are how some archs use the upper bits that you
> are trying to use.
> 

It does not eliminate the existing behavior of the architectures to
place this upper limits, it instead provides a way to have consistent
behavior across all architectures.

> > 
> > ---
> > riscv made this feature of mmap returning addresses less than the hint
> > address the default behavior. This was in contrast to the implementation
> > of x86/arm64 that have a single boundary at the 5-level page table
> > region. However this restriction proved too great -- the reduced
> > address space when using a hint address was too small.
> 
> Yes, the hint is used to group things close together so it would
> literally be random chance on if you have enough room or not (aslr and
> all).
> 
> > 
> > A patch for riscv [1] reverts the behavior that broke userspace. This
> > series serves to make this feature available to all architectures.
> 
> I don't fully understand this statement, you say it broke userspace so
> now you are porting it to everyone?  This reads as if you are braking
> the userspace on all architectures :)

It was the default for mmap on riscv. The difference here is that it is now
enabled by a flag instead. Instead of making the flag specific to riscv,
I figured that other architectures might find it useful as well.

> 
> If you fail to find room below, then your application fails as there is
> no way to get the upper bits you need.  It would be better to fix this
> in userspace - if your application is returned too high an address, then
> free it and exit because it's going to fail anyways.
> 

This flag is trying to define an API that is more robust than the
current behavior on that x86 and arm64 which implicitly restricts mmap()
addresses to 48 bits. A solution could be to just write in the docs that
mmap() will always exhaust all addresses below the hint address before
returning an address that is above the hint address. However a flag that
defines this behavior seems more intuitive.

> > 
> > I have only tested on riscv and x86.
> 
> This should be an RFC then.

Fair enough.

> 
> > There is a tremendous amount of
> > duplicated code in mmap so the implementations across architectures I
> > believe should be mostly consistent. I added this feature to all
> > architectures that implement either
> > arch_get_mmap_end()/arch_get_mmap_base() or
> > arch_get_unmapped_area_topdown()/arch_get_unmapped_area(). I also added
> > it to the default behavior for arch_get_mmap_end()/arch_get_mmap_base().
> 
> Way too much duplicate code.  We should be figuring out how to make this
> all work with the same code.
> 
> This is going to make the cloned code problem worse.

That would require standardizing every architecture with the generic
mmap() framework that arm64 has developed. That is far outside the scope
of this patch, but would be a great area to research for each of the
architectures that do not use the generic framework.

- Charlie

> 
> > 
> > Link: https://lore.kernel.org/lkml/20240826-riscv_mmap-v1-2-cd8962afe47f@rivosinc.com/T/ [1]
> > 
> > To: Arnd Bergmann <arnd@arndb.de>
> > To: Paul Walmsley <paul.walmsley@sifive.com>
> > To: Palmer Dabbelt <palmer@dabbelt.com>
> > To: Albert Ou <aou@eecs.berkeley.edu>
> > To: Catalin Marinas <catalin.marinas@arm.com>
> > To: Will Deacon <will@kernel.org>
> > To: Michael Ellerman <mpe@ellerman.id.au>
> > To: Nicholas Piggin <npiggin@gmail.com>
> > To: Christophe Leroy <christophe.leroy@csgroup.eu>
> > To: Naveen N Rao <naveen@kernel.org>
> > To: Muchun Song <muchun.song@linux.dev>
> > To: Andrew Morton <akpm@linux-foundation.org>
> > To: Liam R. Howlett <Liam.Howlett@oracle.com>
> > To: Vlastimil Babka <vbabka@suse.cz>
> > To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > To: Thomas Gleixner <tglx@linutronix.de>
> > To: Ingo Molnar <mingo@redhat.com>
> > To: Borislav Petkov <bp@alien8.de>
> > To: Dave Hansen <dave.hansen@linux.intel.com>
> > To: x86@kernel.org
> > To: H. Peter Anvin <hpa@zytor.com>
> > To: Huacai Chen <chenhuacai@kernel.org>
> > To: WANG Xuerui <kernel@xen0n.name>
> > To: Russell King <linux@armlinux.org.uk>
> > To: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
> > To: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
> > To: Helge Deller <deller@gmx.de>
> > To: Alexander Gordeev <agordeev@linux.ibm.com>
> > To: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> > To: Heiko Carstens <hca@linux.ibm.com>
> > To: Vasily Gorbik <gor@linux.ibm.com>
> > To: Christian Borntraeger <borntraeger@linux.ibm.com>
> > To: Sven Schnelle <svens@linux.ibm.com>
> > To: Yoshinori Sato <ysato@users.sourceforge.jp>
> > To: Rich Felker <dalias@libc.org>
> > To: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
> > To: David S. Miller <davem@davemloft.net>
> > To: Andreas Larsson <andreas@gaisler.com>
> > To: Shuah Khan <shuah@kernel.org>
> > To: Alexandre Ghiti <alexghiti@rivosinc.com>
> > Cc: linux-arch@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Cc: Palmer Dabbelt <palmer@rivosinc.com>
> > Cc: linux-riscv@lists.infradead.org
> > Cc: linux-arm-kernel@lists.infradead.org
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Cc: linux-mm@kvack.org
> > Cc: loongarch@lists.linux.dev
> > Cc: linux-mips@vger.kernel.org
> > Cc: linux-parisc@vger.kernel.org
> > Cc: linux-s390@vger.kernel.org
> > Cc: linux-sh@vger.kernel.org
> > Cc: sparclinux@vger.kernel.org
> > Cc: linux-kselftest@vger.kernel.org
> > Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
> > 
> > ---
> > Charlie Jenkins (16):
> >       mm: Add MAP_BELOW_HINT
> >       riscv: mm: Do not restrict mmap address based on hint
> >       mm: Add flag and len param to arch_get_mmap_base()
> >       mm: Add generic MAP_BELOW_HINT
> >       riscv: mm: Support MAP_BELOW_HINT
> >       arm64: mm: Support MAP_BELOW_HINT
> >       powerpc: mm: Support MAP_BELOW_HINT
> >       x86: mm: Support MAP_BELOW_HINT
> >       loongarch: mm: Support MAP_BELOW_HINT
> >       arm: mm: Support MAP_BELOW_HINT
> >       mips: mm: Support MAP_BELOW_HINT
> >       parisc: mm: Support MAP_BELOW_HINT
> >       s390: mm: Support MAP_BELOW_HINT
> >       sh: mm: Support MAP_BELOW_HINT
> >       sparc: mm: Support MAP_BELOW_HINT
> >       selftests/mm: Create MAP_BELOW_HINT test
> > 
> >  arch/arm/mm/mmap.c                           | 10 ++++++++
> >  arch/arm64/include/asm/processor.h           | 34 ++++++++++++++++++++++----
> >  arch/loongarch/mm/mmap.c                     | 11 +++++++++
> >  arch/mips/mm/mmap.c                          |  9 +++++++
> >  arch/parisc/include/uapi/asm/mman.h          |  1 +
> >  arch/parisc/kernel/sys_parisc.c              |  9 +++++++
> >  arch/powerpc/include/asm/task_size_64.h      | 36 +++++++++++++++++++++++-----
> >  arch/riscv/include/asm/processor.h           | 32 -------------------------
> >  arch/s390/mm/mmap.c                          | 10 ++++++++
> >  arch/sh/mm/mmap.c                            | 10 ++++++++
> >  arch/sparc/kernel/sys_sparc_64.c             |  8 +++++++
> >  arch/x86/kernel/sys_x86_64.c                 | 25 ++++++++++++++++---
> >  fs/hugetlbfs/inode.c                         |  2 +-
> >  include/linux/sched/mm.h                     | 34 ++++++++++++++++++++++++--
> >  include/uapi/asm-generic/mman-common.h       |  1 +
> >  mm/mmap.c                                    |  2 +-
> >  tools/arch/parisc/include/uapi/asm/mman.h    |  1 +
> >  tools/include/uapi/asm-generic/mman-common.h |  1 +
> >  tools/testing/selftests/mm/Makefile          |  1 +
> >  tools/testing/selftests/mm/map_below_hint.c  | 29 ++++++++++++++++++++++
> >  20 files changed, 216 insertions(+), 50 deletions(-)
> > ---
> > base-commit: 5be63fc19fcaa4c236b307420483578a56986a37
> > change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55
> > -- 
> > - Charlie
> >
Charlie Jenkins Aug. 28, 2024, 9:39 p.m. UTC | #6
On Wed, Aug 28, 2024 at 01:59:18PM -0700, Charlie Jenkins wrote:
> On Wed, Aug 28, 2024 at 02:31:42PM -0400, Liam R. Howlett wrote:
> > * Charlie Jenkins <charlie@rivosinc.com> [240828 01:49]:
> > > Some applications rely on placing data in free bits addresses allocated
> > > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> > > address returned by mmap to be less than the maximum address space,
> > > unless the hint address is greater than this value.
> > 
> > Wait, what arch(s) allows for greater than the max?  The passed hint
> > should be where we start searching, but we go to the lower limit then
> > start at the hint and search up (or vice-versa on the directions).
> > 
> 
> I worded this awkwardly. On arm64 there is a page-table boundary at 48
> bits and at 52 bits. On x86 the boundaries are at 48 bits and 57 bits.
> The max value mmap is able to return on arm64 is 48 bits if the hint
> address uses 48 bits or less, even if the architecture supports 5-level
> paging and thus addresses can be 52 bits. Applications can opt-in to
> using up to 52-bits in an address by using a hint address greater than
> 48 bits. x86 has the same behavior but with 57 bits instead of 52.
> 
> This reason this exists is because some applications arbitrarily replace
> bits in virtual addresses with data with an assumption that the address
> will not be using any of the bits above bit 48 in the virtual address.
> As hardware with larger address spaces was released, x86 decided to
> build safety guards into the kernel to allow the applications that made
> these assumptions to continue to work on this different hardware.
> 
> This causes all application that use a hint address to silently be
> restricted to 48-bit addresses. The goal of this flag is to have a way
> for applications to explicitly request how many bits they want mmap to
> use.
> 
> > I don't understand how unmapping works on a higher address; we would
> > fail to free it on termination of the application.
> > 
> > Also, there are archs that map outside of the VMAs, which are freed by
> > freeing from the prev->vm_end to next->vm_start, so I don't understand
> > what that looks like in this reality as well.
> > 
> > > 
> > > On arm64 this barrier is at 52 bits and on x86 it is at 56 bits. This
> > > flag allows applications a way to specify exactly how many bits they
> > > want to be left unused by mmap. This eliminates the need for
> > > applications to know the page table hierarchy of the system to be able
> > > to reason which addresses mmap will be allowed to return.
> > 
> > But, why do they need to know today?  We have a limit for this don't we?
> 
> The limit is different for different architectures. On x86 the limit is
> 57 bits, and on arm64 it is 52 bits. So in the theoretical case that an
> application requires 10 bits free in a virtual address, the application
> would always work on arm64 regardless of the hint address, but on x86 if
> the hint address is greater than 48 bits then the application will not
> work.
> 
> The goal of this flag is to have consistent and tunable behavior of
> mmap() when it is desired to ensure that mmap() only returns addresses
> that use some number of bits.
> 
> > 
> > Also, these upper limits are how some archs use the upper bits that you
> > are trying to use.
> > 
> 
> It does not eliminate the existing behavior of the architectures to
> place this upper limits, it instead provides a way to have consistent
> behavior across all architectures.
> 
> > > 
> > > ---
> > > riscv made this feature of mmap returning addresses less than the hint
> > > address the default behavior. This was in contrast to the implementation
> > > of x86/arm64 that have a single boundary at the 5-level page table
> > > region. However this restriction proved too great -- the reduced
> > > address space when using a hint address was too small.
> > 
> > Yes, the hint is used to group things close together so it would
> > literally be random chance on if you have enough room or not (aslr and
> > all).
> > 
> > > 
> > > A patch for riscv [1] reverts the behavior that broke userspace. This
> > > series serves to make this feature available to all architectures.
> > 
> > I don't fully understand this statement, you say it broke userspace so
> > now you are porting it to everyone?  This reads as if you are braking
> > the userspace on all architectures :)
> 
> It was the default for mmap on riscv. The difference here is that it is now
> enabled by a flag instead. Instead of making the flag specific to riscv,
> I figured that other architectures might find it useful as well.
> 
> > 
> > If you fail to find room below, then your application fails as there is
> > no way to get the upper bits you need.  It would be better to fix this
> > in userspace - if your application is returned too high an address, then
> > free it and exit because it's going to fail anyways.
> > 
> 
> This flag is trying to define an API that is more robust than the
> current behavior on that x86 and arm64 which implicitly restricts mmap()
> addresses to 48 bits. A solution could be to just write in the docs that
> mmap() will always exhaust all addresses below the hint address before
> returning an address that is above the hint address. However a flag that
> defines this behavior seems more intuitive.
> 
> > > 
> > > I have only tested on riscv and x86.
> > 
> > This should be an RFC then.
> 
> Fair enough.
> 
> > 
> > > There is a tremendous amount of
> > > duplicated code in mmap so the implementations across architectures I
> > > believe should be mostly consistent. I added this feature to all
> > > architectures that implement either
> > > arch_get_mmap_end()/arch_get_mmap_base() or
> > > arch_get_unmapped_area_topdown()/arch_get_unmapped_area(). I also added
> > > it to the default behavior for arch_get_mmap_end()/arch_get_mmap_base().
> > 
> > Way too much duplicate code.  We should be figuring out how to make this
> > all work with the same code.
> > 
> > This is going to make the cloned code problem worse.
> 
> That would require standardizing every architecture with the generic
> mmap() framework that arm64 has developed. That is far outside the scope
> of this patch, but would be a great area to research for each of the
> architectures that do not use the generic framework.

Thinking about this again, I could drop support for all architectures
that do not implement arch_get_mmap_base()/arch_get_mmap_end().

> 
> - Charlie
> 
> > 
> > > 
> > > Link: https://lore.kernel.org/lkml/20240826-riscv_mmap-v1-2-cd8962afe47f@rivosinc.com/T/ [1]
> > > 
> > > To: Arnd Bergmann <arnd@arndb.de>
> > > To: Paul Walmsley <paul.walmsley@sifive.com>
> > > To: Palmer Dabbelt <palmer@dabbelt.com>
> > > To: Albert Ou <aou@eecs.berkeley.edu>
> > > To: Catalin Marinas <catalin.marinas@arm.com>
> > > To: Will Deacon <will@kernel.org>
> > > To: Michael Ellerman <mpe@ellerman.id.au>
> > > To: Nicholas Piggin <npiggin@gmail.com>
> > > To: Christophe Leroy <christophe.leroy@csgroup.eu>
> > > To: Naveen N Rao <naveen@kernel.org>
> > > To: Muchun Song <muchun.song@linux.dev>
> > > To: Andrew Morton <akpm@linux-foundation.org>
> > > To: Liam R. Howlett <Liam.Howlett@oracle.com>
> > > To: Vlastimil Babka <vbabka@suse.cz>
> > > To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > To: Thomas Gleixner <tglx@linutronix.de>
> > > To: Ingo Molnar <mingo@redhat.com>
> > > To: Borislav Petkov <bp@alien8.de>
> > > To: Dave Hansen <dave.hansen@linux.intel.com>
> > > To: x86@kernel.org
> > > To: H. Peter Anvin <hpa@zytor.com>
> > > To: Huacai Chen <chenhuacai@kernel.org>
> > > To: WANG Xuerui <kernel@xen0n.name>
> > > To: Russell King <linux@armlinux.org.uk>
> > > To: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
> > > To: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
> > > To: Helge Deller <deller@gmx.de>
> > > To: Alexander Gordeev <agordeev@linux.ibm.com>
> > > To: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> > > To: Heiko Carstens <hca@linux.ibm.com>
> > > To: Vasily Gorbik <gor@linux.ibm.com>
> > > To: Christian Borntraeger <borntraeger@linux.ibm.com>
> > > To: Sven Schnelle <svens@linux.ibm.com>
> > > To: Yoshinori Sato <ysato@users.sourceforge.jp>
> > > To: Rich Felker <dalias@libc.org>
> > > To: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
> > > To: David S. Miller <davem@davemloft.net>
> > > To: Andreas Larsson <andreas@gaisler.com>
> > > To: Shuah Khan <shuah@kernel.org>
> > > To: Alexandre Ghiti <alexghiti@rivosinc.com>
> > > Cc: linux-arch@vger.kernel.org
> > > Cc: linux-kernel@vger.kernel.org
> > > Cc: Palmer Dabbelt <palmer@rivosinc.com>
> > > Cc: linux-riscv@lists.infradead.org
> > > Cc: linux-arm-kernel@lists.infradead.org
> > > Cc: linuxppc-dev@lists.ozlabs.org
> > > Cc: linux-mm@kvack.org
> > > Cc: loongarch@lists.linux.dev
> > > Cc: linux-mips@vger.kernel.org
> > > Cc: linux-parisc@vger.kernel.org
> > > Cc: linux-s390@vger.kernel.org
> > > Cc: linux-sh@vger.kernel.org
> > > Cc: sparclinux@vger.kernel.org
> > > Cc: linux-kselftest@vger.kernel.org
> > > Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
> > > 
> > > ---
> > > Charlie Jenkins (16):
> > >       mm: Add MAP_BELOW_HINT
> > >       riscv: mm: Do not restrict mmap address based on hint
> > >       mm: Add flag and len param to arch_get_mmap_base()
> > >       mm: Add generic MAP_BELOW_HINT
> > >       riscv: mm: Support MAP_BELOW_HINT
> > >       arm64: mm: Support MAP_BELOW_HINT
> > >       powerpc: mm: Support MAP_BELOW_HINT
> > >       x86: mm: Support MAP_BELOW_HINT
> > >       loongarch: mm: Support MAP_BELOW_HINT
> > >       arm: mm: Support MAP_BELOW_HINT
> > >       mips: mm: Support MAP_BELOW_HINT
> > >       parisc: mm: Support MAP_BELOW_HINT
> > >       s390: mm: Support MAP_BELOW_HINT
> > >       sh: mm: Support MAP_BELOW_HINT
> > >       sparc: mm: Support MAP_BELOW_HINT
> > >       selftests/mm: Create MAP_BELOW_HINT test
> > > 
> > >  arch/arm/mm/mmap.c                           | 10 ++++++++
> > >  arch/arm64/include/asm/processor.h           | 34 ++++++++++++++++++++++----
> > >  arch/loongarch/mm/mmap.c                     | 11 +++++++++
> > >  arch/mips/mm/mmap.c                          |  9 +++++++
> > >  arch/parisc/include/uapi/asm/mman.h          |  1 +
> > >  arch/parisc/kernel/sys_parisc.c              |  9 +++++++
> > >  arch/powerpc/include/asm/task_size_64.h      | 36 +++++++++++++++++++++++-----
> > >  arch/riscv/include/asm/processor.h           | 32 -------------------------
> > >  arch/s390/mm/mmap.c                          | 10 ++++++++
> > >  arch/sh/mm/mmap.c                            | 10 ++++++++
> > >  arch/sparc/kernel/sys_sparc_64.c             |  8 +++++++
> > >  arch/x86/kernel/sys_x86_64.c                 | 25 ++++++++++++++++---
> > >  fs/hugetlbfs/inode.c                         |  2 +-
> > >  include/linux/sched/mm.h                     | 34 ++++++++++++++++++++++++--
> > >  include/uapi/asm-generic/mman-common.h       |  1 +
> > >  mm/mmap.c                                    |  2 +-
> > >  tools/arch/parisc/include/uapi/asm/mman.h    |  1 +
> > >  tools/include/uapi/asm-generic/mman-common.h |  1 +
> > >  tools/testing/selftests/mm/Makefile          |  1 +
> > >  tools/testing/selftests/mm/map_below_hint.c  | 29 ++++++++++++++++++++++
> > >  20 files changed, 216 insertions(+), 50 deletions(-)
> > > ---
> > > base-commit: 5be63fc19fcaa4c236b307420483578a56986a37
> > > change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55
> > > -- 
> > > - Charlie
> > >
Charlie Jenkins Aug. 29, 2024, 7:14 a.m. UTC | #7
On Wed, Aug 28, 2024 at 07:19:36PM +0100, Lorenzo Stoakes wrote:
> On Tue, Aug 27, 2024 at 10:49:06PM GMT, Charlie Jenkins wrote:
> > Some applications rely on placing data in free bits addresses allocated
> > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> > address returned by mmap to be less than the maximum address space,
> > unless the hint address is greater than this value.
> >
> > On arm64 this barrier is at 52 bits and on x86 it is at 56 bits. This
> > flag allows applications a way to specify exactly how many bits they
> > want to be left unused by mmap. This eliminates the need for
> > applications to know the page table hierarchy of the system to be able
> > to reason which addresses mmap will be allowed to return.
> >
> > ---
> > riscv made this feature of mmap returning addresses less than the hint
> > address the default behavior. This was in contrast to the implementation
> > of x86/arm64 that have a single boundary at the 5-level page table
> > region. However this restriction proved too great -- the reduced
> > address space when using a hint address was too small.
> >
> > A patch for riscv [1] reverts the behavior that broke userspace. This
> > series serves to make this feature available to all architectures.
> 
> I'm a little confused as to the justification for this - you broke RISC V by
> doing this, and have now reverted it, but now offer the same behaviour that
> broke RISC V to all other architectures?
> 
> I mean this is how this reads, so I might be being ungenerous here :) but would
> be good to clarify what the value-add is here.

Yeah I did not do a good job of explaining this! Having this be the
default behavior was broken, not that this feature in general was
broken.

> 
> I also wonder at use of a new MAP_ flag, they're a limited resource and we
> should only ever add them if we _really_ need to. This seems a bit niche and
> specific to be making such a big change for including touching a bunch of pretty
> sensitive arch-specific code.
>
> We have the ability to change how mmap() functions through 'personalities'
> though of course this would impact every mmap() call in the process.
> 
> Overall I'm really not hugely convinced by this, it feels like userland
> could find better ways of doing this (mostly you'd do a PROT_NONE mmap() to
> reserve a domain and mprotect() it on allocation or mmap() over it).
> 
> So I just struggle to see the purpose myself. BUT absolutely I may be
> missing context/others may have a view on the value of this. So happy to
> stand corrected.
> 
> >
> > I have only tested on riscv and x86. There is a tremendous amount of
> 
> Yeah, OK this is crazy, you can't really submit something as non-RFC that
> touches every single arch and not test it.
> 
> I also feel like we need more justification than 'this is a neat thing that
> we use in RISC V sometimes' conceptually for such a big change.

I will send out a new version that does a much better job at explaining!
This is not something that is done on riscv ever currently. This is
something that is done on other architectures such as x86 and arm64.
This flag is to make similar behavior (the ability to force mmap to
return addresses that have a constrained address space) available to all
architectures.

> 
> Also your test program is currently completely broken afaict (have
> commented on it directly). I also feel like your test program is a little
> rudimentary, and should test some edge cases close to the limit etc.
> 
> So I think this is a NACK until there is testing across the board and a little
> more justification.
> 
> Feel free to respin, but I think any future revisions should be RFC until
> we're absolutely sure on testing/justification.
> 
> I appreciate your efforts here so sorry to be negative, but just obviously
> want to make sure this is functional and trades off added complexity for
> value for the kernel and userland :)
> 

Totally understand thank you! After reviewing comments I have realized
that I made this much more complicated than it needs to be. This should
be able to be done without changing any architecture specific code. That
will mostly eliminate all of the complexity, but still has the downside
of consuming a MAP_ flag.

- Charlie

> Thanks!
> 
> > duplicated code in mmap so the implementations across architectures I
> > believe should be mostly consistent. I added this feature to all
> > architectures that implement either
> > arch_get_mmap_end()/arch_get_mmap_base() or
> > arch_get_unmapped_area_topdown()/arch_get_unmapped_area(). I also added
> > it to the default behavior for arch_get_mmap_end()/arch_get_mmap_base().
> >
> > Link: https://lore.kernel.org/lkml/20240826-riscv_mmap-v1-2-cd8962afe47f@rivosinc.com/T/ [1]
> >
> > To: Arnd Bergmann <arnd@arndb.de>
> > To: Paul Walmsley <paul.walmsley@sifive.com>
> > To: Palmer Dabbelt <palmer@dabbelt.com>
> > To: Albert Ou <aou@eecs.berkeley.edu>
> > To: Catalin Marinas <catalin.marinas@arm.com>
> > To: Will Deacon <will@kernel.org>
> > To: Michael Ellerman <mpe@ellerman.id.au>
> > To: Nicholas Piggin <npiggin@gmail.com>
> > To: Christophe Leroy <christophe.leroy@csgroup.eu>
> > To: Naveen N Rao <naveen@kernel.org>
> > To: Muchun Song <muchun.song@linux.dev>
> > To: Andrew Morton <akpm@linux-foundation.org>
> > To: Liam R. Howlett <Liam.Howlett@oracle.com>
> > To: Vlastimil Babka <vbabka@suse.cz>
> > To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > To: Thomas Gleixner <tglx@linutronix.de>
> > To: Ingo Molnar <mingo@redhat.com>
> > To: Borislav Petkov <bp@alien8.de>
> > To: Dave Hansen <dave.hansen@linux.intel.com>
> > To: x86@kernel.org
> > To: H. Peter Anvin <hpa@zytor.com>
> > To: Huacai Chen <chenhuacai@kernel.org>
> > To: WANG Xuerui <kernel@xen0n.name>
> > To: Russell King <linux@armlinux.org.uk>
> > To: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
> > To: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
> > To: Helge Deller <deller@gmx.de>
> > To: Alexander Gordeev <agordeev@linux.ibm.com>
> > To: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> > To: Heiko Carstens <hca@linux.ibm.com>
> > To: Vasily Gorbik <gor@linux.ibm.com>
> > To: Christian Borntraeger <borntraeger@linux.ibm.com>
> > To: Sven Schnelle <svens@linux.ibm.com>
> > To: Yoshinori Sato <ysato@users.sourceforge.jp>
> > To: Rich Felker <dalias@libc.org>
> > To: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
> > To: David S. Miller <davem@davemloft.net>
> > To: Andreas Larsson <andreas@gaisler.com>
> > To: Shuah Khan <shuah@kernel.org>
> > To: Alexandre Ghiti <alexghiti@rivosinc.com>
> > Cc: linux-arch@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Cc: Palmer Dabbelt <palmer@rivosinc.com>
> > Cc: linux-riscv@lists.infradead.org
> > Cc: linux-arm-kernel@lists.infradead.org
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Cc: linux-mm@kvack.org
> > Cc: loongarch@lists.linux.dev
> > Cc: linux-mips@vger.kernel.org
> > Cc: linux-parisc@vger.kernel.org
> > Cc: linux-s390@vger.kernel.org
> > Cc: linux-sh@vger.kernel.org
> > Cc: sparclinux@vger.kernel.org
> > Cc: linux-kselftest@vger.kernel.org
> > Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
> >
> > ---
> > Charlie Jenkins (16):
> >       mm: Add MAP_BELOW_HINT
> >       riscv: mm: Do not restrict mmap address based on hint
> >       mm: Add flag and len param to arch_get_mmap_base()
> >       mm: Add generic MAP_BELOW_HINT
> >       riscv: mm: Support MAP_BELOW_HINT
> >       arm64: mm: Support MAP_BELOW_HINT
> >       powerpc: mm: Support MAP_BELOW_HINT
> >       x86: mm: Support MAP_BELOW_HINT
> >       loongarch: mm: Support MAP_BELOW_HINT
> >       arm: mm: Support MAP_BELOW_HINT
> >       mips: mm: Support MAP_BELOW_HINT
> >       parisc: mm: Support MAP_BELOW_HINT
> >       s390: mm: Support MAP_BELOW_HINT
> >       sh: mm: Support MAP_BELOW_HINT
> >       sparc: mm: Support MAP_BELOW_HINT
> >       selftests/mm: Create MAP_BELOW_HINT test
> >
> >  arch/arm/mm/mmap.c                           | 10 ++++++++
> >  arch/arm64/include/asm/processor.h           | 34 ++++++++++++++++++++++----
> >  arch/loongarch/mm/mmap.c                     | 11 +++++++++
> >  arch/mips/mm/mmap.c                          |  9 +++++++
> >  arch/parisc/include/uapi/asm/mman.h          |  1 +
> >  arch/parisc/kernel/sys_parisc.c              |  9 +++++++
> >  arch/powerpc/include/asm/task_size_64.h      | 36 +++++++++++++++++++++++-----
> >  arch/riscv/include/asm/processor.h           | 32 -------------------------
> >  arch/s390/mm/mmap.c                          | 10 ++++++++
> >  arch/sh/mm/mmap.c                            | 10 ++++++++
> >  arch/sparc/kernel/sys_sparc_64.c             |  8 +++++++
> >  arch/x86/kernel/sys_x86_64.c                 | 25 ++++++++++++++++---
> >  fs/hugetlbfs/inode.c                         |  2 +-
> >  include/linux/sched/mm.h                     | 34 ++++++++++++++++++++++++--
> >  include/uapi/asm-generic/mman-common.h       |  1 +
> >  mm/mmap.c                                    |  2 +-
> >  tools/arch/parisc/include/uapi/asm/mman.h    |  1 +
> >  tools/include/uapi/asm-generic/mman-common.h |  1 +
> >  tools/testing/selftests/mm/Makefile          |  1 +
> >  tools/testing/selftests/mm/map_below_hint.c  | 29 ++++++++++++++++++++++
> >  20 files changed, 216 insertions(+), 50 deletions(-)
> > ---
> > base-commit: 5be63fc19fcaa4c236b307420483578a56986a37
> > change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55
> > --
> > - Charlie
> >
> >
> > _______________________________________________
> > linux-riscv mailing list
> > linux-riscv@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/linux-riscv
Dave Hansen Aug. 29, 2024, 4:54 p.m. UTC | #8
On 8/28/24 13:15, Charlie Jenkins wrote:
> A way to restrict mmap() to return LAM compliant addresses in an entire
> address space also doesn't have to be mutually exclusive with this flag.
> This flag allows for the greatest degree of control from applications.
> I don't believe there is additionally performance saving that could be
> achieved by having this be on a per address space basis.

I agree with you in general.  The MAP_BELOW_HINT _is_ the most flexible.
 But it's also rather complicated.

My _hope_ would be that a per-address-space property could share at
least some infrastructure with what x86/LAM and arm/TBI do to the
address space.  Basically put the restrictions in place for purely
software reasons instead of the mostly hardware reasons for LAM/TBI.

Lorenzo also raised some very valid points about a having a generic
address-restriction ABI.  I'm certainly not discounting those concerns.
It's not something that can be done lightly.
Liam R. Howlett Aug. 29, 2024, 7:36 p.m. UTC | #9
* Dave Hansen <dave.hansen@intel.com> [240829 12:54]:
> On 8/28/24 13:15, Charlie Jenkins wrote:
> > A way to restrict mmap() to return LAM compliant addresses in an entire
> > address space also doesn't have to be mutually exclusive with this flag.
> > This flag allows for the greatest degree of control from applications.
> > I don't believe there is additionally performance saving that could be
> > achieved by having this be on a per address space basis.
> 
> I agree with you in general.  The MAP_BELOW_HINT _is_ the most flexible.
>  But it's also rather complicated.

There is a (seldom used?) feature of mmap_min_addr, it seems like we
could have an mmap_max_addr.  Would something like that work for your
use case?  Perhaps it would be less intrusive to do something in this
way?  I haven't looked at it in depth and this affects all address
spaces as well (new allocations only).

There is a note on mmap_min_addr about applications that require the
lower addresses, would this mean we'll now have a note about upper
limits?

I really don't understand why you need this at all, to be honest.  If
you know the upper limit you could just MAP_FIXED map a huge guard at
the top of your address space then do whatever you want with those bits.

This will create an entry in the vma tree that no one else will be able
to use, and you can do this in any process you want, for as many bits as
you want.

> 
> My _hope_ would be that a per-address-space property could share at
> least some infrastructure with what x86/LAM and arm/TBI do to the
> address space.  Basically put the restrictions in place for purely
> software reasons instead of the mostly hardware reasons for LAM/TBI.
> 
> Lorenzo also raised some very valid points about a having a generic
> address-restriction ABI.  I'm certainly not discounting those concerns.
> It's not something that can be done lightly.

Yes, I am concerned about supporting this (probably forever) and dancing
around special code that may cause issues, perhaps on an arch that few
have for testing.  I already have so many qemu images for testing, some
of which no longer have valid install media - and basically none of them
use the same code in this area (or have special cases already).  I think
you understand what we are dealing with considering your comments in
your cover letter.

Thanks,
Liam
Charlie Jenkins Aug. 30, 2024, 1 a.m. UTC | #10
On Thu, Aug 29, 2024 at 09:54:08AM -0700, Dave Hansen wrote:
> On 8/28/24 13:15, Charlie Jenkins wrote:
> > A way to restrict mmap() to return LAM compliant addresses in an entire
> > address space also doesn't have to be mutually exclusive with this flag.
> > This flag allows for the greatest degree of control from applications.
> > I don't believe there is additionally performance saving that could be
> > achieved by having this be on a per address space basis.
> 
> I agree with you in general.  The MAP_BELOW_HINT _is_ the most flexible.
>  But it's also rather complicated.

Can you expand upon what you mean by it being complicated? Complicated
for the kernel or complicated for a user?

> 
> My _hope_ would be that a per-address-space property could share at
> least some infrastructure with what x86/LAM and arm/TBI do to the
> address space.  Basically put the restrictions in place for purely
> software reasons instead of the mostly hardware reasons for LAM/TBI.

That is a good point, perhaps that would be a way to hook this into LAM,
TBI, and any other architecture's specific address masking feature.

- Charlie

> 
> Lorenzo also raised some very valid points about a having a generic
> address-restriction ABI.  I'm certainly not discounting those concerns.
> It's not something that can be done lightly.
Charlie Jenkins Aug. 30, 2024, 1:10 a.m. UTC | #11
On Thu, Aug 29, 2024 at 03:36:43PM -0400, Liam R. Howlett wrote:
> * Dave Hansen <dave.hansen@intel.com> [240829 12:54]:
> > On 8/28/24 13:15, Charlie Jenkins wrote:
> > > A way to restrict mmap() to return LAM compliant addresses in an entire
> > > address space also doesn't have to be mutually exclusive with this flag.
> > > This flag allows for the greatest degree of control from applications.
> > > I don't believe there is additionally performance saving that could be
> > > achieved by having this be on a per address space basis.
> > 
> > I agree with you in general.  The MAP_BELOW_HINT _is_ the most flexible.
> >  But it's also rather complicated.
> 
> There is a (seldom used?) feature of mmap_min_addr, it seems like we
> could have an mmap_max_addr.  Would something like that work for your
> use case?  Perhaps it would be less intrusive to do something in this
> way?  I haven't looked at it in depth and this affects all address
> spaces as well (new allocations only).
> 
> There is a note on mmap_min_addr about applications that require the
> lower addresses, would this mean we'll now have a note about upper
> limits?

I don't think that's a viable solution because that would change the
mmap behavior for all applications running on the system, and wouldn't
allow individual applications to have different configurations.

> 
> I really don't understand why you need this at all, to be honest.  If
> you know the upper limit you could just MAP_FIXED map a huge guard at
> the top of your address space then do whatever you want with those bits.
> 
> This will create an entry in the vma tree that no one else will be able
> to use, and you can do this in any process you want, for as many bits as
> you want.

Oh that's an interesting idea. I am not sure how that could work in
practice though. The application would need to know it allocated all of
the addresses in the upper address space, how would it be able to do
that?

> 
> > 
> > My _hope_ would be that a per-address-space property could share at
> > least some infrastructure with what x86/LAM and arm/TBI do to the
> > address space.  Basically put the restrictions in place for purely
> > software reasons instead of the mostly hardware reasons for LAM/TBI.
> > 
> > Lorenzo also raised some very valid points about a having a generic
> > address-restriction ABI.  I'm certainly not discounting those concerns.
> > It's not something that can be done lightly.
> 
> Yes, I am concerned about supporting this (probably forever) and dancing
> around special code that may cause issues, perhaps on an arch that few
> have for testing.  I already have so many qemu images for testing, some
> of which no longer have valid install media - and basically none of them
> use the same code in this area (or have special cases already).  I think
> you understand what we are dealing with considering your comments in
> your cover letter.

It is definitely not something to be taken lightly. The version 2 of
this is completely generic so that should eliminate most of the concern
of "special code" on various architectures. Unless I am misunderstanding
something.

- Charlie

> 
> Thanks,
> Liam