Message ID | 20240827-patches-below_hint_mmap-v1-0-46ff2eb9022d@rivosinc.com (mailing list archive) |
---|---|
Headers | show |
Series | mm: Introduce MAP_BELOW_HINT | expand |
On Tue, Aug 27, 2024 at 10:49:06PM GMT, Charlie Jenkins wrote: > Some applications rely on placing data in free bits addresses allocated > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the > address returned by mmap to be less than the maximum address space, > unless the hint address is greater than this value. > > On arm64 this barrier is at 52 bits and on x86 it is at 56 bits. This > flag allows applications a way to specify exactly how many bits they > want to be left unused by mmap. This eliminates the need for > applications to know the page table hierarchy of the system to be able > to reason which addresses mmap will be allowed to return. > > --- > riscv made this feature of mmap returning addresses less than the hint > address the default behavior. This was in contrast to the implementation > of x86/arm64 that have a single boundary at the 5-level page table > region. However this restriction proved too great -- the reduced > address space when using a hint address was too small. > > A patch for riscv [1] reverts the behavior that broke userspace. This > series serves to make this feature available to all architectures. I'm a little confused as to the justification for this - you broke RISC V by doing this, and have now reverted it, but now offer the same behaviour that broke RISC V to all other architectures? I mean this is how this reads, so I might be being ungenerous here :) but would be good to clarify what the value-add is here. I also wonder at use of a new MAP_ flag, they're a limited resource and we should only ever add them if we _really_ need to. This seems a bit niche and specific to be making such a big change for including touching a bunch of pretty sensitive arch-specific code. We have the ability to change how mmap() functions through 'personalities' though of course this would impact every mmap() call in the process. Overall I'm really not hugely convinced by this, it feels like userland could find better ways of doing this (mostly you'd do a PROT_NONE mmap() to reserve a domain and mprotect() it on allocation or mmap() over it). So I just struggle to see the purpose myself. BUT absolutely I may be missing context/others may have a view on the value of this. So happy to stand corrected. > > I have only tested on riscv and x86. There is a tremendous amount of Yeah, OK this is crazy, you can't really submit something as non-RFC that touches every single arch and not test it. I also feel like we need more justification than 'this is a neat thing that we use in RISC V sometimes' conceptually for such a big change. Also your test program is currently completely broken afaict (have commented on it directly). I also feel like your test program is a little rudimentary, and should test some edge cases close to the limit etc. So I think this is a NACK until there is testing across the board and a little more justification. Feel free to respin, but I think any future revisions should be RFC until we're absolutely sure on testing/justification. I appreciate your efforts here so sorry to be negative, but just obviously want to make sure this is functional and trades off added complexity for value for the kernel and userland :) Thanks! > duplicated code in mmap so the implementations across architectures I > believe should be mostly consistent. I added this feature to all > architectures that implement either > arch_get_mmap_end()/arch_get_mmap_base() or > arch_get_unmapped_area_topdown()/arch_get_unmapped_area(). I also added > it to the default behavior for arch_get_mmap_end()/arch_get_mmap_base(). > > Link: https://lore.kernel.org/lkml/20240826-riscv_mmap-v1-2-cd8962afe47f@rivosinc.com/T/ [1] > > To: Arnd Bergmann <arnd@arndb.de> > To: Paul Walmsley <paul.walmsley@sifive.com> > To: Palmer Dabbelt <palmer@dabbelt.com> > To: Albert Ou <aou@eecs.berkeley.edu> > To: Catalin Marinas <catalin.marinas@arm.com> > To: Will Deacon <will@kernel.org> > To: Michael Ellerman <mpe@ellerman.id.au> > To: Nicholas Piggin <npiggin@gmail.com> > To: Christophe Leroy <christophe.leroy@csgroup.eu> > To: Naveen N Rao <naveen@kernel.org> > To: Muchun Song <muchun.song@linux.dev> > To: Andrew Morton <akpm@linux-foundation.org> > To: Liam R. Howlett <Liam.Howlett@oracle.com> > To: Vlastimil Babka <vbabka@suse.cz> > To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > To: Thomas Gleixner <tglx@linutronix.de> > To: Ingo Molnar <mingo@redhat.com> > To: Borislav Petkov <bp@alien8.de> > To: Dave Hansen <dave.hansen@linux.intel.com> > To: x86@kernel.org > To: H. Peter Anvin <hpa@zytor.com> > To: Huacai Chen <chenhuacai@kernel.org> > To: WANG Xuerui <kernel@xen0n.name> > To: Russell King <linux@armlinux.org.uk> > To: Thomas Bogendoerfer <tsbogend@alpha.franken.de> > To: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> > To: Helge Deller <deller@gmx.de> > To: Alexander Gordeev <agordeev@linux.ibm.com> > To: Gerald Schaefer <gerald.schaefer@linux.ibm.com> > To: Heiko Carstens <hca@linux.ibm.com> > To: Vasily Gorbik <gor@linux.ibm.com> > To: Christian Borntraeger <borntraeger@linux.ibm.com> > To: Sven Schnelle <svens@linux.ibm.com> > To: Yoshinori Sato <ysato@users.sourceforge.jp> > To: Rich Felker <dalias@libc.org> > To: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> > To: David S. Miller <davem@davemloft.net> > To: Andreas Larsson <andreas@gaisler.com> > To: Shuah Khan <shuah@kernel.org> > To: Alexandre Ghiti <alexghiti@rivosinc.com> > Cc: linux-arch@vger.kernel.org > Cc: linux-kernel@vger.kernel.org > Cc: Palmer Dabbelt <palmer@rivosinc.com> > Cc: linux-riscv@lists.infradead.org > Cc: linux-arm-kernel@lists.infradead.org > Cc: linuxppc-dev@lists.ozlabs.org > Cc: linux-mm@kvack.org > Cc: loongarch@lists.linux.dev > Cc: linux-mips@vger.kernel.org > Cc: linux-parisc@vger.kernel.org > Cc: linux-s390@vger.kernel.org > Cc: linux-sh@vger.kernel.org > Cc: sparclinux@vger.kernel.org > Cc: linux-kselftest@vger.kernel.org > Signed-off-by: Charlie Jenkins <charlie@rivosinc.com> > > --- > Charlie Jenkins (16): > mm: Add MAP_BELOW_HINT > riscv: mm: Do not restrict mmap address based on hint > mm: Add flag and len param to arch_get_mmap_base() > mm: Add generic MAP_BELOW_HINT > riscv: mm: Support MAP_BELOW_HINT > arm64: mm: Support MAP_BELOW_HINT > powerpc: mm: Support MAP_BELOW_HINT > x86: mm: Support MAP_BELOW_HINT > loongarch: mm: Support MAP_BELOW_HINT > arm: mm: Support MAP_BELOW_HINT > mips: mm: Support MAP_BELOW_HINT > parisc: mm: Support MAP_BELOW_HINT > s390: mm: Support MAP_BELOW_HINT > sh: mm: Support MAP_BELOW_HINT > sparc: mm: Support MAP_BELOW_HINT > selftests/mm: Create MAP_BELOW_HINT test > > arch/arm/mm/mmap.c | 10 ++++++++ > arch/arm64/include/asm/processor.h | 34 ++++++++++++++++++++++---- > arch/loongarch/mm/mmap.c | 11 +++++++++ > arch/mips/mm/mmap.c | 9 +++++++ > arch/parisc/include/uapi/asm/mman.h | 1 + > arch/parisc/kernel/sys_parisc.c | 9 +++++++ > arch/powerpc/include/asm/task_size_64.h | 36 +++++++++++++++++++++++----- > arch/riscv/include/asm/processor.h | 32 ------------------------- > arch/s390/mm/mmap.c | 10 ++++++++ > arch/sh/mm/mmap.c | 10 ++++++++ > arch/sparc/kernel/sys_sparc_64.c | 8 +++++++ > arch/x86/kernel/sys_x86_64.c | 25 ++++++++++++++++--- > fs/hugetlbfs/inode.c | 2 +- > include/linux/sched/mm.h | 34 ++++++++++++++++++++++++-- > include/uapi/asm-generic/mman-common.h | 1 + > mm/mmap.c | 2 +- > tools/arch/parisc/include/uapi/asm/mman.h | 1 + > tools/include/uapi/asm-generic/mman-common.h | 1 + > tools/testing/selftests/mm/Makefile | 1 + > tools/testing/selftests/mm/map_below_hint.c | 29 ++++++++++++++++++++++ > 20 files changed, 216 insertions(+), 50 deletions(-) > --- > base-commit: 5be63fc19fcaa4c236b307420483578a56986a37 > change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55 > -- > - Charlie > > > _______________________________________________ > linux-riscv mailing list > linux-riscv@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-riscv
On 8/27/24 22:49, Charlie Jenkins wrote: > Some applications rely on placing data in free bits addresses allocated > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the > address returned by mmap to be less than the maximum address space, > unless the hint address is greater than this value. Which applications are these, btw? Is this the same crowd as the folks who are using the address tagging features like X86_FEATURE_LAM? Even if they are different, I also wonder if a per-mmap() thing MAP_BELOW_HINT is really what we want. Or should the applications you're trying to service here use a similar mechanism to how LAM affects the *whole* address space as opposed to an individual mmap().
* Charlie Jenkins <charlie@rivosinc.com> [240828 01:49]: > Some applications rely on placing data in free bits addresses allocated > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the > address returned by mmap to be less than the maximum address space, > unless the hint address is greater than this value. Wait, what arch(s) allows for greater than the max? The passed hint should be where we start searching, but we go to the lower limit then start at the hint and search up (or vice-versa on the directions). I don't understand how unmapping works on a higher address; we would fail to free it on termination of the application. Also, there are archs that map outside of the VMAs, which are freed by freeing from the prev->vm_end to next->vm_start, so I don't understand what that looks like in this reality as well. > > On arm64 this barrier is at 52 bits and on x86 it is at 56 bits. This > flag allows applications a way to specify exactly how many bits they > want to be left unused by mmap. This eliminates the need for > applications to know the page table hierarchy of the system to be able > to reason which addresses mmap will be allowed to return. But, why do they need to know today? We have a limit for this don't we? Also, these upper limits are how some archs use the upper bits that you are trying to use. > > --- > riscv made this feature of mmap returning addresses less than the hint > address the default behavior. This was in contrast to the implementation > of x86/arm64 that have a single boundary at the 5-level page table > region. However this restriction proved too great -- the reduced > address space when using a hint address was too small. Yes, the hint is used to group things close together so it would literally be random chance on if you have enough room or not (aslr and all). > > A patch for riscv [1] reverts the behavior that broke userspace. This > series serves to make this feature available to all architectures. I don't fully understand this statement, you say it broke userspace so now you are porting it to everyone? This reads as if you are braking the userspace on all architectures :) If you fail to find room below, then your application fails as there is no way to get the upper bits you need. It would be better to fix this in userspace - if your application is returned too high an address, then free it and exit because it's going to fail anyways. > > I have only tested on riscv and x86. This should be an RFC then. > There is a tremendous amount of > duplicated code in mmap so the implementations across architectures I > believe should be mostly consistent. I added this feature to all > architectures that implement either > arch_get_mmap_end()/arch_get_mmap_base() or > arch_get_unmapped_area_topdown()/arch_get_unmapped_area(). I also added > it to the default behavior for arch_get_mmap_end()/arch_get_mmap_base(). Way too much duplicate code. We should be figuring out how to make this all work with the same code. This is going to make the cloned code problem worse. > > Link: https://lore.kernel.org/lkml/20240826-riscv_mmap-v1-2-cd8962afe47f@rivosinc.com/T/ [1] > > To: Arnd Bergmann <arnd@arndb.de> > To: Paul Walmsley <paul.walmsley@sifive.com> > To: Palmer Dabbelt <palmer@dabbelt.com> > To: Albert Ou <aou@eecs.berkeley.edu> > To: Catalin Marinas <catalin.marinas@arm.com> > To: Will Deacon <will@kernel.org> > To: Michael Ellerman <mpe@ellerman.id.au> > To: Nicholas Piggin <npiggin@gmail.com> > To: Christophe Leroy <christophe.leroy@csgroup.eu> > To: Naveen N Rao <naveen@kernel.org> > To: Muchun Song <muchun.song@linux.dev> > To: Andrew Morton <akpm@linux-foundation.org> > To: Liam R. Howlett <Liam.Howlett@oracle.com> > To: Vlastimil Babka <vbabka@suse.cz> > To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > To: Thomas Gleixner <tglx@linutronix.de> > To: Ingo Molnar <mingo@redhat.com> > To: Borislav Petkov <bp@alien8.de> > To: Dave Hansen <dave.hansen@linux.intel.com> > To: x86@kernel.org > To: H. Peter Anvin <hpa@zytor.com> > To: Huacai Chen <chenhuacai@kernel.org> > To: WANG Xuerui <kernel@xen0n.name> > To: Russell King <linux@armlinux.org.uk> > To: Thomas Bogendoerfer <tsbogend@alpha.franken.de> > To: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> > To: Helge Deller <deller@gmx.de> > To: Alexander Gordeev <agordeev@linux.ibm.com> > To: Gerald Schaefer <gerald.schaefer@linux.ibm.com> > To: Heiko Carstens <hca@linux.ibm.com> > To: Vasily Gorbik <gor@linux.ibm.com> > To: Christian Borntraeger <borntraeger@linux.ibm.com> > To: Sven Schnelle <svens@linux.ibm.com> > To: Yoshinori Sato <ysato@users.sourceforge.jp> > To: Rich Felker <dalias@libc.org> > To: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> > To: David S. Miller <davem@davemloft.net> > To: Andreas Larsson <andreas@gaisler.com> > To: Shuah Khan <shuah@kernel.org> > To: Alexandre Ghiti <alexghiti@rivosinc.com> > Cc: linux-arch@vger.kernel.org > Cc: linux-kernel@vger.kernel.org > Cc: Palmer Dabbelt <palmer@rivosinc.com> > Cc: linux-riscv@lists.infradead.org > Cc: linux-arm-kernel@lists.infradead.org > Cc: linuxppc-dev@lists.ozlabs.org > Cc: linux-mm@kvack.org > Cc: loongarch@lists.linux.dev > Cc: linux-mips@vger.kernel.org > Cc: linux-parisc@vger.kernel.org > Cc: linux-s390@vger.kernel.org > Cc: linux-sh@vger.kernel.org > Cc: sparclinux@vger.kernel.org > Cc: linux-kselftest@vger.kernel.org > Signed-off-by: Charlie Jenkins <charlie@rivosinc.com> > > --- > Charlie Jenkins (16): > mm: Add MAP_BELOW_HINT > riscv: mm: Do not restrict mmap address based on hint > mm: Add flag and len param to arch_get_mmap_base() > mm: Add generic MAP_BELOW_HINT > riscv: mm: Support MAP_BELOW_HINT > arm64: mm: Support MAP_BELOW_HINT > powerpc: mm: Support MAP_BELOW_HINT > x86: mm: Support MAP_BELOW_HINT > loongarch: mm: Support MAP_BELOW_HINT > arm: mm: Support MAP_BELOW_HINT > mips: mm: Support MAP_BELOW_HINT > parisc: mm: Support MAP_BELOW_HINT > s390: mm: Support MAP_BELOW_HINT > sh: mm: Support MAP_BELOW_HINT > sparc: mm: Support MAP_BELOW_HINT > selftests/mm: Create MAP_BELOW_HINT test > > arch/arm/mm/mmap.c | 10 ++++++++ > arch/arm64/include/asm/processor.h | 34 ++++++++++++++++++++++---- > arch/loongarch/mm/mmap.c | 11 +++++++++ > arch/mips/mm/mmap.c | 9 +++++++ > arch/parisc/include/uapi/asm/mman.h | 1 + > arch/parisc/kernel/sys_parisc.c | 9 +++++++ > arch/powerpc/include/asm/task_size_64.h | 36 +++++++++++++++++++++++----- > arch/riscv/include/asm/processor.h | 32 ------------------------- > arch/s390/mm/mmap.c | 10 ++++++++ > arch/sh/mm/mmap.c | 10 ++++++++ > arch/sparc/kernel/sys_sparc_64.c | 8 +++++++ > arch/x86/kernel/sys_x86_64.c | 25 ++++++++++++++++--- > fs/hugetlbfs/inode.c | 2 +- > include/linux/sched/mm.h | 34 ++++++++++++++++++++++++-- > include/uapi/asm-generic/mman-common.h | 1 + > mm/mmap.c | 2 +- > tools/arch/parisc/include/uapi/asm/mman.h | 1 + > tools/include/uapi/asm-generic/mman-common.h | 1 + > tools/testing/selftests/mm/Makefile | 1 + > tools/testing/selftests/mm/map_below_hint.c | 29 ++++++++++++++++++++++ > 20 files changed, 216 insertions(+), 50 deletions(-) > --- > base-commit: 5be63fc19fcaa4c236b307420483578a56986a37 > change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55 > -- > - Charlie >
On Wed, Aug 28, 2024 at 11:29:56AM -0700, Dave Hansen wrote: > On 8/27/24 22:49, Charlie Jenkins wrote: > > Some applications rely on placing data in free bits addresses allocated > > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the > > address returned by mmap to be less than the maximum address space, > > unless the hint address is greater than this value. > > Which applications are these, btw? Java and Go require this feature. These applications store flags that represent the type of data a pointer holds in the upper bits of the pointer itself. > > Is this the same crowd as the folks who are using the address tagging > features like X86_FEATURE_LAM? Yes it is. LAM helps to mask the bits out on x86, and this feature could be used to ensure that mmap() doesn't return an address with bits that would be masked out. I chose not to tie this feature to x86 LAM which only has masking boundaries at 57 and 48 bits to allow it to be independent of architecture specific address masking. > > Even if they are different, I also wonder if a per-mmap() thing > MAP_BELOW_HINT is really what we want. Or should the applications > you're trying to service here use a similar mechanism to how LAM affects > the *whole* address space as opposed to an individual mmap(). LAM is required to be enabled for entire address spaces because the hardware needs to be configured to mask out the bits. It is not possible to influence the granularity of LAM in the current implementation. However mmap() does not require any of this hardware configuration so it is possible to have finer granularity. A way to restrict mmap() to return LAM compliant addresses in an entire address space also doesn't have to be mutually exclusive with this flag. This flag allows for the greatest degree of control from applications. I don't believe there is additionally performance saving that could be achieved by having this be on a per address space basis. Link: https://cdrdv2.intel.com/v1/dl/getContent/671368 [1] - Charlie
On Wed, Aug 28, 2024 at 02:31:42PM -0400, Liam R. Howlett wrote: > * Charlie Jenkins <charlie@rivosinc.com> [240828 01:49]: > > Some applications rely on placing data in free bits addresses allocated > > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the > > address returned by mmap to be less than the maximum address space, > > unless the hint address is greater than this value. > > Wait, what arch(s) allows for greater than the max? The passed hint > should be where we start searching, but we go to the lower limit then > start at the hint and search up (or vice-versa on the directions). > I worded this awkwardly. On arm64 there is a page-table boundary at 48 bits and at 52 bits. On x86 the boundaries are at 48 bits and 57 bits. The max value mmap is able to return on arm64 is 48 bits if the hint address uses 48 bits or less, even if the architecture supports 5-level paging and thus addresses can be 52 bits. Applications can opt-in to using up to 52-bits in an address by using a hint address greater than 48 bits. x86 has the same behavior but with 57 bits instead of 52. This reason this exists is because some applications arbitrarily replace bits in virtual addresses with data with an assumption that the address will not be using any of the bits above bit 48 in the virtual address. As hardware with larger address spaces was released, x86 decided to build safety guards into the kernel to allow the applications that made these assumptions to continue to work on this different hardware. This causes all application that use a hint address to silently be restricted to 48-bit addresses. The goal of this flag is to have a way for applications to explicitly request how many bits they want mmap to use. > I don't understand how unmapping works on a higher address; we would > fail to free it on termination of the application. > > Also, there are archs that map outside of the VMAs, which are freed by > freeing from the prev->vm_end to next->vm_start, so I don't understand > what that looks like in this reality as well. > > > > > On arm64 this barrier is at 52 bits and on x86 it is at 56 bits. This > > flag allows applications a way to specify exactly how many bits they > > want to be left unused by mmap. This eliminates the need for > > applications to know the page table hierarchy of the system to be able > > to reason which addresses mmap will be allowed to return. > > But, why do they need to know today? We have a limit for this don't we? The limit is different for different architectures. On x86 the limit is 57 bits, and on arm64 it is 52 bits. So in the theoretical case that an application requires 10 bits free in a virtual address, the application would always work on arm64 regardless of the hint address, but on x86 if the hint address is greater than 48 bits then the application will not work. The goal of this flag is to have consistent and tunable behavior of mmap() when it is desired to ensure that mmap() only returns addresses that use some number of bits. > > Also, these upper limits are how some archs use the upper bits that you > are trying to use. > It does not eliminate the existing behavior of the architectures to place this upper limits, it instead provides a way to have consistent behavior across all architectures. > > > > --- > > riscv made this feature of mmap returning addresses less than the hint > > address the default behavior. This was in contrast to the implementation > > of x86/arm64 that have a single boundary at the 5-level page table > > region. However this restriction proved too great -- the reduced > > address space when using a hint address was too small. > > Yes, the hint is used to group things close together so it would > literally be random chance on if you have enough room or not (aslr and > all). > > > > > A patch for riscv [1] reverts the behavior that broke userspace. This > > series serves to make this feature available to all architectures. > > I don't fully understand this statement, you say it broke userspace so > now you are porting it to everyone? This reads as if you are braking > the userspace on all architectures :) It was the default for mmap on riscv. The difference here is that it is now enabled by a flag instead. Instead of making the flag specific to riscv, I figured that other architectures might find it useful as well. > > If you fail to find room below, then your application fails as there is > no way to get the upper bits you need. It would be better to fix this > in userspace - if your application is returned too high an address, then > free it and exit because it's going to fail anyways. > This flag is trying to define an API that is more robust than the current behavior on that x86 and arm64 which implicitly restricts mmap() addresses to 48 bits. A solution could be to just write in the docs that mmap() will always exhaust all addresses below the hint address before returning an address that is above the hint address. However a flag that defines this behavior seems more intuitive. > > > > I have only tested on riscv and x86. > > This should be an RFC then. Fair enough. > > > There is a tremendous amount of > > duplicated code in mmap so the implementations across architectures I > > believe should be mostly consistent. I added this feature to all > > architectures that implement either > > arch_get_mmap_end()/arch_get_mmap_base() or > > arch_get_unmapped_area_topdown()/arch_get_unmapped_area(). I also added > > it to the default behavior for arch_get_mmap_end()/arch_get_mmap_base(). > > Way too much duplicate code. We should be figuring out how to make this > all work with the same code. > > This is going to make the cloned code problem worse. That would require standardizing every architecture with the generic mmap() framework that arm64 has developed. That is far outside the scope of this patch, but would be a great area to research for each of the architectures that do not use the generic framework. - Charlie > > > > > Link: https://lore.kernel.org/lkml/20240826-riscv_mmap-v1-2-cd8962afe47f@rivosinc.com/T/ [1] > > > > To: Arnd Bergmann <arnd@arndb.de> > > To: Paul Walmsley <paul.walmsley@sifive.com> > > To: Palmer Dabbelt <palmer@dabbelt.com> > > To: Albert Ou <aou@eecs.berkeley.edu> > > To: Catalin Marinas <catalin.marinas@arm.com> > > To: Will Deacon <will@kernel.org> > > To: Michael Ellerman <mpe@ellerman.id.au> > > To: Nicholas Piggin <npiggin@gmail.com> > > To: Christophe Leroy <christophe.leroy@csgroup.eu> > > To: Naveen N Rao <naveen@kernel.org> > > To: Muchun Song <muchun.song@linux.dev> > > To: Andrew Morton <akpm@linux-foundation.org> > > To: Liam R. Howlett <Liam.Howlett@oracle.com> > > To: Vlastimil Babka <vbabka@suse.cz> > > To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > To: Thomas Gleixner <tglx@linutronix.de> > > To: Ingo Molnar <mingo@redhat.com> > > To: Borislav Petkov <bp@alien8.de> > > To: Dave Hansen <dave.hansen@linux.intel.com> > > To: x86@kernel.org > > To: H. Peter Anvin <hpa@zytor.com> > > To: Huacai Chen <chenhuacai@kernel.org> > > To: WANG Xuerui <kernel@xen0n.name> > > To: Russell King <linux@armlinux.org.uk> > > To: Thomas Bogendoerfer <tsbogend@alpha.franken.de> > > To: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> > > To: Helge Deller <deller@gmx.de> > > To: Alexander Gordeev <agordeev@linux.ibm.com> > > To: Gerald Schaefer <gerald.schaefer@linux.ibm.com> > > To: Heiko Carstens <hca@linux.ibm.com> > > To: Vasily Gorbik <gor@linux.ibm.com> > > To: Christian Borntraeger <borntraeger@linux.ibm.com> > > To: Sven Schnelle <svens@linux.ibm.com> > > To: Yoshinori Sato <ysato@users.sourceforge.jp> > > To: Rich Felker <dalias@libc.org> > > To: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> > > To: David S. Miller <davem@davemloft.net> > > To: Andreas Larsson <andreas@gaisler.com> > > To: Shuah Khan <shuah@kernel.org> > > To: Alexandre Ghiti <alexghiti@rivosinc.com> > > Cc: linux-arch@vger.kernel.org > > Cc: linux-kernel@vger.kernel.org > > Cc: Palmer Dabbelt <palmer@rivosinc.com> > > Cc: linux-riscv@lists.infradead.org > > Cc: linux-arm-kernel@lists.infradead.org > > Cc: linuxppc-dev@lists.ozlabs.org > > Cc: linux-mm@kvack.org > > Cc: loongarch@lists.linux.dev > > Cc: linux-mips@vger.kernel.org > > Cc: linux-parisc@vger.kernel.org > > Cc: linux-s390@vger.kernel.org > > Cc: linux-sh@vger.kernel.org > > Cc: sparclinux@vger.kernel.org > > Cc: linux-kselftest@vger.kernel.org > > Signed-off-by: Charlie Jenkins <charlie@rivosinc.com> > > > > --- > > Charlie Jenkins (16): > > mm: Add MAP_BELOW_HINT > > riscv: mm: Do not restrict mmap address based on hint > > mm: Add flag and len param to arch_get_mmap_base() > > mm: Add generic MAP_BELOW_HINT > > riscv: mm: Support MAP_BELOW_HINT > > arm64: mm: Support MAP_BELOW_HINT > > powerpc: mm: Support MAP_BELOW_HINT > > x86: mm: Support MAP_BELOW_HINT > > loongarch: mm: Support MAP_BELOW_HINT > > arm: mm: Support MAP_BELOW_HINT > > mips: mm: Support MAP_BELOW_HINT > > parisc: mm: Support MAP_BELOW_HINT > > s390: mm: Support MAP_BELOW_HINT > > sh: mm: Support MAP_BELOW_HINT > > sparc: mm: Support MAP_BELOW_HINT > > selftests/mm: Create MAP_BELOW_HINT test > > > > arch/arm/mm/mmap.c | 10 ++++++++ > > arch/arm64/include/asm/processor.h | 34 ++++++++++++++++++++++---- > > arch/loongarch/mm/mmap.c | 11 +++++++++ > > arch/mips/mm/mmap.c | 9 +++++++ > > arch/parisc/include/uapi/asm/mman.h | 1 + > > arch/parisc/kernel/sys_parisc.c | 9 +++++++ > > arch/powerpc/include/asm/task_size_64.h | 36 +++++++++++++++++++++++----- > > arch/riscv/include/asm/processor.h | 32 ------------------------- > > arch/s390/mm/mmap.c | 10 ++++++++ > > arch/sh/mm/mmap.c | 10 ++++++++ > > arch/sparc/kernel/sys_sparc_64.c | 8 +++++++ > > arch/x86/kernel/sys_x86_64.c | 25 ++++++++++++++++--- > > fs/hugetlbfs/inode.c | 2 +- > > include/linux/sched/mm.h | 34 ++++++++++++++++++++++++-- > > include/uapi/asm-generic/mman-common.h | 1 + > > mm/mmap.c | 2 +- > > tools/arch/parisc/include/uapi/asm/mman.h | 1 + > > tools/include/uapi/asm-generic/mman-common.h | 1 + > > tools/testing/selftests/mm/Makefile | 1 + > > tools/testing/selftests/mm/map_below_hint.c | 29 ++++++++++++++++++++++ > > 20 files changed, 216 insertions(+), 50 deletions(-) > > --- > > base-commit: 5be63fc19fcaa4c236b307420483578a56986a37 > > change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55 > > -- > > - Charlie > >
On Wed, Aug 28, 2024 at 01:59:18PM -0700, Charlie Jenkins wrote: > On Wed, Aug 28, 2024 at 02:31:42PM -0400, Liam R. Howlett wrote: > > * Charlie Jenkins <charlie@rivosinc.com> [240828 01:49]: > > > Some applications rely on placing data in free bits addresses allocated > > > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the > > > address returned by mmap to be less than the maximum address space, > > > unless the hint address is greater than this value. > > > > Wait, what arch(s) allows for greater than the max? The passed hint > > should be where we start searching, but we go to the lower limit then > > start at the hint and search up (or vice-versa on the directions). > > > > I worded this awkwardly. On arm64 there is a page-table boundary at 48 > bits and at 52 bits. On x86 the boundaries are at 48 bits and 57 bits. > The max value mmap is able to return on arm64 is 48 bits if the hint > address uses 48 bits or less, even if the architecture supports 5-level > paging and thus addresses can be 52 bits. Applications can opt-in to > using up to 52-bits in an address by using a hint address greater than > 48 bits. x86 has the same behavior but with 57 bits instead of 52. > > This reason this exists is because some applications arbitrarily replace > bits in virtual addresses with data with an assumption that the address > will not be using any of the bits above bit 48 in the virtual address. > As hardware with larger address spaces was released, x86 decided to > build safety guards into the kernel to allow the applications that made > these assumptions to continue to work on this different hardware. > > This causes all application that use a hint address to silently be > restricted to 48-bit addresses. The goal of this flag is to have a way > for applications to explicitly request how many bits they want mmap to > use. > > > I don't understand how unmapping works on a higher address; we would > > fail to free it on termination of the application. > > > > Also, there are archs that map outside of the VMAs, which are freed by > > freeing from the prev->vm_end to next->vm_start, so I don't understand > > what that looks like in this reality as well. > > > > > > > > On arm64 this barrier is at 52 bits and on x86 it is at 56 bits. This > > > flag allows applications a way to specify exactly how many bits they > > > want to be left unused by mmap. This eliminates the need for > > > applications to know the page table hierarchy of the system to be able > > > to reason which addresses mmap will be allowed to return. > > > > But, why do they need to know today? We have a limit for this don't we? > > The limit is different for different architectures. On x86 the limit is > 57 bits, and on arm64 it is 52 bits. So in the theoretical case that an > application requires 10 bits free in a virtual address, the application > would always work on arm64 regardless of the hint address, but on x86 if > the hint address is greater than 48 bits then the application will not > work. > > The goal of this flag is to have consistent and tunable behavior of > mmap() when it is desired to ensure that mmap() only returns addresses > that use some number of bits. > > > > > Also, these upper limits are how some archs use the upper bits that you > > are trying to use. > > > > It does not eliminate the existing behavior of the architectures to > place this upper limits, it instead provides a way to have consistent > behavior across all architectures. > > > > > > > --- > > > riscv made this feature of mmap returning addresses less than the hint > > > address the default behavior. This was in contrast to the implementation > > > of x86/arm64 that have a single boundary at the 5-level page table > > > region. However this restriction proved too great -- the reduced > > > address space when using a hint address was too small. > > > > Yes, the hint is used to group things close together so it would > > literally be random chance on if you have enough room or not (aslr and > > all). > > > > > > > > A patch for riscv [1] reverts the behavior that broke userspace. This > > > series serves to make this feature available to all architectures. > > > > I don't fully understand this statement, you say it broke userspace so > > now you are porting it to everyone? This reads as if you are braking > > the userspace on all architectures :) > > It was the default for mmap on riscv. The difference here is that it is now > enabled by a flag instead. Instead of making the flag specific to riscv, > I figured that other architectures might find it useful as well. > > > > > If you fail to find room below, then your application fails as there is > > no way to get the upper bits you need. It would be better to fix this > > in userspace - if your application is returned too high an address, then > > free it and exit because it's going to fail anyways. > > > > This flag is trying to define an API that is more robust than the > current behavior on that x86 and arm64 which implicitly restricts mmap() > addresses to 48 bits. A solution could be to just write in the docs that > mmap() will always exhaust all addresses below the hint address before > returning an address that is above the hint address. However a flag that > defines this behavior seems more intuitive. > > > > > > > I have only tested on riscv and x86. > > > > This should be an RFC then. > > Fair enough. > > > > > > There is a tremendous amount of > > > duplicated code in mmap so the implementations across architectures I > > > believe should be mostly consistent. I added this feature to all > > > architectures that implement either > > > arch_get_mmap_end()/arch_get_mmap_base() or > > > arch_get_unmapped_area_topdown()/arch_get_unmapped_area(). I also added > > > it to the default behavior for arch_get_mmap_end()/arch_get_mmap_base(). > > > > Way too much duplicate code. We should be figuring out how to make this > > all work with the same code. > > > > This is going to make the cloned code problem worse. > > That would require standardizing every architecture with the generic > mmap() framework that arm64 has developed. That is far outside the scope > of this patch, but would be a great area to research for each of the > architectures that do not use the generic framework. Thinking about this again, I could drop support for all architectures that do not implement arch_get_mmap_base()/arch_get_mmap_end(). > > - Charlie > > > > > > > > > Link: https://lore.kernel.org/lkml/20240826-riscv_mmap-v1-2-cd8962afe47f@rivosinc.com/T/ [1] > > > > > > To: Arnd Bergmann <arnd@arndb.de> > > > To: Paul Walmsley <paul.walmsley@sifive.com> > > > To: Palmer Dabbelt <palmer@dabbelt.com> > > > To: Albert Ou <aou@eecs.berkeley.edu> > > > To: Catalin Marinas <catalin.marinas@arm.com> > > > To: Will Deacon <will@kernel.org> > > > To: Michael Ellerman <mpe@ellerman.id.au> > > > To: Nicholas Piggin <npiggin@gmail.com> > > > To: Christophe Leroy <christophe.leroy@csgroup.eu> > > > To: Naveen N Rao <naveen@kernel.org> > > > To: Muchun Song <muchun.song@linux.dev> > > > To: Andrew Morton <akpm@linux-foundation.org> > > > To: Liam R. Howlett <Liam.Howlett@oracle.com> > > > To: Vlastimil Babka <vbabka@suse.cz> > > > To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > > To: Thomas Gleixner <tglx@linutronix.de> > > > To: Ingo Molnar <mingo@redhat.com> > > > To: Borislav Petkov <bp@alien8.de> > > > To: Dave Hansen <dave.hansen@linux.intel.com> > > > To: x86@kernel.org > > > To: H. Peter Anvin <hpa@zytor.com> > > > To: Huacai Chen <chenhuacai@kernel.org> > > > To: WANG Xuerui <kernel@xen0n.name> > > > To: Russell King <linux@armlinux.org.uk> > > > To: Thomas Bogendoerfer <tsbogend@alpha.franken.de> > > > To: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> > > > To: Helge Deller <deller@gmx.de> > > > To: Alexander Gordeev <agordeev@linux.ibm.com> > > > To: Gerald Schaefer <gerald.schaefer@linux.ibm.com> > > > To: Heiko Carstens <hca@linux.ibm.com> > > > To: Vasily Gorbik <gor@linux.ibm.com> > > > To: Christian Borntraeger <borntraeger@linux.ibm.com> > > > To: Sven Schnelle <svens@linux.ibm.com> > > > To: Yoshinori Sato <ysato@users.sourceforge.jp> > > > To: Rich Felker <dalias@libc.org> > > > To: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> > > > To: David S. Miller <davem@davemloft.net> > > > To: Andreas Larsson <andreas@gaisler.com> > > > To: Shuah Khan <shuah@kernel.org> > > > To: Alexandre Ghiti <alexghiti@rivosinc.com> > > > Cc: linux-arch@vger.kernel.org > > > Cc: linux-kernel@vger.kernel.org > > > Cc: Palmer Dabbelt <palmer@rivosinc.com> > > > Cc: linux-riscv@lists.infradead.org > > > Cc: linux-arm-kernel@lists.infradead.org > > > Cc: linuxppc-dev@lists.ozlabs.org > > > Cc: linux-mm@kvack.org > > > Cc: loongarch@lists.linux.dev > > > Cc: linux-mips@vger.kernel.org > > > Cc: linux-parisc@vger.kernel.org > > > Cc: linux-s390@vger.kernel.org > > > Cc: linux-sh@vger.kernel.org > > > Cc: sparclinux@vger.kernel.org > > > Cc: linux-kselftest@vger.kernel.org > > > Signed-off-by: Charlie Jenkins <charlie@rivosinc.com> > > > > > > --- > > > Charlie Jenkins (16): > > > mm: Add MAP_BELOW_HINT > > > riscv: mm: Do not restrict mmap address based on hint > > > mm: Add flag and len param to arch_get_mmap_base() > > > mm: Add generic MAP_BELOW_HINT > > > riscv: mm: Support MAP_BELOW_HINT > > > arm64: mm: Support MAP_BELOW_HINT > > > powerpc: mm: Support MAP_BELOW_HINT > > > x86: mm: Support MAP_BELOW_HINT > > > loongarch: mm: Support MAP_BELOW_HINT > > > arm: mm: Support MAP_BELOW_HINT > > > mips: mm: Support MAP_BELOW_HINT > > > parisc: mm: Support MAP_BELOW_HINT > > > s390: mm: Support MAP_BELOW_HINT > > > sh: mm: Support MAP_BELOW_HINT > > > sparc: mm: Support MAP_BELOW_HINT > > > selftests/mm: Create MAP_BELOW_HINT test > > > > > > arch/arm/mm/mmap.c | 10 ++++++++ > > > arch/arm64/include/asm/processor.h | 34 ++++++++++++++++++++++---- > > > arch/loongarch/mm/mmap.c | 11 +++++++++ > > > arch/mips/mm/mmap.c | 9 +++++++ > > > arch/parisc/include/uapi/asm/mman.h | 1 + > > > arch/parisc/kernel/sys_parisc.c | 9 +++++++ > > > arch/powerpc/include/asm/task_size_64.h | 36 +++++++++++++++++++++++----- > > > arch/riscv/include/asm/processor.h | 32 ------------------------- > > > arch/s390/mm/mmap.c | 10 ++++++++ > > > arch/sh/mm/mmap.c | 10 ++++++++ > > > arch/sparc/kernel/sys_sparc_64.c | 8 +++++++ > > > arch/x86/kernel/sys_x86_64.c | 25 ++++++++++++++++--- > > > fs/hugetlbfs/inode.c | 2 +- > > > include/linux/sched/mm.h | 34 ++++++++++++++++++++++++-- > > > include/uapi/asm-generic/mman-common.h | 1 + > > > mm/mmap.c | 2 +- > > > tools/arch/parisc/include/uapi/asm/mman.h | 1 + > > > tools/include/uapi/asm-generic/mman-common.h | 1 + > > > tools/testing/selftests/mm/Makefile | 1 + > > > tools/testing/selftests/mm/map_below_hint.c | 29 ++++++++++++++++++++++ > > > 20 files changed, 216 insertions(+), 50 deletions(-) > > > --- > > > base-commit: 5be63fc19fcaa4c236b307420483578a56986a37 > > > change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55 > > > -- > > > - Charlie > > >
On Wed, Aug 28, 2024 at 07:19:36PM +0100, Lorenzo Stoakes wrote: > On Tue, Aug 27, 2024 at 10:49:06PM GMT, Charlie Jenkins wrote: > > Some applications rely on placing data in free bits addresses allocated > > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the > > address returned by mmap to be less than the maximum address space, > > unless the hint address is greater than this value. > > > > On arm64 this barrier is at 52 bits and on x86 it is at 56 bits. This > > flag allows applications a way to specify exactly how many bits they > > want to be left unused by mmap. This eliminates the need for > > applications to know the page table hierarchy of the system to be able > > to reason which addresses mmap will be allowed to return. > > > > --- > > riscv made this feature of mmap returning addresses less than the hint > > address the default behavior. This was in contrast to the implementation > > of x86/arm64 that have a single boundary at the 5-level page table > > region. However this restriction proved too great -- the reduced > > address space when using a hint address was too small. > > > > A patch for riscv [1] reverts the behavior that broke userspace. This > > series serves to make this feature available to all architectures. > > I'm a little confused as to the justification for this - you broke RISC V by > doing this, and have now reverted it, but now offer the same behaviour that > broke RISC V to all other architectures? > > I mean this is how this reads, so I might be being ungenerous here :) but would > be good to clarify what the value-add is here. Yeah I did not do a good job of explaining this! Having this be the default behavior was broken, not that this feature in general was broken. > > I also wonder at use of a new MAP_ flag, they're a limited resource and we > should only ever add them if we _really_ need to. This seems a bit niche and > specific to be making such a big change for including touching a bunch of pretty > sensitive arch-specific code. > > We have the ability to change how mmap() functions through 'personalities' > though of course this would impact every mmap() call in the process. > > Overall I'm really not hugely convinced by this, it feels like userland > could find better ways of doing this (mostly you'd do a PROT_NONE mmap() to > reserve a domain and mprotect() it on allocation or mmap() over it). > > So I just struggle to see the purpose myself. BUT absolutely I may be > missing context/others may have a view on the value of this. So happy to > stand corrected. > > > > > I have only tested on riscv and x86. There is a tremendous amount of > > Yeah, OK this is crazy, you can't really submit something as non-RFC that > touches every single arch and not test it. > > I also feel like we need more justification than 'this is a neat thing that > we use in RISC V sometimes' conceptually for such a big change. I will send out a new version that does a much better job at explaining! This is not something that is done on riscv ever currently. This is something that is done on other architectures such as x86 and arm64. This flag is to make similar behavior (the ability to force mmap to return addresses that have a constrained address space) available to all architectures. > > Also your test program is currently completely broken afaict (have > commented on it directly). I also feel like your test program is a little > rudimentary, and should test some edge cases close to the limit etc. > > So I think this is a NACK until there is testing across the board and a little > more justification. > > Feel free to respin, but I think any future revisions should be RFC until > we're absolutely sure on testing/justification. > > I appreciate your efforts here so sorry to be negative, but just obviously > want to make sure this is functional and trades off added complexity for > value for the kernel and userland :) > Totally understand thank you! After reviewing comments I have realized that I made this much more complicated than it needs to be. This should be able to be done without changing any architecture specific code. That will mostly eliminate all of the complexity, but still has the downside of consuming a MAP_ flag. - Charlie > Thanks! > > > duplicated code in mmap so the implementations across architectures I > > believe should be mostly consistent. I added this feature to all > > architectures that implement either > > arch_get_mmap_end()/arch_get_mmap_base() or > > arch_get_unmapped_area_topdown()/arch_get_unmapped_area(). I also added > > it to the default behavior for arch_get_mmap_end()/arch_get_mmap_base(). > > > > Link: https://lore.kernel.org/lkml/20240826-riscv_mmap-v1-2-cd8962afe47f@rivosinc.com/T/ [1] > > > > To: Arnd Bergmann <arnd@arndb.de> > > To: Paul Walmsley <paul.walmsley@sifive.com> > > To: Palmer Dabbelt <palmer@dabbelt.com> > > To: Albert Ou <aou@eecs.berkeley.edu> > > To: Catalin Marinas <catalin.marinas@arm.com> > > To: Will Deacon <will@kernel.org> > > To: Michael Ellerman <mpe@ellerman.id.au> > > To: Nicholas Piggin <npiggin@gmail.com> > > To: Christophe Leroy <christophe.leroy@csgroup.eu> > > To: Naveen N Rao <naveen@kernel.org> > > To: Muchun Song <muchun.song@linux.dev> > > To: Andrew Morton <akpm@linux-foundation.org> > > To: Liam R. Howlett <Liam.Howlett@oracle.com> > > To: Vlastimil Babka <vbabka@suse.cz> > > To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > To: Thomas Gleixner <tglx@linutronix.de> > > To: Ingo Molnar <mingo@redhat.com> > > To: Borislav Petkov <bp@alien8.de> > > To: Dave Hansen <dave.hansen@linux.intel.com> > > To: x86@kernel.org > > To: H. Peter Anvin <hpa@zytor.com> > > To: Huacai Chen <chenhuacai@kernel.org> > > To: WANG Xuerui <kernel@xen0n.name> > > To: Russell King <linux@armlinux.org.uk> > > To: Thomas Bogendoerfer <tsbogend@alpha.franken.de> > > To: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> > > To: Helge Deller <deller@gmx.de> > > To: Alexander Gordeev <agordeev@linux.ibm.com> > > To: Gerald Schaefer <gerald.schaefer@linux.ibm.com> > > To: Heiko Carstens <hca@linux.ibm.com> > > To: Vasily Gorbik <gor@linux.ibm.com> > > To: Christian Borntraeger <borntraeger@linux.ibm.com> > > To: Sven Schnelle <svens@linux.ibm.com> > > To: Yoshinori Sato <ysato@users.sourceforge.jp> > > To: Rich Felker <dalias@libc.org> > > To: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> > > To: David S. Miller <davem@davemloft.net> > > To: Andreas Larsson <andreas@gaisler.com> > > To: Shuah Khan <shuah@kernel.org> > > To: Alexandre Ghiti <alexghiti@rivosinc.com> > > Cc: linux-arch@vger.kernel.org > > Cc: linux-kernel@vger.kernel.org > > Cc: Palmer Dabbelt <palmer@rivosinc.com> > > Cc: linux-riscv@lists.infradead.org > > Cc: linux-arm-kernel@lists.infradead.org > > Cc: linuxppc-dev@lists.ozlabs.org > > Cc: linux-mm@kvack.org > > Cc: loongarch@lists.linux.dev > > Cc: linux-mips@vger.kernel.org > > Cc: linux-parisc@vger.kernel.org > > Cc: linux-s390@vger.kernel.org > > Cc: linux-sh@vger.kernel.org > > Cc: sparclinux@vger.kernel.org > > Cc: linux-kselftest@vger.kernel.org > > Signed-off-by: Charlie Jenkins <charlie@rivosinc.com> > > > > --- > > Charlie Jenkins (16): > > mm: Add MAP_BELOW_HINT > > riscv: mm: Do not restrict mmap address based on hint > > mm: Add flag and len param to arch_get_mmap_base() > > mm: Add generic MAP_BELOW_HINT > > riscv: mm: Support MAP_BELOW_HINT > > arm64: mm: Support MAP_BELOW_HINT > > powerpc: mm: Support MAP_BELOW_HINT > > x86: mm: Support MAP_BELOW_HINT > > loongarch: mm: Support MAP_BELOW_HINT > > arm: mm: Support MAP_BELOW_HINT > > mips: mm: Support MAP_BELOW_HINT > > parisc: mm: Support MAP_BELOW_HINT > > s390: mm: Support MAP_BELOW_HINT > > sh: mm: Support MAP_BELOW_HINT > > sparc: mm: Support MAP_BELOW_HINT > > selftests/mm: Create MAP_BELOW_HINT test > > > > arch/arm/mm/mmap.c | 10 ++++++++ > > arch/arm64/include/asm/processor.h | 34 ++++++++++++++++++++++---- > > arch/loongarch/mm/mmap.c | 11 +++++++++ > > arch/mips/mm/mmap.c | 9 +++++++ > > arch/parisc/include/uapi/asm/mman.h | 1 + > > arch/parisc/kernel/sys_parisc.c | 9 +++++++ > > arch/powerpc/include/asm/task_size_64.h | 36 +++++++++++++++++++++++----- > > arch/riscv/include/asm/processor.h | 32 ------------------------- > > arch/s390/mm/mmap.c | 10 ++++++++ > > arch/sh/mm/mmap.c | 10 ++++++++ > > arch/sparc/kernel/sys_sparc_64.c | 8 +++++++ > > arch/x86/kernel/sys_x86_64.c | 25 ++++++++++++++++--- > > fs/hugetlbfs/inode.c | 2 +- > > include/linux/sched/mm.h | 34 ++++++++++++++++++++++++-- > > include/uapi/asm-generic/mman-common.h | 1 + > > mm/mmap.c | 2 +- > > tools/arch/parisc/include/uapi/asm/mman.h | 1 + > > tools/include/uapi/asm-generic/mman-common.h | 1 + > > tools/testing/selftests/mm/Makefile | 1 + > > tools/testing/selftests/mm/map_below_hint.c | 29 ++++++++++++++++++++++ > > 20 files changed, 216 insertions(+), 50 deletions(-) > > --- > > base-commit: 5be63fc19fcaa4c236b307420483578a56986a37 > > change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55 > > -- > > - Charlie > > > > > > _______________________________________________ > > linux-riscv mailing list > > linux-riscv@lists.infradead.org > > http://lists.infradead.org/mailman/listinfo/linux-riscv
On 8/28/24 13:15, Charlie Jenkins wrote: > A way to restrict mmap() to return LAM compliant addresses in an entire > address space also doesn't have to be mutually exclusive with this flag. > This flag allows for the greatest degree of control from applications. > I don't believe there is additionally performance saving that could be > achieved by having this be on a per address space basis. I agree with you in general. The MAP_BELOW_HINT _is_ the most flexible. But it's also rather complicated. My _hope_ would be that a per-address-space property could share at least some infrastructure with what x86/LAM and arm/TBI do to the address space. Basically put the restrictions in place for purely software reasons instead of the mostly hardware reasons for LAM/TBI. Lorenzo also raised some very valid points about a having a generic address-restriction ABI. I'm certainly not discounting those concerns. It's not something that can be done lightly.
* Dave Hansen <dave.hansen@intel.com> [240829 12:54]: > On 8/28/24 13:15, Charlie Jenkins wrote: > > A way to restrict mmap() to return LAM compliant addresses in an entire > > address space also doesn't have to be mutually exclusive with this flag. > > This flag allows for the greatest degree of control from applications. > > I don't believe there is additionally performance saving that could be > > achieved by having this be on a per address space basis. > > I agree with you in general. The MAP_BELOW_HINT _is_ the most flexible. > But it's also rather complicated. There is a (seldom used?) feature of mmap_min_addr, it seems like we could have an mmap_max_addr. Would something like that work for your use case? Perhaps it would be less intrusive to do something in this way? I haven't looked at it in depth and this affects all address spaces as well (new allocations only). There is a note on mmap_min_addr about applications that require the lower addresses, would this mean we'll now have a note about upper limits? I really don't understand why you need this at all, to be honest. If you know the upper limit you could just MAP_FIXED map a huge guard at the top of your address space then do whatever you want with those bits. This will create an entry in the vma tree that no one else will be able to use, and you can do this in any process you want, for as many bits as you want. > > My _hope_ would be that a per-address-space property could share at > least some infrastructure with what x86/LAM and arm/TBI do to the > address space. Basically put the restrictions in place for purely > software reasons instead of the mostly hardware reasons for LAM/TBI. > > Lorenzo also raised some very valid points about a having a generic > address-restriction ABI. I'm certainly not discounting those concerns. > It's not something that can be done lightly. Yes, I am concerned about supporting this (probably forever) and dancing around special code that may cause issues, perhaps on an arch that few have for testing. I already have so many qemu images for testing, some of which no longer have valid install media - and basically none of them use the same code in this area (or have special cases already). I think you understand what we are dealing with considering your comments in your cover letter. Thanks, Liam
On Thu, Aug 29, 2024 at 09:54:08AM -0700, Dave Hansen wrote: > On 8/28/24 13:15, Charlie Jenkins wrote: > > A way to restrict mmap() to return LAM compliant addresses in an entire > > address space also doesn't have to be mutually exclusive with this flag. > > This flag allows for the greatest degree of control from applications. > > I don't believe there is additionally performance saving that could be > > achieved by having this be on a per address space basis. > > I agree with you in general. The MAP_BELOW_HINT _is_ the most flexible. > But it's also rather complicated. Can you expand upon what you mean by it being complicated? Complicated for the kernel or complicated for a user? > > My _hope_ would be that a per-address-space property could share at > least some infrastructure with what x86/LAM and arm/TBI do to the > address space. Basically put the restrictions in place for purely > software reasons instead of the mostly hardware reasons for LAM/TBI. That is a good point, perhaps that would be a way to hook this into LAM, TBI, and any other architecture's specific address masking feature. - Charlie > > Lorenzo also raised some very valid points about a having a generic > address-restriction ABI. I'm certainly not discounting those concerns. > It's not something that can be done lightly.
On Thu, Aug 29, 2024 at 03:36:43PM -0400, Liam R. Howlett wrote: > * Dave Hansen <dave.hansen@intel.com> [240829 12:54]: > > On 8/28/24 13:15, Charlie Jenkins wrote: > > > A way to restrict mmap() to return LAM compliant addresses in an entire > > > address space also doesn't have to be mutually exclusive with this flag. > > > This flag allows for the greatest degree of control from applications. > > > I don't believe there is additionally performance saving that could be > > > achieved by having this be on a per address space basis. > > > > I agree with you in general. The MAP_BELOW_HINT _is_ the most flexible. > > But it's also rather complicated. > > There is a (seldom used?) feature of mmap_min_addr, it seems like we > could have an mmap_max_addr. Would something like that work for your > use case? Perhaps it would be less intrusive to do something in this > way? I haven't looked at it in depth and this affects all address > spaces as well (new allocations only). > > There is a note on mmap_min_addr about applications that require the > lower addresses, would this mean we'll now have a note about upper > limits? I don't think that's a viable solution because that would change the mmap behavior for all applications running on the system, and wouldn't allow individual applications to have different configurations. > > I really don't understand why you need this at all, to be honest. If > you know the upper limit you could just MAP_FIXED map a huge guard at > the top of your address space then do whatever you want with those bits. > > This will create an entry in the vma tree that no one else will be able > to use, and you can do this in any process you want, for as many bits as > you want. Oh that's an interesting idea. I am not sure how that could work in practice though. The application would need to know it allocated all of the addresses in the upper address space, how would it be able to do that? > > > > > My _hope_ would be that a per-address-space property could share at > > least some infrastructure with what x86/LAM and arm/TBI do to the > > address space. Basically put the restrictions in place for purely > > software reasons instead of the mostly hardware reasons for LAM/TBI. > > > > Lorenzo also raised some very valid points about a having a generic > > address-restriction ABI. I'm certainly not discounting those concerns. > > It's not something that can be done lightly. > > Yes, I am concerned about supporting this (probably forever) and dancing > around special code that may cause issues, perhaps on an arch that few > have for testing. I already have so many qemu images for testing, some > of which no longer have valid install media - and basically none of them > use the same code in this area (or have special cases already). I think > you understand what we are dealing with considering your comments in > your cover letter. It is definitely not something to be taken lightly. The version 2 of this is completely generic so that should eliminate most of the concern of "special code" on various architectures. Unless I am misunderstanding something. - Charlie > > Thanks, > Liam
Some applications rely on placing data in free bits addresses allocated by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the address returned by mmap to be less than the maximum address space, unless the hint address is greater than this value. On arm64 this barrier is at 52 bits and on x86 it is at 56 bits. This flag allows applications a way to specify exactly how many bits they want to be left unused by mmap. This eliminates the need for applications to know the page table hierarchy of the system to be able to reason which addresses mmap will be allowed to return. --- riscv made this feature of mmap returning addresses less than the hint address the default behavior. This was in contrast to the implementation of x86/arm64 that have a single boundary at the 5-level page table region. However this restriction proved too great -- the reduced address space when using a hint address was too small. A patch for riscv [1] reverts the behavior that broke userspace. This series serves to make this feature available to all architectures. I have only tested on riscv and x86. There is a tremendous amount of duplicated code in mmap so the implementations across architectures I believe should be mostly consistent. I added this feature to all architectures that implement either arch_get_mmap_end()/arch_get_mmap_base() or arch_get_unmapped_area_topdown()/arch_get_unmapped_area(). I also added it to the default behavior for arch_get_mmap_end()/arch_get_mmap_base(). Link: https://lore.kernel.org/lkml/20240826-riscv_mmap-v1-2-cd8962afe47f@rivosinc.com/T/ [1] To: Arnd Bergmann <arnd@arndb.de> To: Paul Walmsley <paul.walmsley@sifive.com> To: Palmer Dabbelt <palmer@dabbelt.com> To: Albert Ou <aou@eecs.berkeley.edu> To: Catalin Marinas <catalin.marinas@arm.com> To: Will Deacon <will@kernel.org> To: Michael Ellerman <mpe@ellerman.id.au> To: Nicholas Piggin <npiggin@gmail.com> To: Christophe Leroy <christophe.leroy@csgroup.eu> To: Naveen N Rao <naveen@kernel.org> To: Muchun Song <muchun.song@linux.dev> To: Andrew Morton <akpm@linux-foundation.org> To: Liam R. Howlett <Liam.Howlett@oracle.com> To: Vlastimil Babka <vbabka@suse.cz> To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> To: Thomas Gleixner <tglx@linutronix.de> To: Ingo Molnar <mingo@redhat.com> To: Borislav Petkov <bp@alien8.de> To: Dave Hansen <dave.hansen@linux.intel.com> To: x86@kernel.org To: H. Peter Anvin <hpa@zytor.com> To: Huacai Chen <chenhuacai@kernel.org> To: WANG Xuerui <kernel@xen0n.name> To: Russell King <linux@armlinux.org.uk> To: Thomas Bogendoerfer <tsbogend@alpha.franken.de> To: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> To: Helge Deller <deller@gmx.de> To: Alexander Gordeev <agordeev@linux.ibm.com> To: Gerald Schaefer <gerald.schaefer@linux.ibm.com> To: Heiko Carstens <hca@linux.ibm.com> To: Vasily Gorbik <gor@linux.ibm.com> To: Christian Borntraeger <borntraeger@linux.ibm.com> To: Sven Schnelle <svens@linux.ibm.com> To: Yoshinori Sato <ysato@users.sourceforge.jp> To: Rich Felker <dalias@libc.org> To: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> To: David S. Miller <davem@davemloft.net> To: Andreas Larsson <andreas@gaisler.com> To: Shuah Khan <shuah@kernel.org> To: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: linux-arch@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: Palmer Dabbelt <palmer@rivosinc.com> Cc: linux-riscv@lists.infradead.org Cc: linux-arm-kernel@lists.infradead.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-mm@kvack.org Cc: loongarch@lists.linux.dev Cc: linux-mips@vger.kernel.org Cc: linux-parisc@vger.kernel.org Cc: linux-s390@vger.kernel.org Cc: linux-sh@vger.kernel.org Cc: sparclinux@vger.kernel.org Cc: linux-kselftest@vger.kernel.org Signed-off-by: Charlie Jenkins <charlie@rivosinc.com> --- Charlie Jenkins (16): mm: Add MAP_BELOW_HINT riscv: mm: Do not restrict mmap address based on hint mm: Add flag and len param to arch_get_mmap_base() mm: Add generic MAP_BELOW_HINT riscv: mm: Support MAP_BELOW_HINT arm64: mm: Support MAP_BELOW_HINT powerpc: mm: Support MAP_BELOW_HINT x86: mm: Support MAP_BELOW_HINT loongarch: mm: Support MAP_BELOW_HINT arm: mm: Support MAP_BELOW_HINT mips: mm: Support MAP_BELOW_HINT parisc: mm: Support MAP_BELOW_HINT s390: mm: Support MAP_BELOW_HINT sh: mm: Support MAP_BELOW_HINT sparc: mm: Support MAP_BELOW_HINT selftests/mm: Create MAP_BELOW_HINT test arch/arm/mm/mmap.c | 10 ++++++++ arch/arm64/include/asm/processor.h | 34 ++++++++++++++++++++++---- arch/loongarch/mm/mmap.c | 11 +++++++++ arch/mips/mm/mmap.c | 9 +++++++ arch/parisc/include/uapi/asm/mman.h | 1 + arch/parisc/kernel/sys_parisc.c | 9 +++++++ arch/powerpc/include/asm/task_size_64.h | 36 +++++++++++++++++++++++----- arch/riscv/include/asm/processor.h | 32 ------------------------- arch/s390/mm/mmap.c | 10 ++++++++ arch/sh/mm/mmap.c | 10 ++++++++ arch/sparc/kernel/sys_sparc_64.c | 8 +++++++ arch/x86/kernel/sys_x86_64.c | 25 ++++++++++++++++--- fs/hugetlbfs/inode.c | 2 +- include/linux/sched/mm.h | 34 ++++++++++++++++++++++++-- include/uapi/asm-generic/mman-common.h | 1 + mm/mmap.c | 2 +- tools/arch/parisc/include/uapi/asm/mman.h | 1 + tools/include/uapi/asm-generic/mman-common.h | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/map_below_hint.c | 29 ++++++++++++++++++++++ 20 files changed, 216 insertions(+), 50 deletions(-) --- base-commit: 5be63fc19fcaa4c236b307420483578a56986a37 change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55