mbox series

[v10,0/4] RISC-V: mm: Make SV48 the default address space

Message ID 20230809232218.849726-1-charlie@rivosinc.com (mailing list archive)
Headers show
Series RISC-V: mm: Make SV48 the default address space | expand

Message

Charlie Jenkins Aug. 9, 2023, 11:22 p.m. UTC
Make sv48 the default address space for mmap as some applications
currently depend on this assumption. Users can now select a
desired address space using a non-zero hint address to mmap. Previously,
requesting the default address space from mmap by passing zero as the hint
address would result in using the largest address space possible. Some
applications depend on empty bits in the virtual address space, like Go and
Java, so this patch provides more flexibility for application developers.

-Charlie

---
v10:
- Move pgtable.h defintions into a no __ASSEMBLY__ region to resolve compilation
  conflicts (pointed out by Conor)
- Will now compile with allmodconfig

v9:
- Raise the mmap_end default to STACK_TOP_MAX to allow the address space to grow
  beyond the default of sv48 on sv57 machines as suggested by Alexandre
- Some of the mmap macros had unnecessary conditionals that I have removed

v8:
- Fix RV32 and the RV32 compat mode of RV64 (suggested by Conor)
- Extract out addr and base from the mmap macros (suggested by Alexandre)

v7:
- Changing RLIMIT_STACK inside of an executing program does not trigger
  arch_pick_mmap_layout(), so rewrite tests to change RLIMIT_STACK from a
  script before executing tests. RLIMIT_STACK of infinity forces bottomup
  mmap allocation.
- Make arch_get_mmap_base macro more readible by extracting out the rnd
  calculation.
- Use MMAP_MIN_VA_BITS in TASK_UNMAPPED_BASE to support case when mmap
  attempts to allocate address smaller than DEFAULT_MAP_WINDOW.
- Fix incorrect wording in documentation.

v6:
- Rebase onto the correct base

v5:
- Minor wording change in documentation
- Change some parenthesis in arch_get_mmap_ macros
- Added case for addr==0 in arch_get_mmap_ because without this, programs would
  crash if RLIMIT_STACK was modified before executing the program. This was
  tested using the libhugetlbfs tests. 

v4:
- Split testcases/document patch into test cases, in-code documentation, and
  formal documentation patches
- Modified the mmap_base macro to be more legible and better represent memory
  layout
- Fixed documentation to better reflect the implmentation
- Renamed DEFAULT_VA_BITS to MMAP_VA_BITS
- Added additional test case for rlimit changes
---


Charlie Jenkins (4):
  RISC-V: mm: Restrict address space for sv39,sv48,sv57
  RISC-V: mm: Add tests for RISC-V mm
  RISC-V: mm: Update pgtable comment documentation
  RISC-V: mm: Document mmap changes

 Documentation/riscv/vm-layout.rst             | 22 +++++++
 arch/riscv/include/asm/elf.h                  |  2 +-
 arch/riscv/include/asm/pgtable.h              | 33 ++++++++--
 arch/riscv/include/asm/processor.h            | 52 +++++++++++++--
 tools/testing/selftests/riscv/Makefile        |  2 +-
 tools/testing/selftests/riscv/mm/.gitignore   |  2 +
 tools/testing/selftests/riscv/mm/Makefile     | 15 +++++
 .../riscv/mm/testcases/mmap_bottomup.c        | 35 ++++++++++
 .../riscv/mm/testcases/mmap_default.c         | 35 ++++++++++
 .../selftests/riscv/mm/testcases/mmap_test.h  | 64 +++++++++++++++++++
 .../selftests/riscv/mm/testcases/run_mmap.sh  | 12 ++++
 11 files changed, 261 insertions(+), 13 deletions(-)
 create mode 100644 tools/testing/selftests/riscv/mm/.gitignore
 create mode 100644 tools/testing/selftests/riscv/mm/Makefile
 create mode 100644 tools/testing/selftests/riscv/mm/testcases/mmap_bottomup.c
 create mode 100644 tools/testing/selftests/riscv/mm/testcases/mmap_default.c
 create mode 100644 tools/testing/selftests/riscv/mm/testcases/mmap_test.h
 create mode 100755 tools/testing/selftests/riscv/mm/testcases/run_mmap.sh

Comments

patchwork-bot+linux-riscv@kernel.org Aug. 30, 2023, 1:20 p.m. UTC | #1
Hello:

This series was applied to riscv/linux.git (for-next)
by Palmer Dabbelt <palmer@rivosinc.com>:

On Wed,  9 Aug 2023 16:22:00 -0700 you wrote:
> Make sv48 the default address space for mmap as some applications
> currently depend on this assumption. Users can now select a
> desired address space using a non-zero hint address to mmap. Previously,
> requesting the default address space from mmap by passing zero as the hint
> address would result in using the largest address space possible. Some
> applications depend on empty bits in the virtual address space, like Go and
> Java, so this patch provides more flexibility for application developers.
> 
> [...]

Here is the summary with links:
  - [v10,1/4] RISC-V: mm: Restrict address space for sv39,sv48,sv57
    https://git.kernel.org/riscv/c/add2cc6b6515
  - [v10,2/4] RISC-V: mm: Add tests for RISC-V mm
    https://git.kernel.org/riscv/c/4d0c04eac0c2
  - [v10,3/4] RISC-V: mm: Update pgtable comment documentation
    https://git.kernel.org/riscv/c/26eee2bfc477
  - [v10,4/4] RISC-V: mm: Document mmap changes
    https://git.kernel.org/riscv/c/7998abe69d3c

You are awesome, thank you!
Yangyu Chen Jan. 13, 2024, 5:26 p.m. UTC | #2
Hi, Charlie

Although this patchset has been merged I still have some questions about
this patchset. Because it breaks regular mmap if address >= 38 bits on
sv48 / sv57 capable systems like qemu. For example, If a userspace program
wants to mmap an anonymous page to addr=(1<<45) on an sv48 capable system,
it will fail and kernel will mmaped to another sv39 address since it does
not meet the requirement to use sv48 as you wrote:

>	else if ((((_addr) >= VA_USER_SV48)) && (VA_BITS >= VA_BITS_SV48)) \
>		mmap_end = VA_USER_SV48;			\
>	else							\
>		mmap_end = VA_USER_SV39;			\

Then, How can a userspace program create a mmap with a hint if the address
>= (1<<38) after your patch without MAP_FIXED? The only way to do this is
to pass a hint >= (1<<47) on mmap syscall then kernel will return a random
address in sv48 address space but the hint address gets lost. I think this
violate the principle of mmap syscall as kernel should take the hint and
attempt to create the mapping there.

I don't think patching in this way is right. However, if we only revert
this patch, some programs relying on mmap to return address with effective
bits <= 48 will still be an issue and it might expand to other ISAs if
they implement larger virtual address space like RISC-V sv57. A better way
to solve this might be adding a MAP_48BIT flag to mmap like MAP_32BIT has
been introduced for decades.

Thanks,
Yangyu Chen
Charlie Jenkins Jan. 20, 2024, 1:34 a.m. UTC | #3
On Sun, Jan 14, 2024 at 01:26:57AM +0800, Yangyu Chen wrote:
> Hi, Charlie
> 
> Although this patchset has been merged I still have some questions about
> this patchset. Because it breaks regular mmap if address >= 38 bits on
> sv48 / sv57 capable systems like qemu. For example, If a userspace program
> wants to mmap an anonymous page to addr=(1<<45) on an sv48 capable system,
> it will fail and kernel will mmaped to another sv39 address since it does

Thank you for raising this concern. To make sure I am understanding
correctly, you are passing a hint address of (1<<45) and expecting mmap
to return 1<<45 and if it returns a different address you are describing
mmap as failing? If you want an address that is in the sv48 space you
can pass in an address that is greater than 1<<47.

> not meet the requirement to use sv48 as you wrote:
> 
> >	else if ((((_addr) >= VA_USER_SV48)) && (VA_BITS >= VA_BITS_SV48)) \
> >		mmap_end = VA_USER_SV48;			\
> >	else							\
> >		mmap_end = VA_USER_SV39;			\
> 
> Then, How can a userspace program create a mmap with a hint if the address
> >= (1<<38) after your patch without MAP_FIXED? The only way to do this is
> to pass a hint >= (1<<47) on mmap syscall then kernel will return a random
> address in sv48 address space but the hint address gets lost. I think this

In order to force mmap to return the address provided you must use
MAP_FIXED. Otherwise, the address is a "hint" and has no guarantees. The
hint address on riscv is used to mean "don't give me an address that
uses more bits than this". This behavior is not unique to riscv, arm64
and powerpc use a similar scheme. In arch/arm64/include/asm/processor.h
there is the following code:

#define arch_get_mmap_base(addr, base) ((addr > DEFAULT_MAP_WINDOW) ? \
					base + TASK_SIZE - DEFAULT_MAP_WINDOW :\
					base)

arm64/powerpc are only concerned with a single boundary so the code is simpler.

> violate the principle of mmap syscall as kernel should take the hint and
> attempt to create the mapping there.

Although the man page for mmap does say "on Linux, the kernel will pick
a nearby page boundary" it is still a hint address so there is no strict
requirement (and the precedent has already been set by arm64/powerpc).

> 
> I don't think patching in this way is right. However, if we only revert
> this patch, some programs relying on mmap to return address with effective
> bits <= 48 will still be an issue and it might expand to other ISAs if
> they implement larger virtual address space like RISC-V sv57. A better way
> to solve this might be adding a MAP_48BIT flag to mmap like MAP_32BIT has
> been introduced for decades.
> 
> Thanks,
> Yangyu Chen
> 

- Charlie
Yangyu Chen Jan. 20, 2024, 6:13 a.m. UTC | #4
Thanks for your reply.

On 1/20/24 09:34, Charlie Jenkins wrote:
> On Sun, Jan 14, 2024 at 01:26:57AM +0800, Yangyu Chen wrote:
>> Hi, Charlie
>>
>> Although this patchset has been merged I still have some questions about
>> this patchset. Because it breaks regular mmap if address >= 38 bits on
>> sv48 / sv57 capable systems like qemu. For example, If a userspace program
>> wants to mmap an anonymous page to addr=(1<<45) on an sv48 capable system,
>> it will fail and kernel will mmaped to another sv39 address since it does
> 
> Thank you for raising this concern. To make sure I am understanding
> correctly, you are passing a hint address of (1<<45) and expecting mmap
> to return 1<<45 and if it returns a different address you are describing
> mmap as failing? If you want an address that is in the sv48 space you
> can pass in an address that is greater than 1<<47.
> 
>> not meet the requirement to use sv48 as you wrote:
>>
>>> 	else if ((((_addr) >= VA_USER_SV48)) && (VA_BITS >= VA_BITS_SV48)) \
>>> 		mmap_end = VA_USER_SV48;			\
>>> 	else							\
>>> 		mmap_end = VA_USER_SV39;			\
>>
>> Then, How can a userspace program create a mmap with a hint if the address
>>> = (1<<38) after your patch without MAP_FIXED? The only way to do this is
>> to pass a hint >= (1<<47) on mmap syscall then kernel will return a random
>> address in sv48 address space but the hint address gets lost. I think this
> 
> In order to force mmap to return the address provided you must use
> MAP_FIXED. Otherwise, the address is a "hint" and has no guarantees. The
> hint address on riscv is used to mean "don't give me an address that
> uses more bits than this". This behavior is not unique to riscv, arm64
> and powerpc use a similar scheme. In arch/arm64/include/asm/processor.h
> there is the following code:
> 
> #define arch_get_mmap_base(addr, base) ((addr > DEFAULT_MAP_WINDOW) ? \
> 					base + TASK_SIZE - DEFAULT_MAP_WINDOW :\
> 					base)
> 
> arm64/powerpc are only concerned with a single boundary so the code is simpler.
> 

As you say, this code in arm64/powerpc will not meet the issue I 
address. For example, If the addr here is (1<<50) on arm64, the 
arch_get_mmap_base will return base+TASK_SIZE-DEFAULT_MAP_WINDOW which 
is (1<<vabits_actual). And this behavior on arm64/powerpc/x86 does not 
break anything since we will use a larger address space if the hint 
address is specified on the address > DEFAULT_MAP_WINDOW. The 
corresponding behavior on RISC-V should be if the hint address > BIT(47) 
then use Sv57 address space and use Sv48 when the hint address > BIT(38) 
if we want Sv39 by default.

However, your patch needs the address >= BIT(47) rather than BIT(38) to 
use Sv48 and address >= BIT(56) to use Sv57, thus breaking existing 
userspace software to create mapping on the hint address without 
MAP_FIXED set.

>> violate the principle of mmap syscall as kernel should take the hint and
>> attempt to create the mapping there.
> 
> Although the man page for mmap does say "on Linux, the kernel will pick
> a nearby page boundary" it is still a hint address so there is no strict
> requirement (and the precedent has already been set by arm64/powerpc).
> 

Yeah. There is no strict requirement. But currently x86/arm64/powerpc 
works in this situation well. The hint address on these ISAs is not used 
as the upper bound to allocating the address. However, on RISC-V, you 
treat this as the upper bound.

>>
>> I don't think patching in this way is right. However, if we only revert
>> this patch, some programs relying on mmap to return address with effective
>> bits <= 48 will still be an issue and it might expand to other ISAs if
>> they implement larger virtual address space like RISC-V sv57. A better way
>> to solve this might be adding a MAP_48BIT flag to mmap like MAP_32BIT has
>> been introduced for decades.
>>
>> Thanks,
>> Yangyu Chen
>>
> 
> - Charlie
>
Charlie Jenkins Jan. 20, 2024, 6:49 a.m. UTC | #5
On Sat, Jan 20, 2024 at 02:13:14PM +0800, Yangyu Chen wrote:
> Thanks for your reply.
> 
> On 1/20/24 09:34, Charlie Jenkins wrote:
> > On Sun, Jan 14, 2024 at 01:26:57AM +0800, Yangyu Chen wrote:
> > > Hi, Charlie
> > > 
> > > Although this patchset has been merged I still have some questions about
> > > this patchset. Because it breaks regular mmap if address >= 38 bits on
> > > sv48 / sv57 capable systems like qemu. For example, If a userspace program
> > > wants to mmap an anonymous page to addr=(1<<45) on an sv48 capable system,
> > > it will fail and kernel will mmaped to another sv39 address since it does
> > 
> > Thank you for raising this concern. To make sure I am understanding
> > correctly, you are passing a hint address of (1<<45) and expecting mmap
> > to return 1<<45 and if it returns a different address you are describing
> > mmap as failing? If you want an address that is in the sv48 space you
> > can pass in an address that is greater than 1<<47.
> > 
> > > not meet the requirement to use sv48 as you wrote:
> > > 
> > > > 	else if ((((_addr) >= VA_USER_SV48)) && (VA_BITS >= VA_BITS_SV48)) \
> > > > 		mmap_end = VA_USER_SV48;			\
> > > > 	else							\
> > > > 		mmap_end = VA_USER_SV39;			\
> > > 
> > > Then, How can a userspace program create a mmap with a hint if the address
> > > > = (1<<38) after your patch without MAP_FIXED? The only way to do this is
> > > to pass a hint >= (1<<47) on mmap syscall then kernel will return a random
> > > address in sv48 address space but the hint address gets lost. I think this
> > 
> > In order to force mmap to return the address provided you must use
> > MAP_FIXED. Otherwise, the address is a "hint" and has no guarantees. The
> > hint address on riscv is used to mean "don't give me an address that
> > uses more bits than this". This behavior is not unique to riscv, arm64
> > and powerpc use a similar scheme. In arch/arm64/include/asm/processor.h
> > there is the following code:
> > 
> > #define arch_get_mmap_base(addr, base) ((addr > DEFAULT_MAP_WINDOW) ? \
> > 					base + TASK_SIZE - DEFAULT_MAP_WINDOW :\
> > 					base)
> > 
> > arm64/powerpc are only concerned with a single boundary so the code is simpler.
> > 
> 
> As you say, this code in arm64/powerpc will not meet the issue I address.
> For example, If the addr here is (1<<50) on arm64, the arch_get_mmap_base
> will return base+TASK_SIZE-DEFAULT_MAP_WINDOW which is (1<<vabits_actual).
> And this behavior on arm64/powerpc/x86 does not break anything since we will
> use a larger address space if the hint address is specified on the address >
> DEFAULT_MAP_WINDOW. The corresponding behavior on RISC-V should be if the
> hint address > BIT(47) then use Sv57 address space and use Sv48 when the
> hint address > BIT(38) if we want Sv39 by default.
> 
> However, your patch needs the address >= BIT(47) rather than BIT(38) to use
> Sv48 and address >= BIT(56) to use Sv57, thus breaking existing userspace
> software to create mapping on the hint address without MAP_FIXED set.

Code that needs mmap to provide a specific address must use MAP_FIXED.
On riscv, it was decided that the address returned from mmap cannot be
greater than the hint address. This is currently implemented by using
the largest address space that can fit into the hint address. It may be
possible that this range can be extended to use all of the addresses
that are less than or equal to the hint address.

From reading the code even on arm64 if you pass an address that is
greater than DEFAULT_MAP_WINDOW it is not guaranteed that mmap will
return an address that is greater than DEFAULT_MAP_WINDOW. It may still
be provide an address that is less than DEFAULT_MAP_WINDOW if it fails
to find an address above. This seems like this would also break your use
case.

> 
> > > violate the principle of mmap syscall as kernel should take the hint and
> > > attempt to create the mapping there.
> > 
> > Although the man page for mmap does say "on Linux, the kernel will pick
> > a nearby page boundary" it is still a hint address so there is no strict
> > requirement (and the precedent has already been set by arm64/powerpc).
> > 
> 
> Yeah. There is no strict requirement. But currently x86/arm64/powerpc works
> in this situation well. The hint address on these ISAs is not used as the
> upper bound to allocating the address. However, on RISC-V, you treat this as
> the upper bound.
> 
> > > 
> > > I don't think patching in this way is right. However, if we only revert
> > > this patch, some programs relying on mmap to return address with effective
> > > bits <= 48 will still be an issue and it might expand to other ISAs if
> > > they implement larger virtual address space like RISC-V sv57. A better way
> > > to solve this might be adding a MAP_48BIT flag to mmap like MAP_32BIT has
> > > been introduced for decades.
> > > 
> > > Thanks,
> > > Yangyu Chen
> > > 
> > 
> > - Charlie
> > 
>
Yangyu Chen Jan. 20, 2024, 7:09 a.m. UTC | #6
On 1/20/24 14:49, Charlie Jenkins wrote:
> On Sat, Jan 20, 2024 at 02:13:14PM +0800, Yangyu Chen wrote:
>> Thanks for your reply.
>>
>> On 1/20/24 09:34, Charlie Jenkins wrote:
>>> On Sun, Jan 14, 2024 at 01:26:57AM +0800, Yangyu Chen wrote:
>>>> Hi, Charlie
>>>>
>>>> Although this patchset has been merged I still have some questions about
>>>> this patchset. Because it breaks regular mmap if address >= 38 bits on
>>>> sv48 / sv57 capable systems like qemu. For example, If a userspace program
>>>> wants to mmap an anonymous page to addr=(1<<45) on an sv48 capable system,
>>>> it will fail and kernel will mmaped to another sv39 address since it does
>>>
>>> Thank you for raising this concern. To make sure I am understanding
>>> correctly, you are passing a hint address of (1<<45) and expecting mmap
>>> to return 1<<45 and if it returns a different address you are describing
>>> mmap as failing? If you want an address that is in the sv48 space you
>>> can pass in an address that is greater than 1<<47.
>>>
>>>> not meet the requirement to use sv48 as you wrote:
>>>>
>>>>> 	else if ((((_addr) >= VA_USER_SV48)) && (VA_BITS >= VA_BITS_SV48)) \
>>>>> 		mmap_end = VA_USER_SV48;			\
>>>>> 	else							\
>>>>> 		mmap_end = VA_USER_SV39;			\
>>>>
>>>> Then, How can a userspace program create a mmap with a hint if the address
>>>>> = (1<<38) after your patch without MAP_FIXED? The only way to do this is
>>>> to pass a hint >= (1<<47) on mmap syscall then kernel will return a random
>>>> address in sv48 address space but the hint address gets lost. I think this
>>>
>>> In order to force mmap to return the address provided you must use
>>> MAP_FIXED. Otherwise, the address is a "hint" and has no guarantees. The
>>> hint address on riscv is used to mean "don't give me an address that
>>> uses more bits than this". This behavior is not unique to riscv, arm64
>>> and powerpc use a similar scheme. In arch/arm64/include/asm/processor.h
>>> there is the following code:
>>>
>>> #define arch_get_mmap_base(addr, base) ((addr > DEFAULT_MAP_WINDOW) ? \
>>> 					base + TASK_SIZE - DEFAULT_MAP_WINDOW :\
>>> 					base)
>>>
>>> arm64/powerpc are only concerned with a single boundary so the code is simpler.
>>>
>>
>> As you say, this code in arm64/powerpc will not meet the issue I address.
>> For example, If the addr here is (1<<50) on arm64, the arch_get_mmap_base
>> will return base+TASK_SIZE-DEFAULT_MAP_WINDOW which is (1<<vabits_actual).
>> And this behavior on arm64/powerpc/x86 does not break anything since we will
>> use a larger address space if the hint address is specified on the address >
>> DEFAULT_MAP_WINDOW. The corresponding behavior on RISC-V should be if the
>> hint address > BIT(47) then use Sv57 address space and use Sv48 when the
>> hint address > BIT(38) if we want Sv39 by default.
>>
>> However, your patch needs the address >= BIT(47) rather than BIT(38) to use
>> Sv48 and address >= BIT(56) to use Sv57, thus breaking existing userspace
>> software to create mapping on the hint address without MAP_FIXED set.
> 
> Code that needs mmap to provide a specific address must use MAP_FIXED.
> On riscv, it was decided that the address returned from mmap cannot be
> greater than the hint address. This is currently implemented by using
> the largest address space that can fit into the hint address. It may be
> possible that this range can be extended to use all of the addresses
> that are less than or equal to the hint address.
> 

So this decision might be wrong. It requires some userspace software to 
modify their mmap flags to fit with this. For example, a binary 
translate JIT compiler already probes this platform is capable with 
Sv48, then want to create mapping on some address specified on the mmap 
hint to align with foreign binary native address but also provide a 
fallback path with performance overhead. Your patch here will always let 
userspace software use a fallback path with performance overhead until 
the userspace software changes its syscall to use MAP_FIXED. But it is 
not required in x86, arm64, powerpc.

>  From reading the code even on arm64 if you pass an address that is
> greater than DEFAULT_MAP_WINDOW it is not guaranteed that mmap will
> return an address that is greater than DEFAULT_MAP_WINDOW. It may still
> be provide an address that is less than DEFAULT_MAP_WINDOW if it fails
> to find an address above. This seems like this would also break your use
> case.
> 

Yeah. As I said before, this patch will always let userspace software 
use a fallback path and this only happens in RISC-V. Make default sv48 
is right, but RISC-V implementation for this and changing the hint 
address behavior might be wrong. And x86, arm64, powerpc already use
48-bit address space by default but do not change the meaning of hint 
address on mmap.

>>
>>>> violate the principle of mmap syscall as kernel should take the hint and
>>>> attempt to create the mapping there.
>>>
>>> Although the man page for mmap does say "on Linux, the kernel will pick
>>> a nearby page boundary" it is still a hint address so there is no strict
>>> requirement (and the precedent has already been set by arm64/powerpc).
>>>
>>
>> Yeah. There is no strict requirement. But currently x86/arm64/powerpc works
>> in this situation well. The hint address on these ISAs is not used as the
>> upper bound to allocating the address. However, on RISC-V, you treat this as
>> the upper bound.
>>
>>>>
>>>> I don't think patching in this way is right. However, if we only revert
>>>> this patch, some programs relying on mmap to return address with effective
>>>> bits <= 48 will still be an issue and it might expand to other ISAs if
>>>> they implement larger virtual address space like RISC-V sv57. A better way
>>>> to solve this might be adding a MAP_48BIT flag to mmap like MAP_32BIT has
>>>> been introduced for decades.
>>>>
>>>> Thanks,
>>>> Yangyu Chen
>>>>
>>>
>>> - Charlie
>>>
>>
Charlie Jenkins Jan. 22, 2024, 7:56 p.m. UTC | #7
On Sat, Jan 20, 2024 at 03:09:51PM +0800, Yangyu Chen wrote:
> 
> 
> On 1/20/24 14:49, Charlie Jenkins wrote:
> > On Sat, Jan 20, 2024 at 02:13:14PM +0800, Yangyu Chen wrote:
> > > Thanks for your reply.
> > > 
> > > On 1/20/24 09:34, Charlie Jenkins wrote:
> > > > On Sun, Jan 14, 2024 at 01:26:57AM +0800, Yangyu Chen wrote:
> > > > > Hi, Charlie
> > > > > 
> > > > > Although this patchset has been merged I still have some questions about
> > > > > this patchset. Because it breaks regular mmap if address >= 38 bits on
> > > > > sv48 / sv57 capable systems like qemu. For example, If a userspace program
> > > > > wants to mmap an anonymous page to addr=(1<<45) on an sv48 capable system,
> > > > > it will fail and kernel will mmaped to another sv39 address since it does
> > > > 
> > > > Thank you for raising this concern. To make sure I am understanding
> > > > correctly, you are passing a hint address of (1<<45) and expecting mmap
> > > > to return 1<<45 and if it returns a different address you are describing
> > > > mmap as failing? If you want an address that is in the sv48 space you
> > > > can pass in an address that is greater than 1<<47.
> > > > 
> > > > > not meet the requirement to use sv48 as you wrote:
> > > > > 
> > > > > > 	else if ((((_addr) >= VA_USER_SV48)) && (VA_BITS >= VA_BITS_SV48)) \
> > > > > > 		mmap_end = VA_USER_SV48;			\
> > > > > > 	else							\
> > > > > > 		mmap_end = VA_USER_SV39;			\
> > > > > 
> > > > > Then, How can a userspace program create a mmap with a hint if the address
> > > > > > = (1<<38) after your patch without MAP_FIXED? The only way to do this is
> > > > > to pass a hint >= (1<<47) on mmap syscall then kernel will return a random
> > > > > address in sv48 address space but the hint address gets lost. I think this
> > > > 
> > > > In order to force mmap to return the address provided you must use
> > > > MAP_FIXED. Otherwise, the address is a "hint" and has no guarantees. The
> > > > hint address on riscv is used to mean "don't give me an address that
> > > > uses more bits than this". This behavior is not unique to riscv, arm64
> > > > and powerpc use a similar scheme. In arch/arm64/include/asm/processor.h
> > > > there is the following code:
> > > > 
> > > > #define arch_get_mmap_base(addr, base) ((addr > DEFAULT_MAP_WINDOW) ? \
> > > > 					base + TASK_SIZE - DEFAULT_MAP_WINDOW :\
> > > > 					base)
> > > > 
> > > > arm64/powerpc are only concerned with a single boundary so the code is simpler.
> > > > 
> > > 
> > > As you say, this code in arm64/powerpc will not meet the issue I address.
> > > For example, If the addr here is (1<<50) on arm64, the arch_get_mmap_base
> > > will return base+TASK_SIZE-DEFAULT_MAP_WINDOW which is (1<<vabits_actual).
> > > And this behavior on arm64/powerpc/x86 does not break anything since we will
> > > use a larger address space if the hint address is specified on the address >
> > > DEFAULT_MAP_WINDOW. The corresponding behavior on RISC-V should be if the
> > > hint address > BIT(47) then use Sv57 address space and use Sv48 when the
> > > hint address > BIT(38) if we want Sv39 by default.
> > > 
> > > However, your patch needs the address >= BIT(47) rather than BIT(38) to use
> > > Sv48 and address >= BIT(56) to use Sv57, thus breaking existing userspace
> > > software to create mapping on the hint address without MAP_FIXED set.
> > 
> > Code that needs mmap to provide a specific address must use MAP_FIXED.
> > On riscv, it was decided that the address returned from mmap cannot be
> > greater than the hint address. This is currently implemented by using
> > the largest address space that can fit into the hint address. It may be
> > possible that this range can be extended to use all of the addresses
> > that are less than or equal to the hint address.
> > 
> 
> So this decision might be wrong. It requires some userspace software to
> modify their mmap flags to fit with this. For example, a binary translate
> JIT compiler already probes this platform is capable with Sv48, then want to
> create mapping on some address specified on the mmap hint to align with
> foreign binary native address but also provide a fallback path with
> performance overhead. Your patch here will always let userspace software use

I do not follow. This mechanism allows a program to always know how many
bits will be available in the virtual address provided by mmap,
regardless of the size of the underlying virtual address space.

The phrasing "align with foreign binary native address" seems like the
program requires a specific address, which is never guaranteed by mmap
without MAP_FIXED. If the program is relying on mmap to provide the
address without MAP_FIXED, the program is relying on behavior that
cannot be expected to remain constant across Linux releases.

> a fallback path with performance overhead until the userspace software
> changes its syscall to use MAP_FIXED. But it is not required in x86, arm64,
> powerpc.
> 
> >  From reading the code even on arm64 if you pass an address that is
> > greater than DEFAULT_MAP_WINDOW it is not guaranteed that mmap will
> > return an address that is greater than DEFAULT_MAP_WINDOW. It may still
> > be provide an address that is less than DEFAULT_MAP_WINDOW if it fails
> > to find an address above. This seems like this would also break your use
> > case.
> > 
> 
> Yeah. As I said before, this patch will always let userspace software use a
> fallback path and this only happens in RISC-V. Make default sv48 is right,
> but RISC-V implementation for this and changing the hint address behavior
> might be wrong. And x86, arm64, powerpc already use
> 48-bit address space by default but do not change the meaning of hint
> address on mmap.
> 
> > > 
> > > > > violate the principle of mmap syscall as kernel should take the hint and
> > > > > attempt to create the mapping there.
> > > > 
> > > > Although the man page for mmap does say "on Linux, the kernel will pick
> > > > a nearby page boundary" it is still a hint address so there is no strict
> > > > requirement (and the precedent has already been set by arm64/powerpc).
> > > > 
> > > 
> > > Yeah. There is no strict requirement. But currently x86/arm64/powerpc works
> > > in this situation well. The hint address on these ISAs is not used as the
> > > upper bound to allocating the address. However, on RISC-V, you treat this as
> > > the upper bound.
> > > 
> > > > > 
> > > > > I don't think patching in this way is right. However, if we only revert
> > > > > this patch, some programs relying on mmap to return address with effective
> > > > > bits <= 48 will still be an issue and it might expand to other ISAs if
> > > > > they implement larger virtual address space like RISC-V sv57. A better way
> > > > > to solve this might be adding a MAP_48BIT flag to mmap like MAP_32BIT has
> > > > > been introduced for decades.
> > > > > 
> > > > > Thanks,
> > > > > Yangyu Chen
> > > > > 
> > > > 
> > > > - Charlie
> > > > 
> > > 
>