diff mbox series

[RFC,v3,1/2] mm: Add personality flag to limit address to 47 bits

Message ID 20240905-patches-below_hint_mmap-v3-1-3cd5564efbbb@rivosinc.com (mailing list archive)
State New
Headers show
Series mm: Introduce ADDR_LIMIT_47BIT personality flag | expand

Commit Message

Charlie Jenkins Sept. 5, 2024, 9:15 p.m. UTC
Create a personality flag ADDR_LIMIT_47BIT to support applications
that wish to transition from running in environments that support at
most 47-bit VAs to environments that support larger VAs. This
personality can be set to cause all allocations to be below the 47-bit
boundary. Using MAP_FIXED with mmap() will bypass this restriction.

Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
---
 include/uapi/linux/personality.h | 1 +
 mm/mmap.c                        | 3 +++
 2 files changed, 4 insertions(+)

Comments

Michael Ellerman Sept. 6, 2024, 6:59 a.m. UTC | #1
Charlie Jenkins <charlie@rivosinc.com> writes:
> Create a personality flag ADDR_LIMIT_47BIT to support applications
> that wish to transition from running in environments that support at
> most 47-bit VAs to environments that support larger VAs. This
> personality can be set to cause all allocations to be below the 47-bit
> boundary. Using MAP_FIXED with mmap() will bypass this restriction.
>
> Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
> ---
>  include/uapi/linux/personality.h | 1 +
>  mm/mmap.c                        | 3 +++
>  2 files changed, 4 insertions(+)
>
> diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h
> index 49796b7756af..cd3b8c154d9b 100644
> --- a/include/uapi/linux/personality.h
> +++ b/include/uapi/linux/personality.h
> @@ -22,6 +22,7 @@ enum {
>  	WHOLE_SECONDS =		0x2000000,
>  	STICKY_TIMEOUTS	=	0x4000000,
>  	ADDR_LIMIT_3GB = 	0x8000000,
> +	ADDR_LIMIT_47BIT = 	0x10000000,
>  };

I wonder if ADDR_LIMIT_128T would be clearer?

Have you looked at writing an update for the personality(2) man page? :)

cheers
Arnd Bergmann Sept. 6, 2024, 7:17 a.m. UTC | #2
On Thu, Sep 5, 2024, at 21:15, Charlie Jenkins wrote:
> Create a personality flag ADDR_LIMIT_47BIT to support applications
> that wish to transition from running in environments that support at
> most 47-bit VAs to environments that support larger VAs. This
> personality can be set to cause all allocations to be below the 47-bit
> boundary. Using MAP_FIXED with mmap() will bypass this restriction.
>
> Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>

I think having an architecture-independent mechanism to limit the size
of the 64-bit address space is useful in general, and we've discussed
the same thing for arm64 in the past, though we have not actually
reached an agreement on the ABI previously.

> @@ -22,6 +22,7 @@ enum {
>  	WHOLE_SECONDS =		0x2000000,
>  	STICKY_TIMEOUTS	=	0x4000000,
>  	ADDR_LIMIT_3GB = 	0x8000000,
> +	ADDR_LIMIT_47BIT = 	0x10000000,
> };

I'm a bit worried about having this done specifically in the
personality flag bits, as they are rather limited. We obviously
don't want to add many more such flags when there could be
a way to just set the default limit.

It's also unclear to me how we want this flag to interact with
the existing logic in arch_get_mmap_end(), which attempts to
limit the default mapping to a 47-bit address space already.

For some reason, it appears that the arch_get_mmap_end()
logic on RISC-V defaults to the maximum address
space for the 'addr==0' case which is inconsistentn with
the other architectures, so we should probably fix that
part first, possibly moving more of that logic into a
shared implementation.

      Arnd
Lorenzo Stoakes Sept. 6, 2024, 8:02 a.m. UTC | #3
On Fri, Sep 06, 2024 at 07:17:44AM GMT, Arnd Bergmann wrote:
> On Thu, Sep 5, 2024, at 21:15, Charlie Jenkins wrote:
> > Create a personality flag ADDR_LIMIT_47BIT to support applications
> > that wish to transition from running in environments that support at
> > most 47-bit VAs to environments that support larger VAs. This
> > personality can be set to cause all allocations to be below the 47-bit
> > boundary. Using MAP_FIXED with mmap() will bypass this restriction.
> >
> > Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
>
> I think having an architecture-independent mechanism to limit the size
> of the 64-bit address space is useful in general, and we've discussed
> the same thing for arm64 in the past, though we have not actually
> reached an agreement on the ABI previously.

The thread on the original proposals attests to this being rather a fraught
topic, and I think the weight of opinion was more so in favour of opt-in
rather than opt-out.

>
> > @@ -22,6 +22,7 @@ enum {
> >  	WHOLE_SECONDS =		0x2000000,
> >  	STICKY_TIMEOUTS	=	0x4000000,
> >  	ADDR_LIMIT_3GB = 	0x8000000,
> > +	ADDR_LIMIT_47BIT = 	0x10000000,
> > };
>
> I'm a bit worried about having this done specifically in the
> personality flag bits, as they are rather limited. We obviously
> don't want to add many more such flags when there could be
> a way to just set the default limit.

Since I'm the one who suggested it, I feel I should offer some kind of
vague defence here :)

We shouldn't let perfect be the enemy of the good. This is a relatively
straightforward means of achieving the aim (assuming your concern about
arch_get_mmap_end() below isn't a blocker) which has the least impact on
existing code.

Of course we can end up in absurdities where we start doing
ADDR_LIMIT_xxBIT... but again - it's simple, shouldn't represent an
egregious maintenance burden and is entirely opt-in so has things going for
it.

>
> It's also unclear to me how we want this flag to interact with
> the existing logic in arch_get_mmap_end(), which attempts to
> limit the default mapping to a 47-bit address space already.

How does ADDR_LIMIT_3GB presently interact with that?

>
> For some reason, it appears that the arch_get_mmap_end()
> logic on RISC-V defaults to the maximum address
> space for the 'addr==0' case which is inconsistentn with
> the other architectures, so we should probably fix that
> part first, possibly moving more of that logic into a
> shared implementation.
>
>       Arnd
Lorenzo Stoakes Sept. 6, 2024, 8:14 a.m. UTC | #4
On Fri, Sep 06, 2024 at 07:17:44AM GMT, Arnd Bergmann wrote:
> On Thu, Sep 5, 2024, at 21:15, Charlie Jenkins wrote:
> > Create a personality flag ADDR_LIMIT_47BIT to support applications
> > that wish to transition from running in environments that support at
> > most 47-bit VAs to environments that support larger VAs. This
> > personality can be set to cause all allocations to be below the 47-bit
> > boundary. Using MAP_FIXED with mmap() will bypass this restriction.
> >
> > Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
>
> I think having an architecture-independent mechanism to limit the size
> of the 64-bit address space is useful in general, and we've discussed
> the same thing for arm64 in the past, though we have not actually
> reached an agreement on the ABI previously.

The thread on the original proposals attests to this being rather a fraught
topic, and I think the weight of opinion was more so in favour of opt-in
rather than opt-out.

>
> > @@ -22,6 +22,7 @@ enum {
> >  	WHOLE_SECONDS =		0x2000000,
> >  	STICKY_TIMEOUTS	=	0x4000000,
> >  	ADDR_LIMIT_3GB = 	0x8000000,
> > +	ADDR_LIMIT_47BIT = 	0x10000000,
> > };
>
> I'm a bit worried about having this done specifically in the
> personality flag bits, as they are rather limited. We obviously
> don't want to add many more such flags when there could be
> a way to just set the default limit.

Since I'm the one who suggested it, I feel I should offer some kind of
vague defence here :)

We shouldn't let perfect be the enemy of the good. This is a relatively
straightforward means of achieving the aim (assuming your concern about
arch_get_mmap_end() below isn't a blocker) which has the least impact on
existing code.

Of course we can end up in absurdities where we start doing
ADDR_LIMIT_xxBIT... but again - it's simple, shouldn't represent an
egregious maintenance burden and is entirely opt-in so has things going for
it.

>
> It's also unclear to me how we want this flag to interact with
> the existing logic in arch_get_mmap_end(), which attempts to
> limit the default mapping to a 47-bit address space already.

How does ADDR_LIMIT_3GB presently interact with that?

>
> For some reason, it appears that the arch_get_mmap_end()
> logic on RISC-V defaults to the maximum address
> space for the 'addr==0' case which is inconsistentn with
> the other architectures, so we should probably fix that
> part first, possibly moving more of that logic into a
> shared implementation.
>
>       Arnd
Arnd Bergmann Sept. 6, 2024, 9:14 a.m. UTC | #5
On Fri, Sep 6, 2024, at 08:14, Lorenzo Stoakes wrote:
> On Fri, Sep 06, 2024 at 07:17:44AM GMT, Arnd Bergmann wrote:
>> On Thu, Sep 5, 2024, at 21:15, Charlie Jenkins wrote:
>> > Create a personality flag ADDR_LIMIT_47BIT to support applications
>> > that wish to transition from running in environments that support at
>> > most 47-bit VAs to environments that support larger VAs. This
>> > personality can be set to cause all allocations to be below the 47-bit
>> > boundary. Using MAP_FIXED with mmap() will bypass this restriction.
>> >
>> > Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
>>
>> I think having an architecture-independent mechanism to limit the size
>> of the 64-bit address space is useful in general, and we've discussed
>> the same thing for arm64 in the past, though we have not actually
>> reached an agreement on the ABI previously.
>
> The thread on the original proposals attests to this being rather a fraught
> topic, and I think the weight of opinion was more so in favour of opt-in
> rather than opt-out.

You mean opt-in to using the larger addresses like we do on arm64 and
powerpc, while "opt-out" means a limit as Charlie suggested?

>> > @@ -22,6 +22,7 @@ enum {
>> >  	WHOLE_SECONDS =		0x2000000,
>> >  	STICKY_TIMEOUTS	=	0x4000000,
>> >  	ADDR_LIMIT_3GB = 	0x8000000,
>> > +	ADDR_LIMIT_47BIT = 	0x10000000,
>> > };
>>
>> I'm a bit worried about having this done specifically in the
>> personality flag bits, as they are rather limited. We obviously
>> don't want to add many more such flags when there could be
>> a way to just set the default limit.
>
> Since I'm the one who suggested it, I feel I should offer some kind of
> vague defence here :)
>
> We shouldn't let perfect be the enemy of the good. This is a relatively
> straightforward means of achieving the aim (assuming your concern about
> arch_get_mmap_end() below isn't a blocker) which has the least impact on
> existing code.
>
> Of course we can end up in absurdities where we start doing
> ADDR_LIMIT_xxBIT... but again - it's simple, shouldn't represent an
> egregious maintenance burden and is entirely opt-in so has things going for
> it.

I'm more confused now, I think most importantly we should try to
handle this consistently across all architectures. The proposed
implementation seems to completely block addresses above BIT(47)
even for applications that opt in by calling mmap(BIT(47), ...),
which seems to break the existing applications.

If we want this flag for RISC-V and also keep the behavior of
defaulting to >BIT(47) addresses for mmap(0, ...) how about
changing arch_get_mmap_end() to return the limit based on
ADDR_LIMIT_47BIT and then make this default to enabled on
arm64 and powerpc but disabled on riscv?

>> It's also unclear to me how we want this flag to interact with
>> the existing logic in arch_get_mmap_end(), which attempts to
>> limit the default mapping to a 47-bit address space already.
>
> How does ADDR_LIMIT_3GB presently interact with that?

That is x86 specific and only relevant to compat tasks, limiting
them to 3 instead of 4 GB. There is also ADDR_LIMIT_32BIT, which
on arm32 is always set in practice to allow 32-bit addressing 
as opposed to ARMv2 style 26-bit addressing (IIRC ARMv3 supported
both 26-bit and 32-bit addressing, while ARMv4 through ARMv7 are
32-bit only.

      Arnd
Guo Ren Sept. 6, 2024, 9:14 a.m. UTC | #6
On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Thu, Sep 5, 2024, at 21:15, Charlie Jenkins wrote:
> > Create a personality flag ADDR_LIMIT_47BIT to support applications
> > that wish to transition from running in environments that support at
> > most 47-bit VAs to environments that support larger VAs. This
> > personality can be set to cause all allocations to be below the 47-bit
> > boundary. Using MAP_FIXED with mmap() will bypass this restriction.
> >
> > Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
>
> I think having an architecture-independent mechanism to limit the size
> of the 64-bit address space is useful in general, and we've discussed
> the same thing for arm64 in the past, though we have not actually
> reached an agreement on the ABI previously.
>
> > @@ -22,6 +22,7 @@ enum {
> >       WHOLE_SECONDS =         0x2000000,
> >       STICKY_TIMEOUTS =       0x4000000,
> >       ADDR_LIMIT_3GB =        0x8000000,
> > +     ADDR_LIMIT_47BIT =      0x10000000,
> > };
>
> I'm a bit worried about having this done specifically in the
> personality flag bits, as they are rather limited. We obviously
> don't want to add many more such flags when there could be
> a way to just set the default limit.
>
> It's also unclear to me how we want this flag to interact with
> the existing logic in arch_get_mmap_end(), which attempts to
> limit the default mapping to a 47-bit address space already.

To optimize RISC-V progress, I recommend:

Step 1: Approve the patch.
Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
Step 3: Wait approximately several iterations for Go & OpenJDK
Step 4: Remove the 47-bit constraint in arch_get_mmap_end()


>
> For some reason, it appears that the arch_get_mmap_end()
> logic on RISC-V defaults to the maximum address
> space for the 'addr==0' case which is inconsistentn with
> the other architectures, so we should probably fix that
> part first, possibly moving more of that logic into a
> shared implementation.
>
>       Arnd
>
Lorenzo Stoakes Sept. 6, 2024, 9:52 a.m. UTC | #7
(Sorry having issues with my IPv6 setup that duplicated the original email...

On Fri, Sep 06, 2024 at 09:14:08AM GMT, Arnd Bergmann wrote:
> On Fri, Sep 6, 2024, at 08:14, Lorenzo Stoakes wrote:
> > On Fri, Sep 06, 2024 at 07:17:44AM GMT, Arnd Bergmann wrote:
> >> On Thu, Sep 5, 2024, at 21:15, Charlie Jenkins wrote:
> >> > Create a personality flag ADDR_LIMIT_47BIT to support applications
> >> > that wish to transition from running in environments that support at
> >> > most 47-bit VAs to environments that support larger VAs. This
> >> > personality can be set to cause all allocations to be below the 47-bit
> >> > boundary. Using MAP_FIXED with mmap() will bypass this restriction.
> >> >
> >> > Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
> >>
> >> I think having an architecture-independent mechanism to limit the size
> >> of the 64-bit address space is useful in general, and we've discussed
> >> the same thing for arm64 in the past, though we have not actually
> >> reached an agreement on the ABI previously.
> >
> > The thread on the original proposals attests to this being rather a fraught
> > topic, and I think the weight of opinion was more so in favour of opt-in
> > rather than opt-out.
>
> You mean opt-in to using the larger addresses like we do on arm64 and
> powerpc, while "opt-out" means a limit as Charlie suggested?

I guess I'm not using brilliant terminology here haha!

To clarify - the weight of opinion was for a situation where the address
space is limited, except if you set a hint above that (you could call that
opt-out or opt-in depending which way you look at it, so yeah ok very
unclear sorry!).

It was against the MAP_ flag and also I think a _flexible_ per-process
limit is also questionable as you might end up setting a limit which breaks
something else, and this starts getting messy quick.

To be clear, the ADDR_LIMIT_47BIT suggestion is absolutely a compromise and
practical suggestion.

>
> >> > @@ -22,6 +22,7 @@ enum {
> >> >  	WHOLE_SECONDS =		0x2000000,
> >> >  	STICKY_TIMEOUTS	=	0x4000000,
> >> >  	ADDR_LIMIT_3GB = 	0x8000000,
> >> > +	ADDR_LIMIT_47BIT = 	0x10000000,
> >> > };
> >>
> >> I'm a bit worried about having this done specifically in the
> >> personality flag bits, as they are rather limited. We obviously
> >> don't want to add many more such flags when there could be
> >> a way to just set the default limit.
> >
> > Since I'm the one who suggested it, I feel I should offer some kind of
> > vague defence here :)
> >
> > We shouldn't let perfect be the enemy of the good. This is a relatively
> > straightforward means of achieving the aim (assuming your concern about
> > arch_get_mmap_end() below isn't a blocker) which has the least impact on
> > existing code.
> >
> > Of course we can end up in absurdities where we start doing
> > ADDR_LIMIT_xxBIT... but again - it's simple, shouldn't represent an
> > egregious maintenance burden and is entirely opt-in so has things going for
> > it.
>
> I'm more confused now, I think most importantly we should try to
> handle this consistently across all architectures. The proposed
> implementation seems to completely block addresses above BIT(47)
> even for applications that opt in by calling mmap(BIT(47), ...),
> which seems to break the existing applications.

Hm, I thought the commit message suggested the hint overrides it still?

The intent is to optionally be able to run a process that keeps higher bits
free for tagging and to be sure no memory mapping in the process will
clobber these (correct me if I'm wrong Charlie! :)

So you really wouldn't want this if you are using tagged pointers, you'd
want to be sure literally nothing touches the higher bits.

>
> If we want this flag for RISC-V and also keep the behavior of
> defaulting to >BIT(47) addresses for mmap(0, ...) how about
> changing arch_get_mmap_end() to return the limit based on
> ADDR_LIMIT_47BIT and then make this default to enabled on
> arm64 and powerpc but disabled on riscv?

But you wouldn't necessarily want all processes to be so restricted, I
think this is what Charlie's trying to avoid :)

On the ohter hand - I'm not sure there are many processes on any arch
that'd want the higher mappings.

So that'd push us again towards risc v just limiting to 48-bits and only
mapping above this if a hint is provided like x86-64 does (and as you
mentioned via irc - it seems risc v is an outlier in that
DEFAULT_MAP_WINDOW == TASK_SIZE).

This would be more consistent vs. other arches.

>
> >> It's also unclear to me how we want this flag to interact with
> >> the existing logic in arch_get_mmap_end(), which attempts to
> >> limit the default mapping to a 47-bit address space already.
> >
> > How does ADDR_LIMIT_3GB presently interact with that?
>
> That is x86 specific and only relevant to compat tasks, limiting
> them to 3 instead of 4 GB. There is also ADDR_LIMIT_32BIT, which
> on arm32 is always set in practice to allow 32-bit addressing
> as opposed to ARMv2 style 26-bit addressing (IIRC ARMv3 supported
> both 26-bit and 32-bit addressing, while ARMv4 through ARMv7 are
> 32-bit only.

OK, I understand what it's for, I missed it was arch-specific bit, urgh.

I'd say this limit should be min of the arch-specific limit vs. the 48-bit
limit. If you have a 36-bit address space obviously it'd be rather unwise
to try to provide 48 bit addresses..

>
>       Arnd
Arnd Bergmann Sept. 6, 2024, 9:55 a.m. UTC | #8
On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
> On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann <arnd@arndb.de> wrote:
>>
>> It's also unclear to me how we want this flag to interact with
>> the existing logic in arch_get_mmap_end(), which attempts to
>> limit the default mapping to a 47-bit address space already.
>
> To optimize RISC-V progress, I recommend:
>
> Step 1: Approve the patch.
> Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
> Step 3: Wait approximately several iterations for Go & OpenJDK
> Step 4: Remove the 47-bit constraint in arch_get_mmap_end()

I really want to first see a plausible explanation about why
RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW
like all the other major architectures (x86, arm64, powerpc64),
e.g. something like the patch below (untested, probably slightly
wrong but show illustrate my point).

     Arnd

diff --git a/arch/riscv/include/asm/processor.h b/arch/riscv/include/asm/processor.h
index 8702b8721a27..de9863be1efd 100644
--- a/arch/riscv/include/asm/processor.h
+++ b/arch/riscv/include/asm/processor.h
@@ -20,17 +20,8 @@
  * mmap_end < addr, being mmap_end the top of that address space.
  * See Documentation/arch/riscv/vm-layout.rst for more details.
  */
-#define arch_get_mmap_end(addr, len, flags)			\
-({								\
-	unsigned long mmap_end;					\
-	typeof(addr) _addr = (addr);				\
-	if ((_addr) == 0 || is_compat_task() ||			\
-	    ((_addr + len) > BIT(VA_BITS - 1)))			\
-		mmap_end = STACK_TOP_MAX;			\
-	else							\
-		mmap_end = (_addr + len);			\
-	mmap_end;						\
-})
+#define arch_get_mmap_end(addr, len, flags) \
+		(((addr) > DEFAULT_MAP_WINDOW) ? TASK_SIZE : DEFAULT_MAP_WINDOW)
 
 #define arch_get_mmap_base(addr, base)				\
 ({								\
@@ -47,7 +38,7 @@
 })
 
 #ifdef CONFIG_64BIT
-#define DEFAULT_MAP_WINDOW     (UL(1) << (MMAP_VA_BITS - 1))
+#define DEFAULT_MAP_WINDOW     (is_compat_task() ? (UL(1) << (MMAP_VA_BITS - 1)) : TASK_SIZE_32)
 #define STACK_TOP_MAX          TASK_SIZE_64
 #else
 #define DEFAULT_MAP_WINDOW     TASK_SIZE
Catalin Marinas Sept. 6, 2024, 11:43 a.m. UTC | #9
On Fri, Sep 06, 2024 at 09:55:42AM +0000, Arnd Bergmann wrote:
> On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
> > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann <arnd@arndb.de> wrote:
> >> It's also unclear to me how we want this flag to interact with
> >> the existing logic in arch_get_mmap_end(), which attempts to
> >> limit the default mapping to a 47-bit address space already.
> >
> > To optimize RISC-V progress, I recommend:
> >
> > Step 1: Approve the patch.
> > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
> > Step 3: Wait approximately several iterations for Go & OpenJDK
> > Step 4: Remove the 47-bit constraint in arch_get_mmap_end()
> 
> I really want to first see a plausible explanation about why
> RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW
> like all the other major architectures (x86, arm64, powerpc64),

FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default
configuration. We end up with a 47-bit with 16K pages but for a
different reason that has to do with LPA2 support (I doubt we need this
for the user mapping but we need to untangle some of the macros there;
that's for a separate discussion).

That said, we haven't encountered any user space problems with a 48-bit
DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar
approach (47 or 48 bit default limit). Better to have some ABI
consistency between architectures. One can still ask for addresses above
this default limit via mmap().
Charlie Jenkins Sept. 9, 2024, 7:07 p.m. UTC | #10
On Fri, Sep 06, 2024 at 04:59:40PM +1000, Michael Ellerman wrote:
> Charlie Jenkins <charlie@rivosinc.com> writes:
> > Create a personality flag ADDR_LIMIT_47BIT to support applications
> > that wish to transition from running in environments that support at
> > most 47-bit VAs to environments that support larger VAs. This
> > personality can be set to cause all allocations to be below the 47-bit
> > boundary. Using MAP_FIXED with mmap() will bypass this restriction.
> >
> > Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
> > ---
> >  include/uapi/linux/personality.h | 1 +
> >  mm/mmap.c                        | 3 +++
> >  2 files changed, 4 insertions(+)
> >
> > diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h
> > index 49796b7756af..cd3b8c154d9b 100644
> > --- a/include/uapi/linux/personality.h
> > +++ b/include/uapi/linux/personality.h
> > @@ -22,6 +22,7 @@ enum {
> >  	WHOLE_SECONDS =		0x2000000,
> >  	STICKY_TIMEOUTS	=	0x4000000,
> >  	ADDR_LIMIT_3GB = 	0x8000000,
> > +	ADDR_LIMIT_47BIT = 	0x10000000,
> >  };
> 
> I wonder if ADDR_LIMIT_128T would be clearer?
> 

I don't follow, what does 128T represent?

> Have you looked at writing an update for the personality(2) man page? :)

I will write an update to the man page if this patch is approved!

> 
> cheers

- Charlie
Charlie Jenkins Sept. 9, 2024, 11:22 p.m. UTC | #11
On Fri, Sep 06, 2024 at 10:52:34AM +0100, Lorenzo Stoakes wrote:
> (Sorry having issues with my IPv6 setup that duplicated the original email...
> 
> On Fri, Sep 06, 2024 at 09:14:08AM GMT, Arnd Bergmann wrote:
> > On Fri, Sep 6, 2024, at 08:14, Lorenzo Stoakes wrote:
> > > On Fri, Sep 06, 2024 at 07:17:44AM GMT, Arnd Bergmann wrote:
> > >> On Thu, Sep 5, 2024, at 21:15, Charlie Jenkins wrote:
> > >> > Create a personality flag ADDR_LIMIT_47BIT to support applications
> > >> > that wish to transition from running in environments that support at
> > >> > most 47-bit VAs to environments that support larger VAs. This
> > >> > personality can be set to cause all allocations to be below the 47-bit
> > >> > boundary. Using MAP_FIXED with mmap() will bypass this restriction.
> > >> >
> > >> > Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
> > >>
> > >> I think having an architecture-independent mechanism to limit the size
> > >> of the 64-bit address space is useful in general, and we've discussed
> > >> the same thing for arm64 in the past, though we have not actually
> > >> reached an agreement on the ABI previously.
> > >
> > > The thread on the original proposals attests to this being rather a fraught
> > > topic, and I think the weight of opinion was more so in favour of opt-in
> > > rather than opt-out.
> >
> > You mean opt-in to using the larger addresses like we do on arm64 and
> > powerpc, while "opt-out" means a limit as Charlie suggested?
> 
> I guess I'm not using brilliant terminology here haha!
> 
> To clarify - the weight of opinion was for a situation where the address
> space is limited, except if you set a hint above that (you could call that
> opt-out or opt-in depending which way you look at it, so yeah ok very
> unclear sorry!).
> 
> It was against the MAP_ flag and also I think a _flexible_ per-process
> limit is also questionable as you might end up setting a limit which breaks
> something else, and this starts getting messy quick.
> 
> To be clear, the ADDR_LIMIT_47BIT suggestion is absolutely a compromise and
> practical suggestion.
> 
> >
> > >> > @@ -22,6 +22,7 @@ enum {
> > >> >  	WHOLE_SECONDS =		0x2000000,
> > >> >  	STICKY_TIMEOUTS	=	0x4000000,
> > >> >  	ADDR_LIMIT_3GB = 	0x8000000,
> > >> > +	ADDR_LIMIT_47BIT = 	0x10000000,
> > >> > };
> > >>
> > >> I'm a bit worried about having this done specifically in the
> > >> personality flag bits, as they are rather limited. We obviously
> > >> don't want to add many more such flags when there could be
> > >> a way to just set the default limit.
> > >
> > > Since I'm the one who suggested it, I feel I should offer some kind of
> > > vague defence here :)
> > >
> > > We shouldn't let perfect be the enemy of the good. This is a relatively
> > > straightforward means of achieving the aim (assuming your concern about
> > > arch_get_mmap_end() below isn't a blocker) which has the least impact on
> > > existing code.
> > >
> > > Of course we can end up in absurdities where we start doing
> > > ADDR_LIMIT_xxBIT... but again - it's simple, shouldn't represent an
> > > egregious maintenance burden and is entirely opt-in so has things going for
> > > it.
> >
> > I'm more confused now, I think most importantly we should try to
> > handle this consistently across all architectures. The proposed
> > implementation seems to completely block addresses above BIT(47)
> > even for applications that opt in by calling mmap(BIT(47), ...),
> > which seems to break the existing applications.
> 
> Hm, I thought the commit message suggested the hint overrides it still?
> 
> The intent is to optionally be able to run a process that keeps higher bits
> free for tagging and to be sure no memory mapping in the process will
> clobber these (correct me if I'm wrong Charlie! :)
> 
> So you really wouldn't want this if you are using tagged pointers, you'd
> want to be sure literally nothing touches the higher bits.

Various architectures handle the hint address differently, but it
appears that the only case across any architecture where an address
above 47 bits will be returned is if the application had a hint address
with a value greater than 47 bits and was using the MAP_FIXED flag.
MAP_FIXED bypasses all other checks so I was assuming that it would be
logical for MAP_FIXED to bypass this as well. If MAP_FIXED is not set,
then the intent is for no hint address to cause a value greater than 47
bits to be returned.

This does have the issue that if MAP_FIXED is used then an address can
be returned above 47-bits, but if an application does not want addresses
above 47-bits then they shouldn't ask for a fixed address above that
range.

> 
> >
> > If we want this flag for RISC-V and also keep the behavior of
> > defaulting to >BIT(47) addresses for mmap(0, ...) how about
> > changing arch_get_mmap_end() to return the limit based on
> > ADDR_LIMIT_47BIT and then make this default to enabled on
> > arm64 and powerpc but disabled on riscv?
> 
> But you wouldn't necessarily want all processes to be so restricted, I
> think this is what Charlie's trying to avoid :)
> 
> On the ohter hand - I'm not sure there are many processes on any arch
> that'd want the higher mappings.
> 
> So that'd push us again towards risc v just limiting to 48-bits and only
> mapping above this if a hint is provided like x86-64 does (and as you
> mentioned via irc - it seems risc v is an outlier in that
> DEFAULT_MAP_WINDOW == TASK_SIZE).
> 
> This would be more consistent vs. other arches.

Yes riscv is an outlier here. The reason I am pushing for something like
a flag to restrict the address space rather than setting it to be the
default is it seems like if applications are relying on upper bits to be
free, then they should be explicitly asking the kernel to keep them free
rather than assuming them to be free.

> 
> >
> > >> It's also unclear to me how we want this flag to interact with
> > >> the existing logic in arch_get_mmap_end(), which attempts to
> > >> limit the default mapping to a 47-bit address space already.
> > >
> > > How does ADDR_LIMIT_3GB presently interact with that?
> >
> > That is x86 specific and only relevant to compat tasks, limiting
> > them to 3 instead of 4 GB. There is also ADDR_LIMIT_32BIT, which
> > on arm32 is always set in practice to allow 32-bit addressing
> > as opposed to ARMv2 style 26-bit addressing (IIRC ARMv3 supported
> > both 26-bit and 32-bit addressing, while ARMv4 through ARMv7 are
> > 32-bit only.
> 
> OK, I understand what it's for, I missed it was arch-specific bit, urgh.
> 
> I'd say this limit should be min of the arch-specific limit vs. the 48-bit
> limit. If you have a 36-bit address space obviously it'd be rather unwise
> to try to provide 48 bit addresses..

In this patch I set the high limit to be the minimum of the provided
high limit and 47 bits so I think that should cover this case?

- Charlie

> 
> >
> >       Arnd
Arnd Bergmann Sept. 10, 2024, 9:13 a.m. UTC | #12
On Mon, Sep 9, 2024, at 23:22, Charlie Jenkins wrote:
> On Fri, Sep 06, 2024 at 10:52:34AM +0100, Lorenzo Stoakes wrote:
>> On Fri, Sep 06, 2024 at 09:14:08AM GMT, Arnd Bergmann wrote:
>> The intent is to optionally be able to run a process that keeps higher bits
>> free for tagging and to be sure no memory mapping in the process will
>> clobber these (correct me if I'm wrong Charlie! :)
>> 
>> So you really wouldn't want this if you are using tagged pointers, you'd
>> want to be sure literally nothing touches the higher bits.

My understanding was that the purpose of the existing design
is to allow applications to ask for a high address without having
to resort to the complexity of MAP_FIXED.

In particular, I'm sure there is precedent for applications that
want both tagged pointers (for most mappings) and untagged pointers
(for large mappings). With a per-mm_struct or per-task_struct
setting you can't do that.

> Various architectures handle the hint address differently, but it
> appears that the only case across any architecture where an address
> above 47 bits will be returned is if the application had a hint address
> with a value greater than 47 bits and was using the MAP_FIXED flag.
> MAP_FIXED bypasses all other checks so I was assuming that it would be
> logical for MAP_FIXED to bypass this as well. If MAP_FIXED is not set,
> then the intent is for no hint address to cause a value greater than 47
> bits to be returned.

I don't think the MAP_FIXED case is that interesting here because
it has to work in both fixed and non-fixed mappings.

>> This would be more consistent vs. other arches.
>
> Yes riscv is an outlier here. The reason I am pushing for something like
> a flag to restrict the address space rather than setting it to be the
> default is it seems like if applications are relying on upper bits to be
> free, then they should be explicitly asking the kernel to keep them free
> rather than assuming them to be free.

Let's see what the other architectures do and then come up with
a way that fixes the pointer tagging case first on those that are
broken. We can see if there needs to be an extra flag after that.
Here is what I found:

- x86_64 uses DEFAULT_MAP_WINDOW of BIT(47), uses a 57 bit
  address space when an addr hint is passed.
- arm64 uses DEFAULT_MAP_WINDOW of BIT(47) or BIT(48), returns
  higher 52-bit addresses when either a hint is passed or
  CONFIG_EXPERT and CONFIG_ARM64_FORCE_52BIT is set (this
  is a debugging option)
- ppc64 uses a DEFAULT_MAP_WINDOW of BIT(47) or BIT(48),
  returns 52 bit address when an addr hint is passed
- riscv uses a DEFAULT_MAP_WINDOW of BIT(47) but only uses
  it for allocating the stack below, ignoring it for normal
  mappings
- s390 has no DEFAULT_MAP_WINDOW but tried to allocate in
  the current number of pgtable levels and only upgrades to
  the next level (31, 42, 53, 64 bits) if a hint is passed or
  the current level is exhausted.
- loongarch64 has no DEFAULT_MAP_WINDOW, and a default VA
  space of 47 bits (16K pages, 3 levels), but can support
  a 55 bit space (64K pages, 3 levels).
- sparc has no DEFAULT_MAP_WINDOW and up to 52 bit VA space.
  It may allocate both positive and negative addresses in
  there. (?)
- mips64, parisc64 and alpha have no DEFAULT_MAP_WINDOW and
  at most 48, 41 or 39 address bits, respectively.

I would suggest these changes:

- make riscv enforce DEFAULT_MAP_WINDOW like x86_64, arm64
   and ppc64, leave it at 47

- add DEFAULT_MAP_WINDOW on loongarch64 (47/48 bits
  based on page size), sparc (48 bits) and s390 (unsure if
  42, 53, 47 or 48 bits)

- leave the rest unchanged.

       Arnd
Christophe Leroy Sept. 10, 2024, 9:20 a.m. UTC | #13
>>> diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h
>>> index 49796b7756af..cd3b8c154d9b 100644
>>> --- a/include/uapi/linux/personality.h
>>> +++ b/include/uapi/linux/personality.h
>>> @@ -22,6 +22,7 @@ enum {
>>>   	WHOLE_SECONDS =		0x2000000,
>>>   	STICKY_TIMEOUTS	=	0x4000000,
>>>   	ADDR_LIMIT_3GB = 	0x8000000,
>>> +	ADDR_LIMIT_47BIT = 	0x10000000,
>>>   };
>>
>> I wonder if ADDR_LIMIT_128T would be clearer?
>>
> 
> I don't follow, what does 128T represent?
> 

128T is 128 Terabytes, that's the maximum size achievable with a 47BIT 
address, that naming would be more consistant with the ADDR_LIMIT_3GB 
just above that means a 3 Gigabytes limit.

Christophe
Geert Uytterhoeven Sept. 10, 2024, 12:43 p.m. UTC | #14
Hi Christophe,

On Tue, Sep 10, 2024 at 11:21 AM Christophe Leroy
<christophe.leroy@csgroup.eu> wrote:
> >>> diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h
> >>> index 49796b7756af..cd3b8c154d9b 100644
> >>> --- a/include/uapi/linux/personality.h
> >>> +++ b/include/uapi/linux/personality.h
> >>> @@ -22,6 +22,7 @@ enum {
> >>>     WHOLE_SECONDS =         0x2000000,
> >>>     STICKY_TIMEOUTS =       0x4000000,
> >>>     ADDR_LIMIT_3GB =        0x8000000,
> >>> +   ADDR_LIMIT_47BIT =      0x10000000,
> >>>   };
> >>
> >> I wonder if ADDR_LIMIT_128T would be clearer?
> >>
> >
> > I don't follow, what does 128T represent?
>
> 128T is 128 Terabytes, that's the maximum size achievable with a 47BIT
> address, that naming would be more consistant with the ADDR_LIMIT_3GB
> just above that means a 3 Gigabytes limit.

Hence ADDR_LIMIT_128TB?

Gr{oetje,eeting}s,

                        Geert
Liam R. Howlett Sept. 10, 2024, 7:08 p.m. UTC | #15
* Catalin Marinas <catalin.marinas@arm.com> [240906 07:44]:
> On Fri, Sep 06, 2024 at 09:55:42AM +0000, Arnd Bergmann wrote:
> > On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
> > > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > >> It's also unclear to me how we want this flag to interact with
> > >> the existing logic in arch_get_mmap_end(), which attempts to
> > >> limit the default mapping to a 47-bit address space already.
> > >
> > > To optimize RISC-V progress, I recommend:
> > >
> > > Step 1: Approve the patch.
> > > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
> > > Step 3: Wait approximately several iterations for Go & OpenJDK
> > > Step 4: Remove the 47-bit constraint in arch_get_mmap_end()
> > 
> > I really want to first see a plausible explanation about why
> > RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW
> > like all the other major architectures (x86, arm64, powerpc64),
> 
> FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default
> configuration. We end up with a 47-bit with 16K pages but for a
> different reason that has to do with LPA2 support (I doubt we need this
> for the user mapping but we need to untangle some of the macros there;
> that's for a separate discussion).
> 
> That said, we haven't encountered any user space problems with a 48-bit
> DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar
> approach (47 or 48 bit default limit). Better to have some ABI
> consistency between architectures. One can still ask for addresses above
> this default limit via mmap().

I think that is best as well.

Can we please just do what x86 and arm64 does?

Thanks,
Liam
Charlie Jenkins Sept. 10, 2024, 11:29 p.m. UTC | #16
On Tue, Sep 10, 2024 at 09:13:33AM +0000, Arnd Bergmann wrote:
> On Mon, Sep 9, 2024, at 23:22, Charlie Jenkins wrote:
> > On Fri, Sep 06, 2024 at 10:52:34AM +0100, Lorenzo Stoakes wrote:
> >> On Fri, Sep 06, 2024 at 09:14:08AM GMT, Arnd Bergmann wrote:
> >> The intent is to optionally be able to run a process that keeps higher bits
> >> free for tagging and to be sure no memory mapping in the process will
> >> clobber these (correct me if I'm wrong Charlie! :)
> >> 
> >> So you really wouldn't want this if you are using tagged pointers, you'd
> >> want to be sure literally nothing touches the higher bits.
> 
> My understanding was that the purpose of the existing design
> is to allow applications to ask for a high address without having
> to resort to the complexity of MAP_FIXED.
> 
> In particular, I'm sure there is precedent for applications that
> want both tagged pointers (for most mappings) and untagged pointers
> (for large mappings). With a per-mm_struct or per-task_struct
> setting you can't do that.
> 
> > Various architectures handle the hint address differently, but it
> > appears that the only case across any architecture where an address
> > above 47 bits will be returned is if the application had a hint address
> > with a value greater than 47 bits and was using the MAP_FIXED flag.
> > MAP_FIXED bypasses all other checks so I was assuming that it would be
> > logical for MAP_FIXED to bypass this as well. If MAP_FIXED is not set,
> > then the intent is for no hint address to cause a value greater than 47
> > bits to be returned.
> 
> I don't think the MAP_FIXED case is that interesting here because
> it has to work in both fixed and non-fixed mappings.
> 
> >> This would be more consistent vs. other arches.
> >
> > Yes riscv is an outlier here. The reason I am pushing for something like
> > a flag to restrict the address space rather than setting it to be the
> > default is it seems like if applications are relying on upper bits to be
> > free, then they should be explicitly asking the kernel to keep them free
> > rather than assuming them to be free.
> 
> Let's see what the other architectures do and then come up with
> a way that fixes the pointer tagging case first on those that are
> broken. We can see if there needs to be an extra flag after that.
> Here is what I found:
> 
> - x86_64 uses DEFAULT_MAP_WINDOW of BIT(47), uses a 57 bit
>   address space when an addr hint is passed.
> - arm64 uses DEFAULT_MAP_WINDOW of BIT(47) or BIT(48), returns
>   higher 52-bit addresses when either a hint is passed or
>   CONFIG_EXPERT and CONFIG_ARM64_FORCE_52BIT is set (this
>   is a debugging option)
> - ppc64 uses a DEFAULT_MAP_WINDOW of BIT(47) or BIT(48),
>   returns 52 bit address when an addr hint is passed
> - riscv uses a DEFAULT_MAP_WINDOW of BIT(47) but only uses
>   it for allocating the stack below, ignoring it for normal
>   mappings
> - s390 has no DEFAULT_MAP_WINDOW but tried to allocate in
>   the current number of pgtable levels and only upgrades to
>   the next level (31, 42, 53, 64 bits) if a hint is passed or
>   the current level is exhausted.
> - loongarch64 has no DEFAULT_MAP_WINDOW, and a default VA
>   space of 47 bits (16K pages, 3 levels), but can support
>   a 55 bit space (64K pages, 3 levels).
> - sparc has no DEFAULT_MAP_WINDOW and up to 52 bit VA space.
>   It may allocate both positive and negative addresses in
>   there. (?)
> - mips64, parisc64 and alpha have no DEFAULT_MAP_WINDOW and
>   at most 48, 41 or 39 address bits, respectively.
> 
> I would suggest these changes:
> 
> - make riscv enforce DEFAULT_MAP_WINDOW like x86_64, arm64
>    and ppc64, leave it at 47
> 
> - add DEFAULT_MAP_WINDOW on loongarch64 (47/48 bits
>   based on page size), sparc (48 bits) and s390 (unsure if
>   42, 53, 47 or 48 bits)
> 
> - leave the rest unchanged.
> 
>        Arnd

Changing all architectures to have a standardized DEFAULT_MAP_WINDOW
mostly solves the problem. However, I am concerned that it is fragile
for applications to rely on a default like this. Having the personality
bit flag is supposed to provide an intuitive ABI for users that
guarantees that they will not accidentally request for memory outside of
the boundary that they specified.

Also you bring up that the DEFAULT_MAP_WINDOW would not be able to be
standardized across architectures, so we still have the problem that
this default behavior will be different across architectures which I am
trying to solve.

- Charlie
Charlie Jenkins Sept. 11, 2024, 12:45 a.m. UTC | #17
On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote:
> * Catalin Marinas <catalin.marinas@arm.com> [240906 07:44]:
> > On Fri, Sep 06, 2024 at 09:55:42AM +0000, Arnd Bergmann wrote:
> > > On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
> > > > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > >> It's also unclear to me how we want this flag to interact with
> > > >> the existing logic in arch_get_mmap_end(), which attempts to
> > > >> limit the default mapping to a 47-bit address space already.
> > > >
> > > > To optimize RISC-V progress, I recommend:
> > > >
> > > > Step 1: Approve the patch.
> > > > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
> > > > Step 3: Wait approximately several iterations for Go & OpenJDK
> > > > Step 4: Remove the 47-bit constraint in arch_get_mmap_end()
> > > 
> > > I really want to first see a plausible explanation about why
> > > RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW
> > > like all the other major architectures (x86, arm64, powerpc64),
> > 
> > FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default
> > configuration. We end up with a 47-bit with 16K pages but for a
> > different reason that has to do with LPA2 support (I doubt we need this
> > for the user mapping but we need to untangle some of the macros there;
> > that's for a separate discussion).
> > 
> > That said, we haven't encountered any user space problems with a 48-bit
> > DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar
> > approach (47 or 48 bit default limit). Better to have some ABI
> > consistency between architectures. One can still ask for addresses above
> > this default limit via mmap().
> 
> I think that is best as well.
> 
> Can we please just do what x86 and arm64 does?
> 
> Thanks,
> Liam

I responded to Arnd in the other thread, but I am still not convinced
that the solution that x86 and arm64 have selected is the best solution.
The solution of defaulting to 47 bits does allow applications the
ability to get addresses that are below 47 bits. However, due to
differences across architectures it doesn't seem possible to have all
architectures default to the same value. Additionally, this flag will be
able to help users avoid potential bugs where a hint address is passed
that causes upper bits of a VA to be used.

The other issue I have with this is that if there is not a hint address
specified to be greater than 47 bits on x86, then mmap() may return an
address that is greater than 47-bits. The documentation in
Documentation/arch/x86/x86_64/5level-paging.rst says:

"If hint address set above 47-bit, but MAP_FIXED is not specified, we try
to look for unmapped area by specified address. If it's already
occupied, we look for unmapped area in *full* address space, rather than
from 47-bit window."

arm64 on the other hand defines this as only being able to opt-into the
52-bit VA space with the hint address, and my understanding is that
mmap() will not fall back to the 52-bit address space. Please correct me
if I am wrong. From Documentation/arch/arm64/memory.rst:

"To maintain compatibility with software that relies on the ARMv8.0
VA space maximum size of 48-bits, the kernel will, by default,
return virtual addresses to userspace from a 48-bit range.

"Software can "opt-in" to receiving VAs from a 52-bit space by
specifying an mmap hint parameter that is larger than 48-bit."

This is an inconsistency I am trying to solve with this personality
flag.

- Charlie
Arnd Bergmann Sept. 11, 2024, 7:25 a.m. UTC | #18
On Wed, Sep 11, 2024, at 00:45, Charlie Jenkins wrote:
> On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote:
>
> I responded to Arnd in the other thread, but I am still not convinced
> that the solution that x86 and arm64 have selected is the best solution.
> The solution of defaulting to 47 bits does allow applications the
> ability to get addresses that are below 47 bits. However, due to
> differences across architectures it doesn't seem possible to have all
> architectures default to the same value. Additionally, this flag will be
> able to help users avoid potential bugs where a hint address is passed
> that causes upper bits of a VA to be used.
>
> The other issue I have with this is that if there is not a hint address
> specified to be greater than 47 bits on x86, then mmap() may return an
> address that is greater than 47-bits. The documentation in
> Documentation/arch/x86/x86_64/5level-paging.rst says:
>
> "If hint address set above 47-bit, but MAP_FIXED is not specified, we try
> to look for unmapped area by specified address. If it's already
> occupied, we look for unmapped area in *full* address space, rather than
> from 47-bit window."

This is also in the commit message of b569bab78d8d ("x86/mm: Prepare
to expose larger address space to userspace"), which introduced it.
However, I don't actually see the fallback to the full address space,
instead the actual behavior seems to be the same as arm64.

Am I missing something in the x86 implementation, or do we just
need to update the documentation?

      Arnd
Michael Ellerman Sept. 11, 2024, 1:37 p.m. UTC | #19
Charlie Jenkins <charlie@rivosinc.com> writes:
> On Fri, Sep 06, 2024 at 04:59:40PM +1000, Michael Ellerman wrote:
>> Charlie Jenkins <charlie@rivosinc.com> writes:
>> > Create a personality flag ADDR_LIMIT_47BIT to support applications
>> > that wish to transition from running in environments that support at
>> > most 47-bit VAs to environments that support larger VAs. This
>> > personality can be set to cause all allocations to be below the 47-bit
>> > boundary. Using MAP_FIXED with mmap() will bypass this restriction.
>> >
>> > Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
>> > ---
>> >  include/uapi/linux/personality.h | 1 +
>> >  mm/mmap.c                        | 3 +++
>> >  2 files changed, 4 insertions(+)
>> >
>> > diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h
>> > index 49796b7756af..cd3b8c154d9b 100644
>> > --- a/include/uapi/linux/personality.h
>> > +++ b/include/uapi/linux/personality.h
>> > @@ -22,6 +22,7 @@ enum {
>> >  	WHOLE_SECONDS =		0x2000000,
>> >  	STICKY_TIMEOUTS	=	0x4000000,
>> >  	ADDR_LIMIT_3GB = 	0x8000000,
>> > +	ADDR_LIMIT_47BIT = 	0x10000000,
>> >  };
>> 
>> I wonder if ADDR_LIMIT_128T would be clearer?
>> 
>
> I don't follow, what does 128T represent?

Sorry, as Christophe explained it's 128 Terabytes, which is the actual
value of the address limit.

I think expressing it as the address value is probably more widely
understood, and would also match ADDR_LIMIT_3GB.

>> Have you looked at writing an update for the personality(2) man page? :)
>
> I will write an update to the man page if this patch is approved!

Yeah fair enough.

My (poorly expressed) point was that trying to describe the flag for the
man page might highlight that using the 47BIT name requires more
explanation.

cheers
Michael Ellerman Sept. 11, 2024, 1:38 p.m. UTC | #20
Geert Uytterhoeven <geert@linux-m68k.org> writes:
> Hi Christophe,
>
> On Tue, Sep 10, 2024 at 11:21 AM Christophe Leroy
> <christophe.leroy@csgroup.eu> wrote:
>> >>> diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h
>> >>> index 49796b7756af..cd3b8c154d9b 100644
>> >>> --- a/include/uapi/linux/personality.h
>> >>> +++ b/include/uapi/linux/personality.h
>> >>> @@ -22,6 +22,7 @@ enum {
>> >>>     WHOLE_SECONDS =         0x2000000,
>> >>>     STICKY_TIMEOUTS =       0x4000000,
>> >>>     ADDR_LIMIT_3GB =        0x8000000,
>> >>> +   ADDR_LIMIT_47BIT =      0x10000000,
>> >>>   };
>> >>
>> >> I wonder if ADDR_LIMIT_128T would be clearer?
>> >>
>> >
>> > I don't follow, what does 128T represent?
>>
>> 128T is 128 Terabytes, that's the maximum size achievable with a 47BIT
>> address, that naming would be more consistant with the ADDR_LIMIT_3GB
>> just above that means a 3 Gigabytes limit.
>
> Hence ADDR_LIMIT_128TB?

Yes it should be 128TB. Typo by me.

cheers
Michael Ellerman Sept. 11, 2024, 1:50 p.m. UTC | #21
"Arnd Bergmann" <arnd@arndb.de> writes:
> On Mon, Sep 9, 2024, at 23:22, Charlie Jenkins wrote:
>> On Fri, Sep 06, 2024 at 10:52:34AM +0100, Lorenzo Stoakes wrote:
>>> On Fri, Sep 06, 2024 at 09:14:08AM GMT, Arnd Bergmann wrote:
>>> The intent is to optionally be able to run a process that keeps higher bits
>>> free for tagging and to be sure no memory mapping in the process will
>>> clobber these (correct me if I'm wrong Charlie! :)
...
> Let's see what the other architectures do and then come up with
> a way that fixes the pointer tagging case first on those that are
> broken. We can see if there needs to be an extra flag after that.
> Here is what I found:
>
> - x86_64 uses DEFAULT_MAP_WINDOW of BIT(47), uses a 57 bit
>   address space when an addr hint is passed.
> - arm64 uses DEFAULT_MAP_WINDOW of BIT(47) or BIT(48), returns
>   higher 52-bit addresses when either a hint is passed or
>   CONFIG_EXPERT and CONFIG_ARM64_FORCE_52BIT is set (this
>   is a debugging option)
> - ppc64 uses a DEFAULT_MAP_WINDOW of BIT(47) or BIT(48),
>   returns 52 bit address when an addr hint is passed
   
It's 46 or 47 depending on PAGE_SIZE (4K or 64K):

  $ git grep "define DEFAULT_MAP_WINDOW_USER64" arch/powerpc/include/asm/task_size_64.h
  arch/powerpc/include/asm/task_size_64.h:#define DEFAULT_MAP_WINDOW_USER64        TASK_SIZE_128TB
  arch/powerpc/include/asm/task_size_64.h:#define DEFAULT_MAP_WINDOW_USER64        TASK_SIZE_64TB

cheers
Catalin Marinas Sept. 11, 2024, 6:21 p.m. UTC | #22
On Tue, Sep 10, 2024 at 05:45:07PM -0700, Charlie Jenkins wrote:
> On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote:
> > * Catalin Marinas <catalin.marinas@arm.com> [240906 07:44]:
> > > On Fri, Sep 06, 2024 at 09:55:42AM +0000, Arnd Bergmann wrote:
> > > > On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
> > > > > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > > >> It's also unclear to me how we want this flag to interact with
> > > > >> the existing logic in arch_get_mmap_end(), which attempts to
> > > > >> limit the default mapping to a 47-bit address space already.
> > > > >
> > > > > To optimize RISC-V progress, I recommend:
> > > > >
> > > > > Step 1: Approve the patch.
> > > > > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
> > > > > Step 3: Wait approximately several iterations for Go & OpenJDK
> > > > > Step 4: Remove the 47-bit constraint in arch_get_mmap_end()

Point 4 is an ABI change. What guarantees that there isn't still
software out there that relies on the old behaviour?

> > > > I really want to first see a plausible explanation about why
> > > > RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW
> > > > like all the other major architectures (x86, arm64, powerpc64),
> > > 
> > > FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default
> > > configuration. We end up with a 47-bit with 16K pages but for a
> > > different reason that has to do with LPA2 support (I doubt we need this
> > > for the user mapping but we need to untangle some of the macros there;
> > > that's for a separate discussion).
> > > 
> > > That said, we haven't encountered any user space problems with a 48-bit
> > > DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar
> > > approach (47 or 48 bit default limit). Better to have some ABI
> > > consistency between architectures. One can still ask for addresses above
> > > this default limit via mmap().
> > 
> > I think that is best as well.
> > 
> > Can we please just do what x86 and arm64 does?
> 
> I responded to Arnd in the other thread, but I am still not convinced
> that the solution that x86 and arm64 have selected is the best solution.
> The solution of defaulting to 47 bits does allow applications the
> ability to get addresses that are below 47 bits. However, due to
> differences across architectures it doesn't seem possible to have all
> architectures default to the same value. Additionally, this flag will be
> able to help users avoid potential bugs where a hint address is passed
> that causes upper bits of a VA to be used.

The reason we added this limit on arm64 is that we noticed programs
using the top 8 bits of a 64-bit pointer for additional information.
IIRC, it wasn't even openJDK but some JavaScript JIT. We could have
taught those programs of a new flag but since we couldn't tell how many
are out there, it was the safest to default to a smaller limit and opt
in to the higher one. Such opt-in is via mmap() but if you prefer a
prctl() flag, that's fine by me as well (though I think this should be
opt-in to higher addresses rather than opt-out of the higher addresses).
Charlie Jenkins Sept. 12, 2024, 6:06 a.m. UTC | #23
On Wed, Sep 11, 2024 at 07:25:08AM +0000, Arnd Bergmann wrote:
> On Wed, Sep 11, 2024, at 00:45, Charlie Jenkins wrote:
> > On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote:
> >
> > I responded to Arnd in the other thread, but I am still not convinced
> > that the solution that x86 and arm64 have selected is the best solution.
> > The solution of defaulting to 47 bits does allow applications the
> > ability to get addresses that are below 47 bits. However, due to
> > differences across architectures it doesn't seem possible to have all
> > architectures default to the same value. Additionally, this flag will be
> > able to help users avoid potential bugs where a hint address is passed
> > that causes upper bits of a VA to be used.
> >
> > The other issue I have with this is that if there is not a hint address
> > specified to be greater than 47 bits on x86, then mmap() may return an
> > address that is greater than 47-bits. The documentation in
> > Documentation/arch/x86/x86_64/5level-paging.rst says:
> >
> > "If hint address set above 47-bit, but MAP_FIXED is not specified, we try
> > to look for unmapped area by specified address. If it's already
> > occupied, we look for unmapped area in *full* address space, rather than
> > from 47-bit window."
> 
> This is also in the commit message of b569bab78d8d ("x86/mm: Prepare
> to expose larger address space to userspace"), which introduced it.
> However, I don't actually see the fallback to the full address space,
> instead the actual behavior seems to be the same as arm64.
> 
> Am I missing something in the x86 implementation, or do we just
> need to update the documentation?
> 
>       Arnd

Yeah I guess it is incorrect documentation then? It seems more
reasonable to me to have a hint address fall back onto the larger
address space because otherwise the "hint" address can cause allocations
to fail even if there is space above the 47-bit limit. This is another
reason I wanted to avoid having this default behavior on riscv, to not
have this abuse of the hint address.

- Charlie
Charlie Jenkins Sept. 12, 2024, 6:18 a.m. UTC | #24
On Wed, Sep 11, 2024 at 07:21:27PM +0100, Catalin Marinas wrote:
> On Tue, Sep 10, 2024 at 05:45:07PM -0700, Charlie Jenkins wrote:
> > On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote:
> > > * Catalin Marinas <catalin.marinas@arm.com> [240906 07:44]:
> > > > On Fri, Sep 06, 2024 at 09:55:42AM +0000, Arnd Bergmann wrote:
> > > > > On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
> > > > > > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > > > >> It's also unclear to me how we want this flag to interact with
> > > > > >> the existing logic in arch_get_mmap_end(), which attempts to
> > > > > >> limit the default mapping to a 47-bit address space already.
> > > > > >
> > > > > > To optimize RISC-V progress, I recommend:
> > > > > >
> > > > > > Step 1: Approve the patch.
> > > > > > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
> > > > > > Step 3: Wait approximately several iterations for Go & OpenJDK
> > > > > > Step 4: Remove the 47-bit constraint in arch_get_mmap_end()
> 
> Point 4 is an ABI change. What guarantees that there isn't still
> software out there that relies on the old behaviour?

Yeah I don't think it would be desirable to remove the 47 bit
constraint in architectures that already have it.

> 
> > > > > I really want to first see a plausible explanation about why
> > > > > RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW
> > > > > like all the other major architectures (x86, arm64, powerpc64),
> > > > 
> > > > FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default
> > > > configuration. We end up with a 47-bit with 16K pages but for a
> > > > different reason that has to do with LPA2 support (I doubt we need this
> > > > for the user mapping but we need to untangle some of the macros there;
> > > > that's for a separate discussion).
> > > > 
> > > > That said, we haven't encountered any user space problems with a 48-bit
> > > > DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar
> > > > approach (47 or 48 bit default limit). Better to have some ABI
> > > > consistency between architectures. One can still ask for addresses above
> > > > this default limit via mmap().
> > > 
> > > I think that is best as well.
> > > 
> > > Can we please just do what x86 and arm64 does?
> > 
> > I responded to Arnd in the other thread, but I am still not convinced
> > that the solution that x86 and arm64 have selected is the best solution.
> > The solution of defaulting to 47 bits does allow applications the
> > ability to get addresses that are below 47 bits. However, due to
> > differences across architectures it doesn't seem possible to have all
> > architectures default to the same value. Additionally, this flag will be
> > able to help users avoid potential bugs where a hint address is passed
> > that causes upper bits of a VA to be used.
> 
> The reason we added this limit on arm64 is that we noticed programs
> using the top 8 bits of a 64-bit pointer for additional information.
> IIRC, it wasn't even openJDK but some JavaScript JIT. We could have
> taught those programs of a new flag but since we couldn't tell how many
> are out there, it was the safest to default to a smaller limit and opt
> in to the higher one. Such opt-in is via mmap() but if you prefer a
> prctl() flag, that's fine by me as well (though I think this should be
> opt-in to higher addresses rather than opt-out of the higher addresses).

The mmap() flag was used in previous versions but was decided against
because this feature is more useful if it is process-wide. A
personality() flag was chosen instead of a prctl() flag because there
existed other flags in personality() that were similar. I am tempted to
use prctl() however because then we could have an additional arg to
select the exact number of bits that should be reserved (rather than
being fixed at 47 bits).

Opting-in to the higher address space is reasonable. However, it is not
my preference, because the purpose of this flag is to ensure that
allocations do not exceed 47-bits, so it is a clearer ABI to have the
applications that want this guarantee to be the ones setting the flag,
rather than the applications that want the higher bits setting the flag.

- Charlie

> 
> -- 
> Catalin
Charlie Jenkins Sept. 12, 2024, 6:20 a.m. UTC | #25
On Wed, Sep 11, 2024 at 11:38:55PM +1000, Michael Ellerman wrote:
> Geert Uytterhoeven <geert@linux-m68k.org> writes:
> > Hi Christophe,
> >
> > On Tue, Sep 10, 2024 at 11:21 AM Christophe Leroy
> > <christophe.leroy@csgroup.eu> wrote:
> >> >>> diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h
> >> >>> index 49796b7756af..cd3b8c154d9b 100644
> >> >>> --- a/include/uapi/linux/personality.h
> >> >>> +++ b/include/uapi/linux/personality.h
> >> >>> @@ -22,6 +22,7 @@ enum {
> >> >>>     WHOLE_SECONDS =         0x2000000,
> >> >>>     STICKY_TIMEOUTS =       0x4000000,
> >> >>>     ADDR_LIMIT_3GB =        0x8000000,
> >> >>> +   ADDR_LIMIT_47BIT =      0x10000000,
> >> >>>   };
> >> >>
> >> >> I wonder if ADDR_LIMIT_128T would be clearer?
> >> >>
> >> >
> >> > I don't follow, what does 128T represent?
> >>
> >> 128T is 128 Terabytes, that's the maximum size achievable with a 47BIT
> >> address, that naming would be more consistant with the ADDR_LIMIT_3GB
> >> just above that means a 3 Gigabytes limit.
> >
> > Hence ADDR_LIMIT_128TB?
> 
> Yes it should be 128TB. Typo by me.
> 
> cheers

47BIT was selected because the usecase for this flag is for applications
that want to store data in the upper bits of a virtual address space. In
this case, how large the virtual address space is irrelevant, and only
the number of bits that are being used, and hence the number of bits
that are free.

- Charlie
Catalin Marinas Sept. 12, 2024, 10:53 a.m. UTC | #26
On Wed, Sep 11, 2024 at 11:18:12PM -0700, Charlie Jenkins wrote:
> Opting-in to the higher address space is reasonable. However, it is not
> my preference, because the purpose of this flag is to ensure that
> allocations do not exceed 47-bits, so it is a clearer ABI to have the
> applications that want this guarantee to be the ones setting the flag,
> rather than the applications that want the higher bits setting the flag.

Yes, this would be ideal. Unfortunately those applications don't know
they need to set a flag in order to work.

A slightly better option is to leave the default 47-bit at the kernel
ABI level and have the libc/dynamic loader issue the prctl(). You can
control the default with environment variables if needed.

We do something similar in glibc for arm64 MTE. When MTE is enabled, the
top byte of an allocated pointer contains the tag that must not be
corrupted. We left the decision to the C library via the
glibc.mem.tagging tunable (Android has something similar via the app
manifest). An app can change the default if it wants but if you run with
old glibc or no environment variable to say otherwise, the default would
be safe. Distros can set the environment to be the maximum range by
default if they know the apps included have been upgraded and tested.
Charlie Jenkins Sept. 12, 2024, 9:15 p.m. UTC | #27
On Thu, Sep 12, 2024 at 11:53:49AM +0100, Catalin Marinas wrote:
> On Wed, Sep 11, 2024 at 11:18:12PM -0700, Charlie Jenkins wrote:
> > Opting-in to the higher address space is reasonable. However, it is not
> > my preference, because the purpose of this flag is to ensure that
> > allocations do not exceed 47-bits, so it is a clearer ABI to have the
> > applications that want this guarantee to be the ones setting the flag,
> > rather than the applications that want the higher bits setting the flag.
> 
> Yes, this would be ideal. Unfortunately those applications don't know
> they need to set a flag in order to work.

It's not a regression, the applications never worked (on platforms that
do not have this default). The 47-bit default would allow applications
that didn't work to start working at the cost of a non-ideal ABI. That
doesn't seem like a reasonable tradeoff to me.  If applications want to
run on new hardware that has different requirements, shouldn't they be
required to update rather than expect the kernel will solve their
problems for them?

> 
> A slightly better option is to leave the default 47-bit at the kernel
> ABI level and have the libc/dynamic loader issue the prctl(). You can
> control the default with environment variables if needed.

Having glibc set the 47-bit requirement could make it slightly easier
for applications since they would only have to set the environment
variable. After the kernel interface is approved I can look into
supporting that.

- Charlie

> 
> We do something similar in glibc for arm64 MTE. When MTE is enabled, the
> top byte of an allocated pointer contains the tag that must not be
> corrupted. We left the decision to the C library via the
> glibc.mem.tagging tunable (Android has something similar via the app
> manifest). An app can change the default if it wants but if you run with
> old glibc or no environment variable to say otherwise, the default would
> be safe. Distros can set the environment to be the maximum range by
> default if they know the apps included have been upgraded and tested.
> 
> -- 
> Catalin
Lorenzo Stoakes Sept. 13, 2024, 7:41 a.m. UTC | #28
On Wed, Sep 11, 2024 at 11:18:12PM GMT, Charlie Jenkins wrote:
> On Wed, Sep 11, 2024 at 07:21:27PM +0100, Catalin Marinas wrote:
> > On Tue, Sep 10, 2024 at 05:45:07PM -0700, Charlie Jenkins wrote:
> > > On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote:
> > > > * Catalin Marinas <catalin.marinas@arm.com> [240906 07:44]:
> > > > > On Fri, Sep 06, 2024 at 09:55:42AM +0000, Arnd Bergmann wrote:
> > > > > > On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
> > > > > > > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > > > > >> It's also unclear to me how we want this flag to interact with
> > > > > > >> the existing logic in arch_get_mmap_end(), which attempts to
> > > > > > >> limit the default mapping to a 47-bit address space already.
> > > > > > >
> > > > > > > To optimize RISC-V progress, I recommend:
> > > > > > >
> > > > > > > Step 1: Approve the patch.
> > > > > > > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
> > > > > > > Step 3: Wait approximately several iterations for Go & OpenJDK
> > > > > > > Step 4: Remove the 47-bit constraint in arch_get_mmap_end()
> >
> > Point 4 is an ABI change. What guarantees that there isn't still
> > software out there that relies on the old behaviour?
>
> Yeah I don't think it would be desirable to remove the 47 bit
> constraint in architectures that already have it.
>
> >
> > > > > > I really want to first see a plausible explanation about why
> > > > > > RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW
> > > > > > like all the other major architectures (x86, arm64, powerpc64),
> > > > >
> > > > > FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default
> > > > > configuration. We end up with a 47-bit with 16K pages but for a
> > > > > different reason that has to do with LPA2 support (I doubt we need this
> > > > > for the user mapping but we need to untangle some of the macros there;
> > > > > that's for a separate discussion).
> > > > >
> > > > > That said, we haven't encountered any user space problems with a 48-bit
> > > > > DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar
> > > > > approach (47 or 48 bit default limit). Better to have some ABI
> > > > > consistency between architectures. One can still ask for addresses above
> > > > > this default limit via mmap().
> > > >
> > > > I think that is best as well.
> > > >
> > > > Can we please just do what x86 and arm64 does?
> > >
> > > I responded to Arnd in the other thread, but I am still not convinced
> > > that the solution that x86 and arm64 have selected is the best solution.
> > > The solution of defaulting to 47 bits does allow applications the
> > > ability to get addresses that are below 47 bits. However, due to
> > > differences across architectures it doesn't seem possible to have all
> > > architectures default to the same value. Additionally, this flag will be
> > > able to help users avoid potential bugs where a hint address is passed
> > > that causes upper bits of a VA to be used.
> >
> > The reason we added this limit on arm64 is that we noticed programs
> > using the top 8 bits of a 64-bit pointer for additional information.
> > IIRC, it wasn't even openJDK but some JavaScript JIT. We could have
> > taught those programs of a new flag but since we couldn't tell how many
> > are out there, it was the safest to default to a smaller limit and opt
> > in to the higher one. Such opt-in is via mmap() but if you prefer a
> > prctl() flag, that's fine by me as well (though I think this should be
> > opt-in to higher addresses rather than opt-out of the higher addresses).
>
> The mmap() flag was used in previous versions but was decided against
> because this feature is more useful if it is process-wide. A
> personality() flag was chosen instead of a prctl() flag because there
> existed other flags in personality() that were similar. I am tempted to
> use prctl() however because then we could have an additional arg to
> select the exact number of bits that should be reserved (rather than
> being fixed at 47 bits).

I am very much not in favour of a prctl(), it would require us to add state
limiting the address space and the timing of it becomes critical. Then we
have the same issue we do with the other proposals as to - what happens if
this is too low?

What is 'too low' varies by architecture, and for 32-bit architectures
could get quite... problematic.

And again, wha is the RoI here - we introducing maintenance burden and edge
cases vs. the x86 solution in order to... accommodate things that need more
than 128 TiB of address space? A problem that does not appear to exist in
reality?

I suggested the personality approach as the least impactful compromise way
of this series working, but I think after what Arnd has said (and please
forgive me if I've missed further discussion have been dipping in and out
of this!) - adapting risc v to the approach we take elsewhere seems the
most sensible solution to me.

This remains something we can revisit in future if this turns out to be
egregious.

>
> Opting-in to the higher address space is reasonable. However, it is not
> my preference, because the purpose of this flag is to ensure that
> allocations do not exceed 47-bits, so it is a clearer ABI to have the
> applications that want this guarantee to be the ones setting the flag,
> rather than the applications that want the higher bits setting the flag.

Perfect is the enemy of the good :) and an idealised solution may not end
up being something everybody can agree on.

>
> - Charlie
>
> >
> > --
> > Catalin
>
>
>
Catalin Marinas Sept. 13, 2024, 10:08 a.m. UTC | #29
On Thu, Sep 12, 2024 at 02:15:59PM -0700, Charlie Jenkins wrote:
> On Thu, Sep 12, 2024 at 11:53:49AM +0100, Catalin Marinas wrote:
> > On Wed, Sep 11, 2024 at 11:18:12PM -0700, Charlie Jenkins wrote:
> > > Opting-in to the higher address space is reasonable. However, it is not
> > > my preference, because the purpose of this flag is to ensure that
> > > allocations do not exceed 47-bits, so it is a clearer ABI to have the
> > > applications that want this guarantee to be the ones setting the flag,
> > > rather than the applications that want the higher bits setting the flag.
> > 
> > Yes, this would be ideal. Unfortunately those applications don't know
> > they need to set a flag in order to work.
> 
> It's not a regression, the applications never worked (on platforms that
> do not have this default). The 47-bit default would allow applications
> that didn't work to start working at the cost of a non-ideal ABI. That
> doesn't seem like a reasonable tradeoff to me.  If applications want to
> run on new hardware that has different requirements, shouldn't they be
> required to update rather than expect the kernel will solve their
> problems for them?

That's a valid point but it depends on the application and how much you
want to spend updating user-space. OpenJDK is fine, if you need a JIT
you'll have to add support for that architecture anyway. But others are
arch-agnostic, you just recompile to your target. It's not an ABI
problem, more of an API one.

The x86 case (and powerpc/arm64) was different, the 47-bit worked for a
long time before expanding it. So it made a lot of sense to keep the
same default.

Anyway, the prctl() can go both ways, either expanding or limiting the
default address space. So I'd be fine with such interface.
Catalin Marinas Sept. 13, 2024, 10:21 a.m. UTC | #30
On Fri, Sep 13, 2024 at 11:08:23AM +0100, Catalin Marinas wrote:
> On Thu, Sep 12, 2024 at 02:15:59PM -0700, Charlie Jenkins wrote:
> > On Thu, Sep 12, 2024 at 11:53:49AM +0100, Catalin Marinas wrote:
> > > On Wed, Sep 11, 2024 at 11:18:12PM -0700, Charlie Jenkins wrote:
> > > > Opting-in to the higher address space is reasonable. However, it is not
> > > > my preference, because the purpose of this flag is to ensure that
> > > > allocations do not exceed 47-bits, so it is a clearer ABI to have the
> > > > applications that want this guarantee to be the ones setting the flag,
> > > > rather than the applications that want the higher bits setting the flag.
[...]
> Anyway, the prctl() can go both ways, either expanding or limiting the
> default address space. So I'd be fine with such interface.

Ah, I just realised (while reading Lorenzo's reply) that we can't really
restrict the space via a prctl() as we have the main thread stack
already allocated by the kernel before the user code starts. You may
need to limit this stack as well, not just the later heap allocations
(anonymous mmap()).
Charlie Jenkins Sept. 13, 2024, 8:15 p.m. UTC | #31
On Fri, Sep 13, 2024 at 11:08:23AM +0100, Catalin Marinas wrote:
> On Thu, Sep 12, 2024 at 02:15:59PM -0700, Charlie Jenkins wrote:
> > On Thu, Sep 12, 2024 at 11:53:49AM +0100, Catalin Marinas wrote:
> > > On Wed, Sep 11, 2024 at 11:18:12PM -0700, Charlie Jenkins wrote:
> > > > Opting-in to the higher address space is reasonable. However, it is not
> > > > my preference, because the purpose of this flag is to ensure that
> > > > allocations do not exceed 47-bits, so it is a clearer ABI to have the
> > > > applications that want this guarantee to be the ones setting the flag,
> > > > rather than the applications that want the higher bits setting the flag.
> > > 
> > > Yes, this would be ideal. Unfortunately those applications don't know
> > > they need to set a flag in order to work.
> > 
> > It's not a regression, the applications never worked (on platforms that
> > do not have this default). The 47-bit default would allow applications
> > that didn't work to start working at the cost of a non-ideal ABI. That
> > doesn't seem like a reasonable tradeoff to me.  If applications want to
> > run on new hardware that has different requirements, shouldn't they be
> > required to update rather than expect the kernel will solve their
> > problems for them?
> 
> That's a valid point but it depends on the application and how much you
> want to spend updating user-space. OpenJDK is fine, if you need a JIT
> you'll have to add support for that architecture anyway. But others are
> arch-agnostic, you just recompile to your target. It's not an ABI
> problem, more of an API one.

The arch-agnosticism is my hope with this personality flag, it can be
added arch-agnostic userspace code and allow the application to work
everywhere, but it does have the downside of requiring that change to
user-space code.

> 
> The x86 case (and powerpc/arm64) was different, the 47-bit worked for a
> long time before expanding it. So it made a lot of sense to keep the
> same default.

Yes it is very reasonable that this solution was selected for those
architectures since the support for higher address spaces evolved in the
manner that it did!

- Charlie

> 
> Anyway, the prctl() can go both ways, either expanding or limiting the
> default address space. So I'd be fine with such interface.
> 
> -- 
> Catalin
Charlie Jenkins Sept. 13, 2024, 9:04 p.m. UTC | #32
On Fri, Sep 13, 2024 at 08:41:34AM +0100, Lorenzo Stoakes wrote:
> On Wed, Sep 11, 2024 at 11:18:12PM GMT, Charlie Jenkins wrote:
> > On Wed, Sep 11, 2024 at 07:21:27PM +0100, Catalin Marinas wrote:
> > > On Tue, Sep 10, 2024 at 05:45:07PM -0700, Charlie Jenkins wrote:
> > > > On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote:
> > > > > * Catalin Marinas <catalin.marinas@arm.com> [240906 07:44]:
> > > > > > On Fri, Sep 06, 2024 at 09:55:42AM +0000, Arnd Bergmann wrote:
> > > > > > > On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
> > > > > > > > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > > > > > >> It's also unclear to me how we want this flag to interact with
> > > > > > > >> the existing logic in arch_get_mmap_end(), which attempts to
> > > > > > > >> limit the default mapping to a 47-bit address space already.
> > > > > > > >
> > > > > > > > To optimize RISC-V progress, I recommend:
> > > > > > > >
> > > > > > > > Step 1: Approve the patch.
> > > > > > > > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
> > > > > > > > Step 3: Wait approximately several iterations for Go & OpenJDK
> > > > > > > > Step 4: Remove the 47-bit constraint in arch_get_mmap_end()
> > >
> > > Point 4 is an ABI change. What guarantees that there isn't still
> > > software out there that relies on the old behaviour?
> >
> > Yeah I don't think it would be desirable to remove the 47 bit
> > constraint in architectures that already have it.
> >
> > >
> > > > > > > I really want to first see a plausible explanation about why
> > > > > > > RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW
> > > > > > > like all the other major architectures (x86, arm64, powerpc64),
> > > > > >
> > > > > > FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default
> > > > > > configuration. We end up with a 47-bit with 16K pages but for a
> > > > > > different reason that has to do with LPA2 support (I doubt we need this
> > > > > > for the user mapping but we need to untangle some of the macros there;
> > > > > > that's for a separate discussion).
> > > > > >
> > > > > > That said, we haven't encountered any user space problems with a 48-bit
> > > > > > DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar
> > > > > > approach (47 or 48 bit default limit). Better to have some ABI
> > > > > > consistency between architectures. One can still ask for addresses above
> > > > > > this default limit via mmap().
> > > > >
> > > > > I think that is best as well.
> > > > >
> > > > > Can we please just do what x86 and arm64 does?
> > > >
> > > > I responded to Arnd in the other thread, but I am still not convinced
> > > > that the solution that x86 and arm64 have selected is the best solution.
> > > > The solution of defaulting to 47 bits does allow applications the
> > > > ability to get addresses that are below 47 bits. However, due to
> > > > differences across architectures it doesn't seem possible to have all
> > > > architectures default to the same value. Additionally, this flag will be
> > > > able to help users avoid potential bugs where a hint address is passed
> > > > that causes upper bits of a VA to be used.
> > >
> > > The reason we added this limit on arm64 is that we noticed programs
> > > using the top 8 bits of a 64-bit pointer for additional information.
> > > IIRC, it wasn't even openJDK but some JavaScript JIT. We could have
> > > taught those programs of a new flag but since we couldn't tell how many
> > > are out there, it was the safest to default to a smaller limit and opt
> > > in to the higher one. Such opt-in is via mmap() but if you prefer a
> > > prctl() flag, that's fine by me as well (though I think this should be
> > > opt-in to higher addresses rather than opt-out of the higher addresses).
> >
> > The mmap() flag was used in previous versions but was decided against
> > because this feature is more useful if it is process-wide. A
> > personality() flag was chosen instead of a prctl() flag because there
> > existed other flags in personality() that were similar. I am tempted to
> > use prctl() however because then we could have an additional arg to
> > select the exact number of bits that should be reserved (rather than
> > being fixed at 47 bits).
> 
> I am very much not in favour of a prctl(), it would require us to add state
> limiting the address space and the timing of it becomes critical. Then we
> have the same issue we do with the other proposals as to - what happens if
> this is too low?
> 
> What is 'too low' varies by architecture, and for 32-bit architectures
> could get quite... problematic.
> 
> And again, wha is the RoI here - we introducing maintenance burden and edge
> cases vs. the x86 solution in order to... accommodate things that need more
> than 128 TiB of address space? A problem that does not appear to exist in
> reality?
> 
> I suggested the personality approach as the least impactful compromise way
> of this series working, but I think after what Arnd has said (and please
> forgive me if I've missed further discussion have been dipping in and out
> of this!) - adapting risc v to the approach we take elsewhere seems the
> most sensible solution to me.
>
> This remains something we can revisit in future if this turns out to be
> egregious.
>

I appreciate Arnd's comments, but I do not think that making 47-bit the
default is the best solution for riscv. On riscv, support for 48-bit
address spaces was merged in 5.17 and support for 57-bit address spaces
was merged in 5.18 without changing the default addresses provided by
mmap(). It could be argued that this was a mistake, however since at the
time there didn't exist hardware with larger address spaces it wasn't an
issue. The applications that existed at the time that relied on the
smaller address spaces have not been able to move to larger address
spaces. Making a 47-bit user-space address space default solves the
problem, but that is not arch agnostic, and can't be since of the
varying differences in page table sizes across architectures, which is
the other part of the problem I am trying to solve.

> >
> > Opting-in to the higher address space is reasonable. However, it is not
> > my preference, because the purpose of this flag is to ensure that
> > allocations do not exceed 47-bits, so it is a clearer ABI to have the
> > applications that want this guarantee to be the ones setting the flag,
> > rather than the applications that want the higher bits setting the flag.
> 
> Perfect is the enemy of the good :) and an idealised solution may not end
> up being something everybody can agree on.

Yes you are totally right! Although this is not my ideal solution, it
sufficiently accomplishes the goal so I think it is reasonable to
implement this as a personality flag.

> 
> >
> > - Charlie
> >
> > >
> > > --
> > > Catalin
> >
> >
> >
Michael Ellerman Sept. 20, 2024, 5:10 a.m. UTC | #33
Charlie Jenkins <charlie@rivosinc.com> writes:
> On Wed, Sep 11, 2024 at 11:38:55PM +1000, Michael Ellerman wrote:
>> Geert Uytterhoeven <geert@linux-m68k.org> writes:
>> > Hi Christophe,
>> >
>> > On Tue, Sep 10, 2024 at 11:21 AM Christophe Leroy
>> > <christophe.leroy@csgroup.eu> wrote:
>> >> >>> diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h
>> >> >>> index 49796b7756af..cd3b8c154d9b 100644
>> >> >>> --- a/include/uapi/linux/personality.h
>> >> >>> +++ b/include/uapi/linux/personality.h
>> >> >>> @@ -22,6 +22,7 @@ enum {
>> >> >>>     WHOLE_SECONDS =         0x2000000,
>> >> >>>     STICKY_TIMEOUTS =       0x4000000,
>> >> >>>     ADDR_LIMIT_3GB =        0x8000000,
>> >> >>> +   ADDR_LIMIT_47BIT =      0x10000000,
>> >> >>>   };
>> >> >>
>> >> >> I wonder if ADDR_LIMIT_128T would be clearer?
>> >> >>
>> >> >
>> >> > I don't follow, what does 128T represent?
>> >>
>> >> 128T is 128 Terabytes, that's the maximum size achievable with a 47BIT
>> >> address, that naming would be more consistant with the ADDR_LIMIT_3GB
>> >> just above that means a 3 Gigabytes limit.
>> >
>> > Hence ADDR_LIMIT_128TB?
>> 
>> Yes it should be 128TB. Typo by me.
>
> 47BIT was selected because the usecase for this flag is for applications
> that want to store data in the upper bits of a virtual address space. In
> this case, how large the virtual address space is irrelevant, and only
> the number of bits that are being used, and hence the number of bits
> that are free.

Yeah I understand that's how you came to the problem.

But for the user API I think using the size of the address space is
clearer, easier to explain, and matches the existing ADDR_LIMIT_3GB.

cheers
Palmer Dabbelt Oct. 2, 2024, 2:26 p.m. UTC | #34
On Fri, 13 Sep 2024 14:04:06 PDT (-0700), Charlie Jenkins wrote:
> On Fri, Sep 13, 2024 at 08:41:34AM +0100, Lorenzo Stoakes wrote:
>> On Wed, Sep 11, 2024 at 11:18:12PM GMT, Charlie Jenkins wrote:
>> > On Wed, Sep 11, 2024 at 07:21:27PM +0100, Catalin Marinas wrote:
>> > > On Tue, Sep 10, 2024 at 05:45:07PM -0700, Charlie Jenkins wrote:
>> > > > On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote:
>> > > > > * Catalin Marinas <catalin.marinas@arm.com> [240906 07:44]:
>> > > > > > On Fri, Sep 06, 2024 at 09:55:42AM +0000, Arnd Bergmann wrote:
>> > > > > > > On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
>> > > > > > > > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann <arnd@arndb.de> wrote:
>> > > > > > > >> It's also unclear to me how we want this flag to interact with
>> > > > > > > >> the existing logic in arch_get_mmap_end(), which attempts to
>> > > > > > > >> limit the default mapping to a 47-bit address space already.
>> > > > > > > >
>> > > > > > > > To optimize RISC-V progress, I recommend:
>> > > > > > > >
>> > > > > > > > Step 1: Approve the patch.
>> > > > > > > > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
>> > > > > > > > Step 3: Wait approximately several iterations for Go & OpenJDK
>> > > > > > > > Step 4: Remove the 47-bit constraint in arch_get_mmap_end()
>> > >
>> > > Point 4 is an ABI change. What guarantees that there isn't still
>> > > software out there that relies on the old behaviour?
>> >
>> > Yeah I don't think it would be desirable to remove the 47 bit
>> > constraint in architectures that already have it.
>> >
>> > >
>> > > > > > > I really want to first see a plausible explanation about why
>> > > > > > > RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW
>> > > > > > > like all the other major architectures (x86, arm64, powerpc64),
>> > > > > >
>> > > > > > FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default
>> > > > > > configuration. We end up with a 47-bit with 16K pages but for a
>> > > > > > different reason that has to do with LPA2 support (I doubt we need this
>> > > > > > for the user mapping but we need to untangle some of the macros there;
>> > > > > > that's for a separate discussion).
>> > > > > >
>> > > > > > That said, we haven't encountered any user space problems with a 48-bit
>> > > > > > DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar
>> > > > > > approach (47 or 48 bit default limit). Better to have some ABI
>> > > > > > consistency between architectures. One can still ask for addresses above
>> > > > > > this default limit via mmap().
>> > > > >
>> > > > > I think that is best as well.
>> > > > >
>> > > > > Can we please just do what x86 and arm64 does?
>> > > >
>> > > > I responded to Arnd in the other thread, but I am still not convinced
>> > > > that the solution that x86 and arm64 have selected is the best solution.
>> > > > The solution of defaulting to 47 bits does allow applications the
>> > > > ability to get addresses that are below 47 bits. However, due to
>> > > > differences across architectures it doesn't seem possible to have all
>> > > > architectures default to the same value. Additionally, this flag will be
>> > > > able to help users avoid potential bugs where a hint address is passed
>> > > > that causes upper bits of a VA to be used.
>> > >
>> > > The reason we added this limit on arm64 is that we noticed programs
>> > > using the top 8 bits of a 64-bit pointer for additional information.
>> > > IIRC, it wasn't even openJDK but some JavaScript JIT. We could have
>> > > taught those programs of a new flag but since we couldn't tell how many
>> > > are out there, it was the safest to default to a smaller limit and opt
>> > > in to the higher one. Such opt-in is via mmap() but if you prefer a
>> > > prctl() flag, that's fine by me as well (though I think this should be
>> > > opt-in to higher addresses rather than opt-out of the higher addresses).
>> >
>> > The mmap() flag was used in previous versions but was decided against
>> > because this feature is more useful if it is process-wide. A
>> > personality() flag was chosen instead of a prctl() flag because there
>> > existed other flags in personality() that were similar. I am tempted to
>> > use prctl() however because then we could have an additional arg to
>> > select the exact number of bits that should be reserved (rather than
>> > being fixed at 47 bits).
>>
>> I am very much not in favour of a prctl(), it would require us to add state
>> limiting the address space and the timing of it becomes critical. Then we
>> have the same issue we do with the other proposals as to - what happens if
>> this is too low?
>>
>> What is 'too low' varies by architecture, and for 32-bit architectures
>> could get quite... problematic.
>>
>> And again, wha is the RoI here - we introducing maintenance burden and edge
>> cases vs. the x86 solution in order to... accommodate things that need more
>> than 128 TiB of address space? A problem that does not appear to exist in
>> reality?
>>
>> I suggested the personality approach as the least impactful compromise way
>> of this series working, but I think after what Arnd has said (and please
>> forgive me if I've missed further discussion have been dipping in and out
>> of this!) - adapting risc v to the approach we take elsewhere seems the
>> most sensible solution to me.

There's one wrinkle here: RISC-V started out with 39-bit VAs by default, 
and we've had at least one report of userspace breaking when moving to 
48-bit addresses.  That was just address sanitizer, so maybe nobody 
cares, but we're still pretty early in the transition to 48-bit systems 
(most of the HW is still 39-bit) so it's not clear if that's going to be 
the only bug.

So we're sort of in our own world of backwards compatibility here.  
39-bit vs 48-bit is just an arbitrary number, but "38 bits are enough 
for userspace" doesn't seem as sane a "47 bits are enough for 
userspace".  Maybe the right answer here is to just say the 38-bit 
userspace is broken and that it's a Linux-ism that 64-bit sytems have 
47-bit user addresses by default.

>> This remains something we can revisit in future if this turns out to be
>> egregious.
>>
>
> I appreciate Arnd's comments, but I do not think that making 47-bit the
> default is the best solution for riscv. On riscv, support for 48-bit
> address spaces was merged in 5.17 and support for 57-bit address spaces
> was merged in 5.18 without changing the default addresses provided by
> mmap(). It could be argued that this was a mistake, however since at the
> time there didn't exist hardware with larger address spaces it wasn't an
> issue. The applications that existed at the time that relied on the
> smaller address spaces have not been able to move to larger address
> spaces. Making a 47-bit user-space address space default solves the
> problem, but that is not arch agnostic, and can't be since of the
> varying differences in page table sizes across architectures, which is
> the other part of the problem I am trying to solve.
>
>> >
>> > Opting-in to the higher address space is reasonable. However, it is not
>> > my preference, because the purpose of this flag is to ensure that
>> > allocations do not exceed 47-bits, so it is a clearer ABI to have the
>> > applications that want this guarantee to be the ones setting the flag,
>> > rather than the applications that want the higher bits setting the flag.
>>
>> Perfect is the enemy of the good :) and an idealised solution may not end
>> up being something everybody can agree on.
>
> Yes you are totally right! Although this is not my ideal solution, it
> sufficiently accomplishes the goal so I think it is reasonable to
> implement this as a personality flag.
>
>>
>> >
>> > - Charlie
>> >
>> > >
>> > > --
>> > > Catalin
>> >
>> >
>> >
diff mbox series

Patch

diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h
index 49796b7756af..cd3b8c154d9b 100644
--- a/include/uapi/linux/personality.h
+++ b/include/uapi/linux/personality.h
@@ -22,6 +22,7 @@  enum {
 	WHOLE_SECONDS =		0x2000000,
 	STICKY_TIMEOUTS	=	0x4000000,
 	ADDR_LIMIT_3GB = 	0x8000000,
+	ADDR_LIMIT_47BIT = 	0x10000000,
 };
 
 /*
diff --git a/mm/mmap.c b/mm/mmap.c
index d0dfc85b209b..a5c7544853e5 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1766,6 +1766,9 @@  unsigned long vm_unmapped_area(struct vm_unmapped_area_info *info)
 {
 	unsigned long addr;
 
+	if (current->personality & ADDR_LIMIT_47BIT)
+		info->high_limit = MIN(info->high_limit, BIT(47) - 1);
+
 	if (info->flags & VM_UNMAPPED_AREA_TOPDOWN)
 		addr = unmapped_area_topdown(info);
 	else