arm64: Expose TASK_SIZE to userspace via auxv

Message ID 20160816183231.21179-1-cov@codeaurora.org (mailing list archive)
State New, archived

Commit Message

Christopher Covington Aug. 16, 2016, 6:32 p.m. UTC
Some userspace applications need to know the maximum virtual address they can
use (TASK_SIZE). There are several possible values for TASK_SIZE with the arm64
kernel, and such applications are either making bad hard-coded assumptions, or
are guessing and checking using system calls like munmap(), which may have
other reasons for returning an error than TASK_SIZE being exceeded. To make
correct functioning easy for userspace applications that need to know the
maximum virtual address they can use, communicate TASK_SIZE via the ELF
auxiliary vector, just like PAGE_SIZE is currently communicated.

Signed-off-by: Christopher Covington <cov@codeaurora.org>
---
Tested with the following commands:
LD_SHOW_AUXV=1 sleep 1 # GNU dynamic ld-linux*.so
hexdump -v -e '4/4 "%08x " "\n"' /proc/self/auxv | \
  sed -r 's/0*([^ ]+) ([^ ]+) ([^ ]+) ([^ ]+)/\1 0x\4\3/
    s/^0 /    NULL: /
    s/^3 /    PHDR: /
    s/^4 /   PHENT: /
    s/^5 /   PHNUM: /
    s/^6 /  PAGESZ: /
    s/^7 /    BASE: /
    s/^8 /   FLAGS: /
    s/^9 /   ENTRY: /
    s/^b /     UID: /
    s/^c /    EUID: /
    s/^d /     GID: /
    s/^e /    EGID: /
    s/^f /PLATFORM: /
    s/^10 /   HWCAP: /
    s/^11 /  CLKTCK: /
    s/^17 /  SECURE: /
    s/^19 /  RANDOM: /
    s/^1f /  EXECFN: /
    s/^21 /    VDSO: /
    s/^22 /  TASKSZ: /' # compatible with static busybox
---
 arch/arm64/include/asm/elf.h         | 1 +
 arch/arm64/include/uapi/asm/auxvec.h | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)
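
The test commands above decode /proc/self/auxv as a sequence of (type, value) pairs ending at AT_NULL. A minimal C sketch of the same lookup for a 64-bit process follows. The names prefixed with ex_ are hypothetical, and the value 34 for the TASK_SIZE entry is the one proposed by this patch, not something assumed to exist on other kernels or architectures.

```c
#include <stddef.h>
#include <stdint.h>

#define EX_AT_NULL   0   /* terminates the auxiliary vector */
#define EX_AT_PAGESZ 6   /* long-standing page size entry */
#define EX_AT_TASKSZ 34  /* proposed by this patch (assumption elsewhere) */

/* One auxiliary vector entry as laid out for a 64-bit process. */
struct ex_auxv_entry {
	uint64_t type;
	uint64_t value;
};

/*
 * Scan at most `count` entries for `type`.
 * Returns 1 and stores the value in *out when found,
 * 0 if AT_NULL or the end of the buffer is reached first.
 */
int ex_auxv_lookup(const struct ex_auxv_entry *auxv, size_t count,
		   uint64_t type, uint64_t *out)
{
	for (size_t i = 0; i < count; i++) {
		if (auxv[i].type == EX_AT_NULL)
			break;
		if (auxv[i].type == type) {
			*out = auxv[i].value;
			return 1;
		}
	}
	return 0;
}
```

A caller would read /proc/self/auxv into a buffer of such entries and pass it in. An absent entry (return value 0) would indicate a kernel without the patch.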

Comments

Catalin Marinas Aug. 17, 2016, 10:30 a.m. UTC | #1
On Tue, Aug 16, 2016 at 02:32:29PM -0400, Christopher Covington wrote:
> Some userspace applications need to know the maximum virtual address they can
> use (TASK_SIZE).

Just curious, what are the cases needing TASK_SIZE in user space?
Christopher Covington Aug. 17, 2016, 11:12 a.m. UTC | #2
On August 17, 2016 6:30:06 AM EDT, Catalin Marinas <catalin.marinas@arm.com> wrote:
>On Tue, Aug 16, 2016 at 02:32:29PM -0400, Christopher Covington wrote:
>> Some userspace applications need to know the maximum virtual address they can
>> use (TASK_SIZE).
>
>Just curious, what are the cases needing TASK_SIZE in user space?

Checkpoint/Restore In Userspace and the Mozilla Javascript Engine https://bugzilla.mozilla.org/show_bug.cgi?id=1143022 are the specific cases I've run into. I've heard LuaJIT might have a similar situation. In general I think making allocations from the top down is a shortcut for finding a large unused region of memory.

Thanks,
Cov
Ard Biesheuvel Aug. 18, 2016, noon UTC | #3
On 17 August 2016 at 13:12, Christopher Covington <cov@codeaurora.org> wrote:
>
>
> On August 17, 2016 6:30:06 AM EDT, Catalin Marinas <catalin.marinas@arm.com> wrote:
>>On Tue, Aug 16, 2016 at 02:32:29PM -0400, Christopher Covington wrote:
>>> Some userspace applications need to know the maximum virtual address they can
>>> use (TASK_SIZE).
>>
>>Just curious, what are the cases needing TASK_SIZE in user space?
>
> Checkpoint/Restore In Userspace and the Mozilla Javascript Engine https://bugzilla.mozilla.org/show_bug.cgi?id=1143022 are the specific cases I've run into. I've heard LuaJIT might have a similar situation. In general I think making allocations from the top down is a shortcut for finding a large unused region of memory.
>

One aspect of this that I would like to discuss is whether the current
practice makes sense, of tying TASK_SIZE to whatever the size of the
kernel VA space is.

I could imagine simply limiting the user VA space to 39-bits (or even
36-bits, depending on how deeply we care about 16 KB pages), and
implement an arch specific hook (prctl() perhaps?) to increase
TASK_SIZE on demand. That would not only give us a reliable way to
check whether this is supported (i.e., the prctl() would return error
if it isn't), it also allows for some optimizations, since a 48-bit VA
kernel can run all processes using 3 levels with relative ease (and
switching between 4levels and 3levels processes would also be
possible, but would either require a TLB flush, or would result in
this optimization being disabled globally, whichever is less costly in
terms of performance)
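
The relation between VA width, page size, and the number of translation levels mentioned above can be sketched in C. This is an illustration under the assumption of 8-byte table entries and one page per table. The function names are hypothetical and not taken from the kernel.

```c
#include <stdint.h>

/* Size in bytes of a user address space that is `va_bits` wide. */
uint64_t ex_task_size(unsigned int va_bits)
{
	return UINT64_C(1) << va_bits;
}

/*
 * Number of translation levels needed to resolve `va_bits` with pages of
 * 2^page_shift bytes. Each table fills one page of 8-byte entries, so one
 * level resolves (page_shift - 3) bits of the address.
 */
unsigned int ex_pgtable_levels(unsigned int va_bits, unsigned int page_shift)
{
	unsigned int bits_per_level = page_shift - 3;
	unsigned int bits_to_resolve = va_bits - page_shift;

	/* Round up: a partially used top-level table still counts as a level. */
	return (bits_to_resolve + bits_per_level - 1) / bits_per_level;
}
```

With 4 KB pages (page_shift 12) this gives 3 levels for 39 bits and 4 levels for 48 bits. With 16 KB pages (page_shift 14) a 36-bit space needs only 2 levels.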
Richard Weinberger Aug. 18, 2016, 12:17 p.m. UTC | #4
On Wed, Aug 17, 2016 at 1:12 PM, Christopher Covington
<cov@codeaurora.org> wrote:
>
>
> On August 17, 2016 6:30:06 AM EDT, Catalin Marinas <catalin.marinas@arm.com> wrote:
>>On Tue, Aug 16, 2016 at 02:32:29PM -0400, Christopher Covington wrote:
>>> Some userspace applications need to know the maximum virtual address they can
>>> use (TASK_SIZE).
>>
>>Just curious, what are the cases needing TASK_SIZE in user space?
>
> Checkpoint/Restore In Userspace and the Mozilla Javascript Engine https://bugzilla.mozilla.org/show_bug.cgi?id=1143022 are the specific cases I've run into. I've heard LuaJIT might have a similar situation. In general I think making allocations from the top down is a shortcut for finding a large unused region of memory.

I think this makes sense for all archs.
At least UserModeLinux on x86 also needs to know the bottom and top
addresses of the usable address space.
Currently it figures them out by scanning and catching SIGSEGV.
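
The guess-and-check approach that userspace falls back on today can be sketched as a search over candidate address space widths. This is only an illustration: the candidate list reflects the arm64 VA widths as I understand them (an assumption), all ex_ names are hypothetical, and a real probe is deliberately left to the caller because probing the live address space has side effects.

```c
#include <stddef.h>
#include <stdint.h>

/* Probe callback: nonzero if `addr` lies inside the usable user space. */
typedef int (*ex_probe_fn)(uint64_t addr, void *ctx);

/* Candidate user VA widths, smallest first (assumed list). */
static const unsigned int ex_candidate_bits[] = { 36, 39, 42, 47, 48 };

/*
 * Returns the largest candidate width whose last page is usable,
 * or 0 if even the smallest candidate fails.
 */
unsigned int ex_guess_va_bits(ex_probe_fn probe, void *ctx, uint64_t page_size)
{
	size_t n = sizeof(ex_candidate_bits) / sizeof(ex_candidate_bits[0]);
	unsigned int best = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t last_page =
			(UINT64_C(1) << ex_candidate_bits[i]) - page_size;

		if (!probe(last_page, ctx))
			break;
		best = ex_candidate_bits[i];
	}
	return best;
}

/* Stand-in probe: pretends the limit is the uint64_t that ctx points to. */
int ex_fake_probe(uint64_t addr, void *ctx)
{
	return addr < *(const uint64_t *)ctx;
}
```

An auxv entry such as the one proposed in the patch would replace this whole search with a single lookup.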
Catalin Marinas Aug. 18, 2016, 12:42 p.m. UTC | #5
On Thu, Aug 18, 2016 at 02:00:56PM +0200, Ard Biesheuvel wrote:
> On 17 August 2016 at 13:12, Christopher Covington <cov@codeaurora.org> wrote:
> > On August 17, 2016 6:30:06 AM EDT, Catalin Marinas <catalin.marinas@arm.com> wrote:
> >>On Tue, Aug 16, 2016 at 02:32:29PM -0400, Christopher Covington wrote:
> >>> Some userspace applications need to know the maximum virtual address they can
> >>> use (TASK_SIZE).
> >>
> >>Just curious, what are the cases needing TASK_SIZE in user space?
> >
> > Checkpoint/Restore In Userspace and the Mozilla Javascript Engine
> > https://bugzilla.mozilla.org/show_bug.cgi?id=1143022 are the
> > specific cases I've run into. I've heard LuaJIT might have a similar
> > situation. In general I think making allocations from the top down
> > is a shortcut for finding a large unused region of memory.
> 
> One aspect of this that I would like to discuss is whether the current
> practice makes sense, of tying TASK_SIZE to whatever the size of the
> kernel VA space is.

I'm fine with decoupling them as long as we can have sane
pgd/pud/pmd/pte macros. We rely on generic files like pgtable-nopud.h
etc. currently, so we would have to give up on that and do our own
checks. It's also worth testing any potential performance implication of
creating/tearing down large page tables with the new macros.

> I could imagine simply limiting the user VA space to 39-bits (or even
> 36-bits, depending on how deeply we care about 16 KB pages), and
> implement an arch specific hook (prctl() perhaps?) to increase
> TASK_SIZE on demand.

As you stated below, switching TASK_SIZE on demand is problematic if you
actually want to switch TCR_EL1.T0SZ. As per other recent
discussions, I'm not sure we can do it safely without full TLBI on
context switch. That's an aspect we'll have to sort out with 52-bit VA
but most likely we'll allow this range in T0SZ and only artificially
limit TASK_SIZE to smaller values so that we don't break any other
tasks. But then you won't gain much from a reduced number of page table
levels.
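
For reference, TCR_EL1.T0SZ encodes the size of the TTBR0 region as 2^(64 - T0SZ) bytes. The distinction drawn above, a wide hardware range with a smaller software TASK_SIZE, can be sketched as follows. The helper names are hypothetical and this is not kernel code.

```c
#include <stdint.h>

/* T0SZ value that gives a TTBR0 region of 2^va_bits bytes. */
unsigned int ex_t0sz_for_va_bits(unsigned int va_bits)
{
	return 64 - va_bits;
}

/*
 * Software limit check: the hardware may translate a wider range,
 * but a task is only allowed to use addresses below task_size.
 */
int ex_user_addr_allowed(uint64_t addr, uint64_t task_size)
{
	return addr < task_size;
}
```

With T0SZ set for 52 bits (value 12) and TASK_SIZE kept at 2^48, an address at or above 2^48 would be translatable by the hardware but rejected by the software check, and the page table depth stays at what the 52-bit range requires.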

> That would not only give us a reliable way to check whether this is
> supported (i.e., the prctl() would return error if it isn't), it also
> allows for some optimizations, since a 48-bit VA kernel can run all
> processes using 3 levels with relative ease (and switching between
> 4levels and 3levels processes would also be possible, but would either
> require a TLB flush, or would result in this optimization being
> disabled globally, whichever is less costly in terms of performance)

I'm more for using 48-bit VA permanently for both user and kernel (and
52-bit VA at some point in the future, though limiting user space to
48-bit VA by default). But it would be good to get some benchmark
numbers on the impact to see whether it's still worth keeping the other
VA combinations around.
Ard Biesheuvel Aug. 18, 2016, 1:18 p.m. UTC | #6
On 18 August 2016 at 14:42, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Thu, Aug 18, 2016 at 02:00:56PM +0200, Ard Biesheuvel wrote:
>> On 17 August 2016 at 13:12, Christopher Covington <cov@codeaurora.org> wrote:
>> > On August 17, 2016 6:30:06 AM EDT, Catalin Marinas <catalin.marinas@arm.com> wrote:
>> >>On Tue, Aug 16, 2016 at 02:32:29PM -0400, Christopher Covington wrote:
>> >>> Some userspace applications need to know the maximum virtual address they can
>> >>> use (TASK_SIZE).
>> >>
>> >>Just curious, what are the cases needing TASK_SIZE in user space?
>> >
>> > Checkpoint/Restore In Userspace and the Mozilla Javascript Engine
>> > https://bugzilla.mozilla.org/show_bug.cgi?id=1143022 are the
>> > specific cases I've run into. I've heard LuaJIT might have a similar
>> > situation. In general I think making allocations from the top down
>> > is a shortcut for finding a large unused region of memory.
>>
>> One aspect of this that I would like to discuss is whether the current
>> practice makes sense, of tying TASK_SIZE to whatever the size of the
>> kernel VA space is.
>
> I'm fine with decoupling them as long as we can have sane
> pgd/pud/pmd/pte macros. We rely on generic files like pgtable-nopud.h
> etc. currently, so we would have to give up on that and do our own
> checks. It's also worth testing any potential performance implication of
> creating/tearing down large page tables with the new macros.
>

Well, I don't think it is necessarily worth the trouble of rewriting
all that. My concern is that TASK_SIZE randomly increased to 48 bits
recently, merely because some Freescale SoCs cannot fit their RAM into
the linear mapping on a 39-bit VA kernel. This had nothing to do with
userland requirements. Do we know the userland requirements? What use
cases do we know about that require >39 bit userland VA space?

>> I could imagine simply limiting the user VA space to 39-bits (or even
>> 36-bits, depending on how deeply we care about 16 KB pages), and
>> implement an arch specific hook (prctl() perhaps?) to increase
>> TASK_SIZE on demand.
>
> As you stated below, switching TASK_SIZE on demand is problematic if you
> actually want to switch TCR_EL1.T0SZ. As per other recent
> discussions, I'm not sure we can do it safely without full TLBI on
> context switch. That's an aspect we'll have to sort out with 52-bit VA
> but most likely we'll allow this range in T0SZ and only artificially
> limit TASK_SIZE to smaller values so that we don't break any other
> tasks. But then you won't gain much from a reduced number of page table
> levels.
>

There are several ways to go about this. The 48-bit VA kernel could
run everything with 3 levels, and simply switch to 4 levels the moment
some process needs it. So we keep all the existing macros, but simply
point TTBR0_EL1 to the level 1 translation table rather than to the
level 0 table (and update T0SZ accordingly). So when the first 48 bit
VA userland process arrives (which may be never in many cases), we
either switch to 4 levels for everything (and the page tables are
already set up for that), or we do a TLB flush, but only when
switching from a 4levels task to a 3levels task or vice versa (but
this is messy so the first approach is probably more suitable)

So there is no associated space savings, only the TLB and cache
footprint gets optimized.
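
Why the walk can start at the level 1 table for a 39-bit task can be seen from how a virtual address is split into table indices. The sketch below assumes the 4 KB granule with 9 bits per level. The function name is hypothetical.

```c
#include <stdint.h>

/*
 * Index into the translation table at `level` (0 to 3) for `va`,
 * assuming 4 KB pages: level 3 uses bits 20:12, level 2 bits 29:21,
 * level 1 bits 38:30 and level 0 bits 47:39.
 */
unsigned int ex_table_index(uint64_t va, unsigned int level)
{
	unsigned int shift = 12 + 9 * (3 - level);

	return (unsigned int)((va >> shift) & 0x1ff);
}
```

Every address below 2^39 yields index 0 at level 0, so all of a 39-bit task's mappings hang off a single level 1 table. That table can serve as the root of a 3-level walk, or as entry 0 of a level 0 table in a 4-level walk.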

>> That would not only give us a reliable way to check whether this is
>> supported (i.e., the prctl() would return error if it isn't), it also
>> allows for some optimizations, since a 48-bit VA kernel can run all
>> processes using 3 levels with relative ease (and switching between
>> 4levels and 3levels processes would also be possible, but would either
>> require a TLB flush, or would result in this optimization being
>> disabled globally, whichever is less costly in terms of performance)
>
> I'm more for using 48-bit VA permanently for both user and kernel (and
> 52-bit VA at some point in the future, though limiting user space to
> 48-bit VA by default). But it would be good to get some benchmark
> numbers on the impact to see whether it's still worth keeping the other
> VA combinations around.
>

Of course, none of this complexity is justified if the performance
impact is negligible. I do wonder about the virt case, though.
Christopher Covington Sept. 9, 2016, 2:14 p.m. UTC | #7
Hi Richard,

On 08/18/2016 08:17 AM, Richard Weinberger wrote:
> On Wed, Aug 17, 2016 at 1:12 PM, Christopher Covington
> <cov@codeaurora.org> wrote:
>>
>>
>> On August 17, 2016 6:30:06 AM EDT, Catalin Marinas <catalin.marinas@arm.com> wrote:
>>> On Tue, Aug 16, 2016 at 02:32:29PM -0400, Christopher Covington wrote:
>>>> Some userspace applications need to know the maximum virtual address they can
>>>> use (TASK_SIZE).
>>>
>>> Just curious, what are the cases needing TASK_SIZE in user space?
>>
>> Checkpoint/Restore In Userspace and the Mozilla Javascript Engine https://bugzilla.mozilla.org/show_bug.cgi?id=1143022 are the specific cases I've run into. I've heard LuaJIT might have a similar situation. In general I think making allocations from the top down is a shortcut for finding a large unused region of memory.
> 
> I think this makes sense for all archs.
> At least UserModeLinux on x86 also needs to know the bottom and top
> addresses of the usable address space.
> Currently it figures them out by scanning and catching SIGSEGV.

For the bottom, can you use /proc/sys/vm/mmap_min_addr?

Cov
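
Reading the lower bound from /proc/sys/vm/mmap_min_addr could look roughly like the following C sketch. The function names are hypothetical, and the file is assumed to hold a single decimal number.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse a decimal unsigned long; returns 0 on success, -1 on failure. */
int ex_parse_ulong(const char *text, unsigned long *out)
{
	char *end;
	unsigned long value;

	errno = 0;
	value = strtoul(text, &end, 10);
	if (errno != 0 || end == text)
		return -1;
	*out = value;
	return 0;
}

/* Read the lowest address mmap() may use; returns 0 on success. */
int ex_read_mmap_min_addr(unsigned long *out)
{
	char buf[32];
	FILE *f = fopen("/proc/sys/vm/mmap_min_addr", "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return ex_parse_ulong(buf, out);
}
```

A caller should treat failure as "unknown" rather than as zero, since the file may be missing or unreadable in some environments.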

Patch

diff --git a/arch/arm64/include/asm/elf.h b/arch/arm64/include/asm/elf.h
index a55384f..3811795 100644
--- a/arch/arm64/include/asm/elf.h
+++ b/arch/arm64/include/asm/elf.h
@@ -145,6 +145,7 @@  typedef struct user_fpsimd_state elf_fpregset_t;
 do {									\
 	NEW_AUX_ENT(AT_SYSINFO_EHDR,					\
 		    (elf_addr_t)current->mm->context.vdso);		\
+	NEW_AUX_ENT(AT_TASKSZ, TASK_SIZE);				\
 } while (0)
 
 #define ARCH_HAS_SETUP_ADDITIONAL_PAGES
diff --git a/arch/arm64/include/uapi/asm/auxvec.h b/arch/arm64/include/uapi/asm/auxvec.h
index 4cf0c17..595bfda 100644
--- a/arch/arm64/include/uapi/asm/auxvec.h
+++ b/arch/arm64/include/uapi/asm/auxvec.h
@@ -18,7 +18,8 @@ 
 
 /* vDSO location */
 #define AT_SYSINFO_EHDR	33
+#define AT_TASKSZ	34
 
-#define AT_VECTOR_SIZE_ARCH 1 /* entries in ARCH_DLINFO */
+#define AT_VECTOR_SIZE_ARCH 2 /* entries in ARCH_DLINFO */
 
 #endif
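
On a kernel carrying this patch, a userspace consumer could query the new entry with getauxval(). This is a minimal sketch: EX_AT_TASKSZ and the function name are hypothetical, and the value 34 is taken from the patch above. Other architectures may assign type 34 a different meaning, so the lookup is only meaningful on arm64 with the patch applied.

```c
#include <sys/auxv.h>

/* Entry type proposed by the patch for arm64 (assumption elsewhere). */
#define EX_AT_TASKSZ 34

/*
 * Returns the TASK_SIZE reported by the kernel, or `fallback` when the
 * entry is absent (getauxval() returns 0 for unknown types).
 */
unsigned long ex_task_size_or(unsigned long fallback)
{
	unsigned long value = getauxval(EX_AT_TASKSZ);

	return value ? value : fallback;
}
```

The fallback lets an application keep its existing hard-coded or probed limit on kernels that do not provide the entry.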