[v4,11/26] arm64: mte: Add PROT_MTE support to mmap() and mprotect()

Message ID 20200515171612.1020-12-catalin.marinas@arm.com
State New, archived
Series arm64: Memory Tagging Extension user-space support

Commit Message

Catalin Marinas May 15, 2020, 5:15 p.m. UTC
To enable tagging on a memory range, the user must explicitly opt in via
a new PROT_MTE flag passed to mmap() or mprotect(). Since this is a new
memory type in the AttrIndx field of a pte, simplify the or'ing of these
bits over the protection_map[] attributes by making MT_NORMAL index 0.

There are two conditions for arch_vm_get_page_prot() to return the
MT_NORMAL_TAGGED memory type: (1) the user requested it via PROT_MTE,
registered as VM_MTE in the vm_flags, and (2) the vma supports MTE,
decided during the mmap() call (only) and registered as VM_MTE_ALLOWED.

arch_calc_vm_prot_bits() is responsible for registering the user request
as VM_MTE. The newly introduced arch_calc_vm_flag_bits() sets
VM_MTE_ALLOWED if the mapping is MAP_ANONYMOUS. An MTE-capable
filesystem (RAM-based) may be able to set VM_MTE_ALLOWED during its
mmap() file ops call.

In addition, update VM_DATA_DEFAULT_FLAGS to allow mprotect(PROT_MTE) on
the stack or brk area.

The Linux mmap() syscall currently ignores unknown PROT_* flags. In the
presence of MTE, an mmap(PROT_MTE) on a file which does not support MTE
will not report an error and the memory will not be mapped as Normal
Tagged. For consistency, mprotect(PROT_MTE) will not report an error
either if the memory range does not support MTE. Two subsequent patches
in the series will propose tightening of this behaviour.
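
As an illustration (not part of this patch), a user would request a
tagged anonymous mapping along these lines, with PROT_MTE as defined by
the new uapi header:

	#include <sys/mman.h>

	#ifndef PROT_MTE
	#define PROT_MTE	0x20	/* arch/arm64/include/uapi/asm/mman.h */
	#endif

	/* new tagged mapping */
	void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_MTE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* or opt in an existing anonymous range */
	mprotect(p, 4096, PROT_READ | PROT_WRITE | PROT_MTE);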

Co-developed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
---

Notes:
    v2:
    - Add VM_MTE_ALLOWED to show_smap_vma_flags().

 arch/arm64/include/asm/memory.h    | 18 +++++----
 arch/arm64/include/asm/mman.h      | 64 ++++++++++++++++++++++++++++++
 arch/arm64/include/asm/page.h      |  2 +-
 arch/arm64/include/asm/pgtable.h   |  7 +++-
 arch/arm64/include/uapi/asm/mman.h | 14 +++++++
 fs/proc/task_mmu.c                 |  4 ++
 include/linux/mm.h                 |  8 ++++
 7 files changed, 108 insertions(+), 9 deletions(-)
 create mode 100644 arch/arm64/include/asm/mman.h
 create mode 100644 arch/arm64/include/uapi/asm/mman.h

Comments

Peter Collingbourne May 27, 2020, 6:57 p.m. UTC | #1
On Fri, May 15, 2020 at 10:16 AM Catalin Marinas
<catalin.marinas@arm.com> wrote:
>
> To enable tagging on a memory range, the user must explicitly opt in via
> a new PROT_MTE flag passed to mmap() or mprotect(). Since this is a new
> memory type in the AttrIndx field of a pte, simplify the or'ing of these
> bits over the protection_map[] attributes by making MT_NORMAL index 0.

Should the userspace stack always be mapped as if with PROT_MTE if the
hardware supports it? Such a change would be invisible to non-MTE
aware userspace since it would already need to opt in to tag checking
via prctl. This would let userspace avoid a complex stack
initialization sequence when running with stack tagging enabled on the
main thread.

Peter
Catalin Marinas May 28, 2020, 9:14 a.m. UTC | #2
On Wed, May 27, 2020 at 11:57:39AM -0700, Peter Collingbourne wrote:
> On Fri, May 15, 2020 at 10:16 AM Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > To enable tagging on a memory range, the user must explicitly opt in via
> > a new PROT_MTE flag passed to mmap() or mprotect(). Since this is a new
> > memory type in the AttrIndx field of a pte, simplify the or'ing of these
> > bits over the protection_map[] attributes by making MT_NORMAL index 0.
> 
> Should the userspace stack always be mapped as if with PROT_MTE if the
> hardware supports it? Such a change would be invisible to non-MTE
> aware userspace since it would already need to opt in to tag checking
> via prctl. This would let userspace avoid a complex stack
> initialization sequence when running with stack tagging enabled on the
> main thread.

I don't think the stack initialisation is that difficult. On program
startup (can be the dynamic loader), something like (untested):

	register unsigned long stack asm ("sp");
	unsigned long page_sz = sysconf(_SC_PAGESIZE);

	mprotect((void *)(stack & ~(page_sz - 1)), page_sz,
		 PROT_READ | PROT_WRITE | PROT_MTE | PROT_GROWSDOWN);

(the essential part is PROT_GROWSDOWN so that you don't have to specify
a stack lower limit)

I don't like enabling this by default since it will have a small cost
even if the application doesn't enable tag checking. The kernel would
still have to zero the tags when mapping the stack and preserve them
when swapping out.

Another case where this could go wrong is if we want to enable some
quiet monitoring of user programs: the libc enables PROT_MTE on heap
allocations but keeps tag checking disabled as it doesn't want any
SIGSEGV; the kernel could enable async TCF and log any faults
(rate-limited). A default PROT_MTE stack would get in the way. Anyway,
this use-case is something for the future; so far these patches rely on
the user solely driving the tag checking mode.

I'm fine, however, with enabling PROT_MTE on the main stack based on
some ELF note.
Szabolcs Nagy May 28, 2020, 11:05 a.m. UTC | #3
The 05/28/2020 10:14, Catalin Marinas wrote:
> On Wed, May 27, 2020 at 11:57:39AM -0700, Peter Collingbourne wrote:
> > On Fri, May 15, 2020 at 10:16 AM Catalin Marinas
> > <catalin.marinas@arm.com> wrote:
> > > To enable tagging on a memory range, the user must explicitly opt in via
> > > a new PROT_MTE flag passed to mmap() or mprotect(). Since this is a new
> > > memory type in the AttrIndx field of a pte, simplify the or'ing of these
> > > bits over the protection_map[] attributes by making MT_NORMAL index 0.
> > 
> > Should the userspace stack always be mapped as if with PROT_MTE if the
> > hardware supports it? Such a change would be invisible to non-MTE
> > aware userspace since it would already need to opt in to tag checking
> > via prctl. This would let userspace avoid a complex stack
> > initialization sequence when running with stack tagging enabled on the
> > main thread.
> 
> I don't think the stack initialisation is that difficult. On program
> startup (can be the dynamic loader), something like (untested):
> 
> 	register unsigned long stack asm ("sp");
> 	unsigned long page_sz = sysconf(_SC_PAGESIZE);
> 
> 	mprotect((void *)(stack & ~(page_sz - 1)), page_sz,
> 		 PROT_READ | PROT_WRITE | PROT_MTE | PROT_GROWSDOWN);
> 
> (the essential part is PROT_GROWSDOWN so that you don't have to specify
> a stack lower limit)

does this work even if the currently mapped stack is more than page_sz?
determining the mapped main stack area is i think non-trivial to do in
userspace (requires parsing /proc/self/maps or similar).
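
a rough sketch of the /proc/self/maps approach (untested, needs
<stdio.h> and <string.h>):

	FILE *f = fopen("/proc/self/maps", "r");
	char line[256];
	unsigned long lo, hi;

	while (f && fgets(line, sizeof(line), f)) {
		/* the kernel labels the main thread stack [stack] */
		if (strstr(line, "[stack]") &&
		    sscanf(line, "%lx-%lx", &lo, &hi) == 2)
			break;
	}
	if (f)
		fclose(f);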

...
> I'm fine, however, with enabling PROT_MTE on the main stack based on
> some ELF note.

note that would likely mean an elf note on the dynamic linker
(because a dynamic linked executable may not be loaded by the
kernel and ctors in loaded libs run before the executable entry
code anyway, so the executable alone cannot be in charge of this
decision) i.e. one global switch for all dynamic linked binaries.

i think a dynamic linker can map a new stack and switch to it
if it needs to control the properties of the stack at runtime
(it's wasteful though).

and i think there should be a runtime mechanism for the brk area:
it should be possible to request that future brk expansions are
mapped as PROT_MTE so an mte aware malloc implementation can use
brk. i think this is not important in the initial design, but if
a prctl flag can do it that may be useful to add (may be at a
later time).

(and eventually there should be a way to use PROT_MTE on
writable global data and appropriate code generation that
takes colors into account when globals are accessed, but
that requires significant ELF, ld.so and compiler changes,
that need not be part of the initial mte design).
Catalin Marinas May 28, 2020, 4:34 p.m. UTC | #4
On Thu, May 28, 2020 at 12:05:09PM +0100, Szabolcs Nagy wrote:
> The 05/28/2020 10:14, Catalin Marinas wrote:
> > On Wed, May 27, 2020 at 11:57:39AM -0700, Peter Collingbourne wrote:
> > > On Fri, May 15, 2020 at 10:16 AM Catalin Marinas
> > > <catalin.marinas@arm.com> wrote:
> > > > To enable tagging on a memory range, the user must explicitly opt in via
> > > > a new PROT_MTE flag passed to mmap() or mprotect(). Since this is a new
> > > > memory type in the AttrIndx field of a pte, simplify the or'ing of these
> > > > bits over the protection_map[] attributes by making MT_NORMAL index 0.
> > > 
> > > Should the userspace stack always be mapped as if with PROT_MTE if the
> > > hardware supports it? Such a change would be invisible to non-MTE
> > > aware userspace since it would already need to opt in to tag checking
> > > via prctl. This would let userspace avoid a complex stack
> > > initialization sequence when running with stack tagging enabled on the
> > > main thread.
> > 
> > I don't think the stack initialisation is that difficult. On program
> > startup (can be the dynamic loader), something like (untested):
> > 
> > 	register unsigned long stack asm ("sp");
> > 	unsigned long page_sz = sysconf(_SC_PAGESIZE);
> > 
> > 	mprotect((void *)(stack & ~(page_sz - 1)), page_sz,
> > 		 PROT_READ | PROT_WRITE | PROT_MTE | PROT_GROWSDOWN);
> > 
> > (the essential part is PROT_GROWSDOWN so that you don't have to specify
> > a stack lower limit)
> 
> does this work even if the currently mapped stack is more than page_sz?
> determining the mapped main stack area is i think non-trivial to do in
> userspace (requires parsing /proc/self/maps or similar).

Because of PROT_GROWSDOWN, the kernel adjusts the start of the range
down automatically. It is potentially problematic if the top of the
stack is more than a page away and you want the whole stack coloured. I
haven't run a test but my reading of the kernel code is that the stack
vma would be split in this scenario, so the range beyond sp+page_sz
won't have PROT_MTE set.

My assumption is that if you do this during program start, the stack is
smaller than a page. Alternatively, could we use argv or envp to
determine the top of the user stack (the bottom is taken care of by the
kernel)?

> > I'm fine, however, with enabling PROT_MTE on the main stack based on
> > some ELF note.
> 
> note that would likely mean an elf note on the dynamic linker
> (because a dynamic linked executable may not be loaded by the
> kernel and ctors in loaded libs run before the executable entry
> code anyway, so the executable alone cannot be in charge of this
> decision) i.e. one global switch for all dynamic linked binaries.

I guess parsing such a note in the kernel is only useful for static
binaries.

> i think a dynamic linker can map a new stack and switch to it
> if it needs to control the properties of the stack at runtime
> (it's wasteful though).

There is already user code to check for HWCAP2_MTE and the prctl(), so
adding an mprotect() doesn't look like a significant overhead.
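
To be concrete, the startup sequence would be something like this
(untested, using the HWCAP and prctl() flags proposed elsewhere in this
series):

	#include <sys/auxv.h>
	#include <sys/prctl.h>
	#include <asm/hwcap.h>		/* HWCAP2_MTE */

	if (getauxval(AT_HWCAP2) & HWCAP2_MTE) {
		/* enable tag checking, synchronous faults here */
		prctl(PR_SET_TAGGED_ADDR_CTRL,
		      PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC, 0, 0, 0);
		/* followed by the mprotect(PROT_MTE | PROT_GROWSDOWN) above */
	}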

> and i think there should be a runtime mechanism for the brk area:
> it should be possible to request that future brk expansions are
> mapped as PROT_MTE so an mte aware malloc implementation can use
> brk. i think this is not important in the initial design, but if
> a prctl flag can do it that may be useful to add (may be at a
> later time).

Looking at the kernel code briefly, I think this would work. We do end
up with two vmas for the brk, only the expansion having PROT_MTE, and
I'd have to find a way to store the extra flag.

From a coding perspective, it's easier to just set PROT_MTE by default
on both brk and initial stack ;) (VM_DATA_DEFAULT_FLAGS).

> (and eventually there should be a way to use PROT_MTE on
> writable global data and appropriate code generation that
> takes colors into account when globals are accessed, but
> that requires significant ELF, ld.so and compiler changes,
> that need not be part of the initial mte design).

The .data section needs to be driven by the ELF information. It's also a
file mapping and we don't support PROT_MTE on them even if MAP_PRIVATE.
There are complications like DAX where the file you mmap for CoW may be
hosted on memory that does not support MTE (copied to RAM on write).

Is there a use-case for global data to be tagged?
Catalin Marinas May 29, 2020, 11:19 a.m. UTC | #5
On Thu, May 28, 2020 at 11:35:50AM -0700, Evgenii Stepanov wrote:
> On Thu, May 28, 2020 at 9:34 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Thu, May 28, 2020 at 12:05:09PM +0100, Szabolcs Nagy wrote:
> > > On 05/28/2020 10:14, Catalin Marinas wrote:
> > > > I don't think the stack initialisation is that difficult. On program
> > > > startup (can be the dynamic loader), something like (untested):
> > > >
> > > >     register unsigned long stack asm ("sp");
> > > >     unsigned long page_sz = sysconf(_SC_PAGESIZE);
> > > >
> > > >     mprotect((void *)(stack & ~(page_sz - 1)), page_sz,
> > > >              PROT_READ | PROT_WRITE | PROT_MTE | PROT_GROWSDOWN);
> > > >
> > > > (the essential part is PROT_GROWSDOWN so that you don't have to specify
> > > > a stack lower limit)
> > >
> > > does this work even if the currently mapped stack is more than page_sz?
> > > determining the mapped main stack area is i think non-trivial to do in
> > > userspace (requires parsing /proc/self/maps or similar).
> >
> > Because of PROT_GROWSDOWN, the kernel adjusts the start of the range
> > down automatically. It is potentially problematic if the top of the
> > stack is more than a page away and you want the whole stack coloured. I
> > haven't run a test but my reading of the kernel code is that the stack
> > vma would be split in this scenario, so the range beyond sp+page_sz
> > won't have PROT_MTE set.
> >
> > My assumption is that if you do this during program start, the stack is
> > smaller than a page. Alternatively, could we use argv or envp to
> > determine the top of the user stack (the bottom is taken care of by the
> > kernel)?
> 
> PROT_GROWSDOWN seems to work fine in our case, and the extra tag
> maintenance overhead sounds like a valid argument against setting PROT_MTE
> unconditionally.
> 
> On the other hand, we may end up doing this in the userspace in every
> process. The reason is that PROT_MTE cannot be set on a page that contains a
> live frame with stack tagging because of mismatching tags (IRG is not
> affected by PROT_MTE but STG is). So ideally, this should be done at (or
> near) the program entry point, while the stack is mostly empty.

Since stack tagging cannot use instructions in the NOP space anyway, I
think we need an ELF note to check for the presence of STG etc. and, in
addition, we can turn on PROT_MTE by default for the initial stack. Maybe on
such binaries we could just set PROT_MTE on all anonymous and ramfs
mappings (i.e. VM_MTE_ALLOWED implies VM_MTE).

For dynamically linked binaries, we base this decision on the main ELF,
not the interpreter, and it would be up to the dynamic loader to reject
libraries that have such a note when HWCAP2_MTE is not present.

> > > (and eventually there should be a way to use PROT_MTE on
> > > writable global data and appropriate code generation that
> > > takes colors into account when globals are accessed, but
> > > that requires significant ELF, ld.so and compiler changes,
> > > that need not be part of the initial mte design).
> >
> > The .data section needs to be driven by the ELF information. It's also a
> > file mapping and we don't support PROT_MTE on them even if MAP_PRIVATE.
> > There are complications like DAX where the file you mmap for CoW may be
> > hosted on memory that does not support MTE (copied to RAM on write).
> >
> > Is there a use-case for global data to be tagged?
> 
> Yes, catching global buffer overflow bugs. They are not nearly as
> common as heap-based issues though.

OK, so these would be tagged red-zones around global data. IIUC, having
different colours for global variables was not considered because of the
relocations and relative accesses.

If such red-zone colouring is done during load (the dynamic linker?), we
could set PROT_MTE only when MAP_PRIVATE and copied on write to make
sure it is in RAM. As above, I think this should be driven by some ELF
information.

There's also the option of scrapping PROT_MTE altogether and enabling
MTE (default tag 0) on all anonymous and private+copied pages (i.e.
those stored in RAM). At this point, I can't really tell whether there
will be a performance impact.
Dave Martin June 1, 2020, 8:55 a.m. UTC | #6
On Thu, May 28, 2020 at 05:34:13PM +0100, Catalin Marinas wrote:
> On Thu, May 28, 2020 at 12:05:09PM +0100, Szabolcs Nagy wrote:
> > The 05/28/2020 10:14, Catalin Marinas wrote:
> > > On Wed, May 27, 2020 at 11:57:39AM -0700, Peter Collingbourne wrote:

[...]

Just jumping in on this point:

> > > > Should the userspace stack always be mapped as if with PROT_MTE if the
> > > > hardware supports it? Such a change would be invisible to non-MTE
> > > > aware userspace since it would already need to opt in to tag checking
> > > > via prctl. This would let userspace avoid a complex stack
> > > > initialization sequence when running with stack tagging enabled on the
> > > > main thread.
> > > 
> > > I don't think the stack initialisation is that difficult. On program
> > > startup (can be the dynamic loader), something like (untested):
> > > 
> > > 	register unsigned long stack asm ("sp");
> > > 	unsigned long page_sz = sysconf(_SC_PAGESIZE);
> > > 
> > > 	mprotect((void *)(stack & ~(page_sz - 1)), page_sz,
> > > 		 PROT_READ | PROT_WRITE | PROT_MTE | PROT_GROWSDOWN);
> > > 
> > > (the essential part is PROT_GROWSDOWN so that you don't have to specify
> > > a stack lower limit)
> > 
> > does this work even if the currently mapped stack is more than page_sz?
> > determining the mapped main stack area is i think non-trivial to do in
> > userspace (requires parsing /proc/self/maps or similar).
> 
> Because of PROT_GROWSDOWN, the kernel adjusts the start of the range
> down automatically. It is potentially problematic if the top of the
> stack is more than a page away and you want the whole stack coloured. I
> haven't run a test but my reading of the kernel code is that the stack
> vma would be split in this scenario, so the range beyond sp+page_sz
> won't have PROT_MTE set.
> 
> My assumption is that if you do this during program start, the stack is
> smaller than a page. Alternatively, could we use argv or envp to
> determine the top of the user stack (the bottom is taken care of by the
> kernel)?

I don't think you can easily know when the stack ends, but perhaps it
doesn't matter.

From memory, the initial stack looks like:

	argv/env strings
	AT_NULL
	auxv
	NULL
	env
	NULL
	argv
	argc	<--- sp

If we don't care about tagging the strings correctly, we could step to
the end of auxv and tag down from there.
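
Roughly (untested), given envp from the process entry:

	#include <elf.h>	/* Elf64_auxv_t, AT_NULL */

	char **p = envp;
	while (*p)			/* skip the environment pointers */
		p++;
	Elf64_auxv_t *auxv = (Elf64_auxv_t *)(p + 1);
	while (auxv->a_type != AT_NULL)
		auxv++;
	/* auxv + 1 is the end of auxv; tag down from the page containing it */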

If we do care about tagging the strings, there's probably no good way
to find the end of the string area, other than looking up sp in
/proc/self/maps.  I'm not sure we should trust all past and future
kernels to spit out the strings in a predictable order.

Assuming that the last env string has the highest address does not
sound like a good idea to me.  It would be easy for someone to break
that assumption later without realising.


If we're concerned about this, and reading /proc/self/auxv is deemed
unacceptable (likely: some binaries need to work before /proc is
mounted), then we could perhaps add a new auxv entry to report the stack
base address to the user startup code.


I don't think it matters if all this is "hard" for userspace: only the
C library / runtime should be doing this.  After libc startup, it's
generally too late to do this kind of thing safely.

[...]

Cheers
---Dave
Catalin Marinas June 1, 2020, 2:45 p.m. UTC | #7
On Mon, Jun 01, 2020 at 09:55:38AM +0100, Dave P Martin wrote:
> On Thu, May 28, 2020 at 05:34:13PM +0100, Catalin Marinas wrote:
> > On Thu, May 28, 2020 at 12:05:09PM +0100, Szabolcs Nagy wrote:
> > > The 05/28/2020 10:14, Catalin Marinas wrote:
> > > > On Wed, May 27, 2020 at 11:57:39AM -0700, Peter Collingbourne wrote:
> > > > > Should the userspace stack always be mapped as if with PROT_MTE if the
> > > > > hardware supports it? Such a change would be invisible to non-MTE
> > > > > aware userspace since it would already need to opt in to tag checking
> > > > > via prctl. This would let userspace avoid a complex stack
> > > > > initialization sequence when running with stack tagging enabled on the
> > > > > main thread.
> > > > 
> > > > I don't think the stack initialisation is that difficult. On program
> > > > startup (can be the dynamic loader), something like (untested):
> > > > 
> > > > 	register unsigned long stack asm ("sp");
> > > > 	unsigned long page_sz = sysconf(_SC_PAGESIZE);
> > > > 
> > > > 	mprotect((void *)(stack & ~(page_sz - 1)), page_sz,
> > > > 		 PROT_READ | PROT_WRITE | PROT_MTE | PROT_GROWSDOWN);
> > > > 
> > > > (the essential part is PROT_GROWSDOWN so that you don't have to specify
> > > > a stack lower limit)
> > > 
> > > does this work even if the currently mapped stack is more than page_sz?
> > > determining the mapped main stack area is i think non-trivial to do in
> > > userspace (requires parsing /proc/self/maps or similar).
> > 
> > Because of PROT_GROWSDOWN, the kernel adjusts the start of the range
> > down automatically. It is potentially problematic if the top of the
> > stack is more than a page away and you want the whole stack coloured. I
> > haven't run a test but my reading of the kernel code is that the stack
> > vma would be split in this scenario, so the range beyond sp+page_sz
> > won't have PROT_MTE set.
> > 
> > My assumption is that if you do this during program start, the stack is
> > smaller than a page. Alternatively, could we use argv or envp to
> > determine the top of the user stack (the bottom is taken care of by the
> > kernel)?
> 
> I don't think you can easily know when the stack ends, but perhaps it
> doesn't matter.
> 
> From memory, the initial stack looks like:
> 
> 	argv/env strings
> 	AT_NULL
> 	auxv
> 	NULL
> 	env
> 	NULL
> 	argv
> 	argc	<--- sp
> 
> If we don't care about tagging the strings correctly, we could step to
> the end of auxv and tag down from there.
> 
> If we do care about tagging the strings, there's probably no good way
> to find the end of the string area, other than looking up sp in
> /proc/self/maps.  I'm not sure we should trust all past and future
> kernels to spit out the strings in a predictable order.

I don't think we care about tagging whatever the kernel places on the
stack since the argv/envp pointers are untagged. An mprotect(PROT_MTE)
may or may not cover the environment but it shouldn't matter as the
kernel clears the tags on the corresponding pages anyway.

AFAIK stack tagging works by colouring a stack frame on function entry
and clearing the tags on return. We would only hit a problem if the
function issuing mprotect(sp, PROT_MTE) and its callers already
assumed a PROT_MTE stack. Without PROT_MTE, an STG would be
write-ignore, so subsequently turning it on would lead to a mismatch
between the pointer and the allocation tags.

So turning on PROT_MTE should happen very early in the user process
startup code, before any code runs with stack tagging enabled. Whether you
reach the top of the stack with such mprotect() doesn't really matter
since up to that point there should not be any use of stack tagging. If
that's not possible, for example if the glibc code setting up the stack was
itself compiled with stack tagging, the kernel would have to enable it
when the user process starts. However, I'd only do this based on some
ELF note.
Dave Martin June 1, 2020, 3:04 p.m. UTC | #8
On Mon, Jun 01, 2020 at 03:45:45PM +0100, Catalin Marinas wrote:
> On Mon, Jun 01, 2020 at 09:55:38AM +0100, Dave P Martin wrote:
> > On Thu, May 28, 2020 at 05:34:13PM +0100, Catalin Marinas wrote:
> > > On Thu, May 28, 2020 at 12:05:09PM +0100, Szabolcs Nagy wrote:
> > > > The 05/28/2020 10:14, Catalin Marinas wrote:
> > > > > On Wed, May 27, 2020 at 11:57:39AM -0700, Peter Collingbourne wrote:
> > > > > > Should the userspace stack always be mapped as if with PROT_MTE if the
> > > > > > hardware supports it? Such a change would be invisible to non-MTE
> > > > > > aware userspace since it would already need to opt in to tag checking
> > > > > > via prctl. This would let userspace avoid a complex stack
> > > > > > initialization sequence when running with stack tagging enabled on the
> > > > > > main thread.
> > > > > 
> > > > > I don't think the stack initialisation is that difficult. On program
> > > > > startup (can be the dynamic loader), something like (untested):
> > > > > 
> > > > > 	register unsigned long stack asm ("sp");
> > > > > 	unsigned long page_sz = sysconf(_SC_PAGESIZE);
> > > > > 
> > > > > 	mprotect((void *)(stack & ~(page_sz - 1)), page_sz,
> > > > > 		 PROT_READ | PROT_WRITE | PROT_MTE | PROT_GROWSDOWN);
> > > > > 
> > > > > (the essential part is PROT_GROWSDOWN so that you don't have to specify
> > > > > a stack lower limit)
> > > > 
> > > > does this work even if the currently mapped stack is more than page_sz?
> > > > determining the mapped main stack area is i think non-trivial to do in
> > > > userspace (requires parsing /proc/self/maps or similar).
> > > 
> > > Because of PROT_GROWSDOWN, the kernel adjusts the start of the range
> > > down automatically. It is potentially problematic if the top of the
> > > stack is more than a page away and you want the whole stack coloured. I
> > > haven't run a test but my reading of the kernel code is that the stack
> > > vma would be split in this scenario, so the range beyond sp+page_sz
> > > won't have PROT_MTE set.
> > > 
> > > My assumption is that if you do this during program start, the stack is
> > > smaller than a page. Alternatively, could we use argv or envp to
> > > determine the top of the user stack (the bottom is taken care of by the
> > > kernel)?
> > 
> > I don't think you can easily know when the stack ends, but perhaps it
> > doesn't matter.
> > 
> > From memory, the initial stack looks like:
> > 
> > 	argv/env strings
> > 	AT_NULL
> > 	auxv
> > 	NULL
> > 	env
> > 	NULL
> > 	argv
> > 	argc	<--- sp
> > 
> > If we don't care about tagging the strings correctly, we could step to
> > the end of auxv and tag down from there.
> > 
> > If we do care about tagging the strings, there's probably no good way
> > to find the end of the string area, other than looking up sp in
> > /proc/self/maps.  I'm not sure we should trust all past and future
> > kernels to spit out the strings in a predictable order.
> 
> I don't think we care about tagging whatever the kernel places on the
> stack since the argv/envp pointers are untagged. An mprotect(PROT_MTE)
> may or may not cover the environment but it shouldn't matter as the
> kernel clears the tags on the corresponding pages anyway.

We have no match-all tag, right?  So we do rely on the tags being
cleared for the initial stack contents so that using untagged pointers
to access it works.

> AFAIK stack tagging works by colouring a stack frame on function entry
> and clearing the tags on return. We would only hit a problem if the
> function issuing mprotect(sp, PROT_MTE) and its callers already
> assumed a PROT_MTE stack. Without PROT_MTE, an STG would be
> write-ignore, so subsequently turning it on would lead to a mismatch
> between the pointer and the allocation tags.
> 
> So turning on PROT_MTE should happen very early in the user process
> startup code, before any code runs with stack tagging enabled. Whether you
> reach the top of the stack with such mprotect() doesn't really matter
> since up to that point there should not be any use of stack tagging. If
> that's not possible, for example if the glibc code setting up the stack was
> itself compiled with stack tagging, the kernel would have to enable it
> when the user process starts. However, I'd only do this based on some
> ELF note.

Sounds fair.

This early on, the process shouldn't be exposed to arbitrary, untrusted
data.  So it's probably not a problem that tagging isn't turned on right
from the start.

Cheers
---Dave
Patch

diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index 472c77a68225..770535b7ca35 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -129,14 +129,18 @@ 
 
 /*
  * Memory types available.
+ *
+ * IMPORTANT: MT_NORMAL must be index 0 since vm_get_page_prot() may 'or' in
+ *	      the MT_NORMAL_TAGGED memory type for PROT_MTE mappings. Note
+ *	      that protection_map[] only contains MT_NORMAL attributes.
  */
-#define MT_DEVICE_nGnRnE	0
-#define MT_DEVICE_nGnRE		1
-#define MT_DEVICE_GRE		2
-#define MT_NORMAL_NC		3
-#define MT_NORMAL		4
-#define MT_NORMAL_WT		5
-#define MT_NORMAL_TAGGED	6
+#define MT_NORMAL		0
+#define MT_NORMAL_TAGGED	1
+#define MT_NORMAL_NC		2
+#define MT_NORMAL_WT		3
+#define MT_DEVICE_nGnRnE	4
+#define MT_DEVICE_nGnRE		5
+#define MT_DEVICE_GRE		6
 
 /*
  * Memory types for Stage-2 translation
diff --git a/arch/arm64/include/asm/mman.h b/arch/arm64/include/asm/mman.h
new file mode 100644
index 000000000000..c77a23869223
--- /dev/null
+++ b/arch/arm64/include/asm/mman.h
@@ -0,0 +1,64 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_MMAN_H__
+#define __ASM_MMAN_H__
+
+#include <uapi/asm/mman.h>
+
+/*
+ * There are two conditions required for returning a Normal Tagged memory type
+ * in arch_vm_get_page_prot(): (1) the user requested it via PROT_MTE passed
+ * to mmap() or mprotect() and (2) the corresponding vma supports MTE. We
+ * register (1) as VM_MTE in the vma->vm_flags and (2) as VM_MTE_ALLOWED. Note
+ * that the latter can only be set during the mmap() call since mprotect()
+ * does not accept MAP_* flags.
+ */
+static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
+						   unsigned long pkey)
+{
+	if (!system_supports_mte())
+		return 0;
+
+	if (prot & PROT_MTE)
+		return VM_MTE;
+
+	return 0;
+}
+#define arch_calc_vm_prot_bits arch_calc_vm_prot_bits
+
+static inline unsigned long arch_calc_vm_flag_bits(unsigned long flags)
+{
+	if (!system_supports_mte())
+		return 0;
+
+	/*
+	 * Only allow MTE on anonymous mappings as these are guaranteed to be
+	 * backed by tags-capable memory. The vm_flags may be overridden by a
+	 * filesystem supporting MTE (RAM-based).
+	 */
+	if (flags & MAP_ANONYMOUS)
+		return VM_MTE_ALLOWED;
+
+	return 0;
+}
+#define arch_calc_vm_flag_bits arch_calc_vm_flag_bits
+
+static inline pgprot_t arch_vm_get_page_prot(unsigned long vm_flags)
+{
+	return (vm_flags & VM_MTE) && (vm_flags & VM_MTE_ALLOWED) ?
+		__pgprot(PTE_ATTRINDX(MT_NORMAL_TAGGED)) :
+		__pgprot(0);
+}
+#define arch_vm_get_page_prot arch_vm_get_page_prot
+
+static inline bool arch_validate_prot(unsigned long prot, unsigned long addr)
+{
+	unsigned long supported = PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM;
+
+	if (system_supports_mte())
+		supported |= PROT_MTE;
+
+	return (prot & ~supported) == 0;
+}
+#define arch_validate_prot arch_validate_prot
+
+#endif /* !__ASM_MMAN_H__ */
diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index c01b52add377..673033e0393b 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -36,7 +36,7 @@  extern int pfn_valid(unsigned long);
 
 #endif /* !__ASSEMBLY__ */
 
-#define VM_DATA_DEFAULT_FLAGS	VM_DATA_FLAGS_TSK_EXEC
+#define VM_DATA_DEFAULT_FLAGS	(VM_DATA_FLAGS_TSK_EXEC | VM_MTE_ALLOWED)
 
 #include <asm-generic/getorder.h>
 
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 647a3f0c7874..f2cd59b01b27 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -665,8 +665,13 @@  static inline phys_addr_t pgd_page_paddr(pgd_t pgd)
 
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 {
+	/*
+	 * Normal and Normal-Tagged are two different memory types and indices
+	 * in MAIR_EL1. The mask below has to include PTE_ATTRINDX_MASK.
+	 */
 	const pteval_t mask = PTE_USER | PTE_PXN | PTE_UXN | PTE_RDONLY |
-			      PTE_PROT_NONE | PTE_VALID | PTE_WRITE;
+			      PTE_PROT_NONE | PTE_VALID | PTE_WRITE |
+			      PTE_ATTRINDX_MASK;
 	/* preserve the hardware dirty information */
 	if (pte_hw_dirty(pte))
 		pte = pte_mkdirty(pte);
diff --git a/arch/arm64/include/uapi/asm/mman.h b/arch/arm64/include/uapi/asm/mman.h
new file mode 100644
index 000000000000..d7677ee84878
--- /dev/null
+++ b/arch/arm64/include/uapi/asm/mman.h
@@ -0,0 +1,14 @@ 
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI__ASM_MMAN_H
+#define _UAPI__ASM_MMAN_H
+
+#include <asm-generic/mman.h>
+
+/*
+ * The generic mman.h file reserves 0x10 and 0x20 for arch-specific PROT_*
+ * flags.
+ */
+/* 0x10 reserved for PROT_BTI */
+#define PROT_MTE	 0x20		/* Normal Tagged mapping */
+
+#endif /* !_UAPI__ASM_MMAN_H */
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 8d382d4ec067..2f26112ebb77 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -647,6 +647,10 @@  static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 		[ilog2(VM_MERGEABLE)]	= "mg",
 		[ilog2(VM_UFFD_MISSING)]= "um",
 		[ilog2(VM_UFFD_WP)]	= "uw",
+#ifdef CONFIG_ARM64_MTE
+		[ilog2(VM_MTE)]		= "mt",
+		[ilog2(VM_MTE_ALLOWED)]	= "",
+#endif
 #ifdef CONFIG_ARCH_HAS_PKEYS
 		/* These come out via ProtectionKey: */
 		[ilog2(VM_PKEY_BIT0)]	= "",
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5a323422d783..132ca88e407d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -336,6 +336,14 @@  extern unsigned int kobjsize(const void *objp);
 # define VM_MPX		VM_NONE
 #endif
 
+#if defined(CONFIG_ARM64_MTE)
+# define VM_MTE		VM_HIGH_ARCH_0	/* Use Tagged memory for access control */
+# define VM_MTE_ALLOWED	VM_HIGH_ARCH_1	/* Tagged memory permitted */
+#else
+# define VM_MTE		VM_NONE
+# define VM_MTE_ALLOWED	VM_NONE
+#endif
+
 #ifndef VM_GROWSUP
 # define VM_GROWSUP	VM_NONE
 #endif