
[v7,29/29] arm64: mte: Add Memory Tagging Extension documentation

Message ID 20200715170844.30064-30-catalin.marinas@arm.com (mailing list archive)
State New, archived
Series arm64: Memory Tagging Extension user-space support

Commit Message

Catalin Marinas July 15, 2020, 5:08 p.m. UTC
From: Vincenzo Frascino <vincenzo.frascino@arm.com>

Memory Tagging Extension (part of the ARMv8.5 Extensions) provides
a mechanism to detect the sources of memory related errors which
may be vulnerable to exploitation, including bounds violations,
use-after-free, use-after-return, use-out-of-scope and use before
initialization errors.

Add Memory Tagging Extension documentation for the arm64 linux
kernel support.

Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Co-developed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
Cc: Will Deacon <will@kernel.org>
---

Notes:
    v7:
    - Add information on ptrace() regset access (NT_ARM_TAGGED_ADDR_CTRL).
    
    v4:
    - Document behaviour of madvise(MADV_DONTNEED/MADV_FREE).
    - Document the initial process state on fork/execve.
    - Clarify when the kernel uaccess checks the tags.
    - Minor updates to the example code.
    - A few other minor clean-ups following review.
    
    v3:
    - Modify the uaccess checking conditions: only when the sync mode is
      selected by the user. In async mode, the kernel uaccesses are not
      checked.
    - Clarify that an include mask of 0 (exclude mask 0xffff) results in
      always generating tag 0.
    - Document the ptrace() interface.
    
    v2:
    - Documented the uaccess kernel tag checking mode.
    - Removed the BTI definitions from cpu-feature-registers.rst.
    - Removed the paragraph stating that MTE depends on the tagged address
      ABI (while the Kconfig entry does, there is no requirement for the
      user to enable both).
    - Changed the GCR_EL1.Exclude handling description following the change
      in the prctl() interface (include vs exclude mask).
    - Updated the example code.

 Documentation/arm64/cpu-feature-registers.rst |   2 +
 Documentation/arm64/elf_hwcaps.rst            |   4 +
 Documentation/arm64/index.rst                 |   1 +
 .../arm64/memory-tagging-extension.rst        | 305 ++++++++++++++++++
 4 files changed, 312 insertions(+)
 create mode 100644 Documentation/arm64/memory-tagging-extension.rst

Comments

Szabolcs Nagy July 27, 2020, 4:36 p.m. UTC | #1
The 07/15/2020 18:08, Catalin Marinas wrote:
> From: Vincenzo Frascino <vincenzo.frascino@arm.com>
> 
> Memory Tagging Extension (part of the ARMv8.5 Extensions) provides
> a mechanism to detect the sources of memory related errors which
> may be vulnerable to exploitation, including bounds violations,
> use-after-free, use-after-return, use-out-of-scope and use before
> initialization errors.
> 
> Add Memory Tagging Extension documentation for the arm64 linux
> kernel support.
> 
> ...
> diff --git a/Documentation/arm64/cpu-feature-registers.rst b/Documentation/arm64/cpu-feature-registers.rst
...
> +Tag Check Faults
> +----------------
> +
> +When ``PROT_MTE`` is enabled on an address range and a mismatch between
> +the logical and allocation tags occurs on access, there are three
> +configurable behaviours:
> +
> +- *Ignore* - This is the default mode. The CPU (and kernel) ignores the
> +  tag check fault.
> +
> +- *Synchronous* - The kernel raises a ``SIGSEGV`` synchronously, with
> +  ``.si_code = SEGV_MTESERR`` and ``.si_addr = <fault-address>``. The
> +  memory access is not performed. If ``SIGSEGV`` is ignored or blocked
> +  by the offending thread, the containing process is terminated with a
> +  ``coredump``.
> +
> +- *Asynchronous* - The kernel raises a ``SIGSEGV``, in the offending
> +  thread, asynchronously following one or multiple tag check faults,
> +  with ``.si_code = SEGV_MTEAERR`` and ``.si_addr = 0`` (the faulting
> +  address is unknown).
> +
> +The user can select the above modes, per thread, using the
> +``prctl(PR_SET_TAGGED_ADDR_CTRL, flags, 0, 0, 0)`` system call where
> +``flags`` contain one of the following values in the ``PR_MTE_TCF_MASK``
> +bit-field:
> +
> +- ``PR_MTE_TCF_NONE``  - *Ignore* tag check faults
> +- ``PR_MTE_TCF_SYNC``  - *Synchronous* tag check fault mode
> +- ``PR_MTE_TCF_ASYNC`` - *Asynchronous* tag check fault mode
> +
> +The current tag check fault mode can be read using the
> +``prctl(PR_GET_TAGGED_ADDR_CTRL, 0, 0, 0, 0)`` system call.
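The quoted prctl() interface can be exercised with a short C sketch. The PR_* fallback values below match the uapi constants from this series; on kernels or CPUs without MTE the call simply fails (EINVAL) and the helper reports that instead of crashing:

```c
#include <sys/prctl.h>

/* Fallback definitions for pre-MTE linux/prctl.h headers. */
#ifndef PR_SET_TAGGED_ADDR_CTRL
#define PR_SET_TAGGED_ADDR_CTRL	55
#define PR_GET_TAGGED_ADDR_CTRL	56
#define PR_TAGGED_ADDR_ENABLE	(1UL << 0)
#endif
#ifndef PR_MTE_TCF_SHIFT
#define PR_MTE_TCF_SHIFT	1
#define PR_MTE_TCF_NONE		(0UL << PR_MTE_TCF_SHIFT)
#define PR_MTE_TCF_SYNC		(1UL << PR_MTE_TCF_SHIFT)
#define PR_MTE_TCF_ASYNC	(2UL << PR_MTE_TCF_SHIFT)
#define PR_MTE_TCF_MASK		(3UL << PR_MTE_TCF_SHIFT)
#endif

/*
 * Enable tagged addresses with synchronous tag check faults for the
 * current thread. Returns 0 on success, -1 where MTE is unsupported
 * (the prctl() fails with EINVAL on such kernels/CPUs).
 */
static int mte_enable_sync(void)
{
	long ctrl;

	if (prctl(PR_SET_TAGGED_ADDR_CTRL,
		  PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC, 0, 0, 0))
		return -1;
	/* Read the mode back and check the PR_MTE_TCF_MASK bit-field. */
	ctrl = prctl(PR_GET_TAGGED_ADDR_CTRL, 0, 0, 0, 0);
	return (ctrl & PR_MTE_TCF_MASK) == PR_MTE_TCF_SYNC ? 0 : -1;
}
```

Note that, as the documentation says, this only affects the calling thread, which is exactly the limitation discussed below.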

we discussed the need for per process prctl off list, i will
try to summarize the requirement here:

- it cannot be guaranteed in general that a library initializer
or first call into a library happens when the process is still
single threaded.

- user code currently has no way to call prctl in all threads of
a process and even within the c runtime doing so is problematic
(it has to signal all threads, which requires a reserved signal
and dealing with exiting threads and signal masks, such mechanism
can break qemu user and various other userspace tooling).

- we don't yet have a defined contract in userspace about how user
code may enable mte (i.e. use the prctl call), but it seems that
there will be use cases for it: LD_PRELOADing malloc for heap
tagging is one such case, but any library or custom allocator
that wants to use mte will have this issue: when it enables mte
it wants to enable it for all threads in the process. (or at
least all threads managed by the c runtime).

- even if user code is not allowed to call the prctl directly,
i.e. the prctl settings are owned by the libc, there will be
cases when the settings have to be changed in a multithreaded
process (e.g. dlopening a library that requires a particular
mte state).

a solution is to introduce a flag like SECCOMP_FILTER_FLAG_TSYNC
that means the prctl is for all threads in the process not just
for the current one. however the exact semantics is not obvious
if there are inconsistent settings in different threads or user
code tries to use the prctl concurrently: first checking then
setting the mte state via separate prctl calls is racy. but if
the userspace contract for enabling mte limits who and when can
call the prctl then i think the simple sync flag approach works.

(the sync flag should apply to all prctl settings: tagged addr
syscall abi, mte check fault mode, irg tag excludes. ideally it
would work for getting the process wide state and it would fail
in case of inconsistent settings.)

we may need to document some memory ordering details when
memory accesses in other threads are affected, but i think
that can be something simple that leaves it unspecified
what happens with memory accesses that are not synchronized
with the prctl call.
Dave Martin July 28, 2020, 11:08 a.m. UTC | #2
On Mon, Jul 27, 2020 at 05:36:35PM +0100, Szabolcs Nagy wrote:
> The 07/15/2020 18:08, Catalin Marinas wrote:
> > From: Vincenzo Frascino <vincenzo.frascino@arm.com>
> > 
> > ...
> > diff --git a/Documentation/arm64/cpu-feature-registers.rst b/Documentation/arm64/cpu-feature-registers.rst
> ...
> > +Tag Check Faults
> > +----------------
> > +
> > +When ``PROT_MTE`` is enabled on an address range and a mismatch between
> > +the logical and allocation tags occurs on access, there are three
> > +configurable behaviours:
> > +
> > +- *Ignore* - This is the default mode. The CPU (and kernel) ignores the
> > +  tag check fault.
> > +
> > +- *Synchronous* - The kernel raises a ``SIGSEGV`` synchronously, with
> > +  ``.si_code = SEGV_MTESERR`` and ``.si_addr = <fault-address>``. The
> > +  memory access is not performed. If ``SIGSEGV`` is ignored or blocked
> > +  by the offending thread, the containing process is terminated with a
> > +  ``coredump``.
> > +
> > +- *Asynchronous* - The kernel raises a ``SIGSEGV``, in the offending
> > +  thread, asynchronously following one or multiple tag check faults,
> > +  with ``.si_code = SEGV_MTEAERR`` and ``.si_addr = 0`` (the faulting
> > +  address is unknown).
> > +
> > +The user can select the above modes, per thread, using the
> > +``prctl(PR_SET_TAGGED_ADDR_CTRL, flags, 0, 0, 0)`` system call where
> > +``flags`` contain one of the following values in the ``PR_MTE_TCF_MASK``
> > +bit-field:
> > +
> > +- ``PR_MTE_TCF_NONE``  - *Ignore* tag check faults
> > +- ``PR_MTE_TCF_SYNC``  - *Synchronous* tag check fault mode
> > +- ``PR_MTE_TCF_ASYNC`` - *Asynchronous* tag check fault mode
> > +
> > +The current tag check fault mode can be read using the
> > +``prctl(PR_GET_TAGGED_ADDR_CTRL, 0, 0, 0, 0)`` system call.
> 
> we discussed the need for per process prctl off list, i will
> try to summarize the requirement here:
> 
> - it cannot be guaranteed in general that a library initializer
> or first call into a library happens when the process is still
> single threaded.
> 
> - user code currently has no way to call prctl in all threads of
> a process and even within the c runtime doing so is problematic
> (it has to signal all threads, which requires a reserved signal
> and dealing with exiting threads and signal masks, such mechanism
> can break qemu user and various other userspace tooling).

When working on the SVE support, I came to the conclusion that this
kind of thing would normally either be done by the runtime itself, or in
close cooperation with the runtime.  However, for SVE it never makes
sense for one thread to asynchronously change the vector length of
another thread -- that's different from the MTE situation.

> - we don't yet have defined contract in userspace about how user
> code may enable mte (i.e. use the prctl call), but it seems that
> there will be use cases for it: LD_PRELOADing malloc for heap
> tagging is one such case, but any library or custom allocator
> that wants to use mte will have this issue: when it enables mte
> it wants to enable it for all threads in the process. (or at
> least all threads managed by the c runtime).

What are the situations where we anticipate a need to twiddle MTE in
multiple threads simultaneously, other than during process startup?

> - even if user code is not allowed to call the prctl directly,
> i.e. the prctl settings are owned by the libc, there will be
> cases when the settings have to be changed in a multithreaded
> process (e.g. dlopening a library that requires a particular
> mte state).

Could be avoided by refusing to dlopen a library that is incompatible
with the current process.

dlopen()ing a library that doesn't support tagged addresses, in a
process that does use tagged addresses, seems undesirable even if tag
checking is currently turned off.


> a solution is to introduce a flag like SECCOMP_FILTER_FLAG_TSYNC
> that means the prctl is for all threads in the process not just
> for the current one. however the exact semantics is not obvious
> if there are inconsistent settings in different threads or user
> code tries to use the prctl concurrently: first checking then
> setting the mte state via separate prctl calls is racy. but if
> the userspace contract for enabling mte limits who and when can
> call the prctl then i think the simple sync flag approach works.
> 
> (the sync flag should apply to all prctl settings: tagged addr
> syscall abi, mte check fault mode, irg tag excludes. ideally it
> would work for getting the process wide state and it would fail
> in case of inconsistent settings.)

If going down this route, perhaps we could have sets of settings:
so for each setting we have a process-wide value and a per-thread
value, with defined rules about how they combine.
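One hypothetical way to pin such combining rules down (a sketch of the idea, not a kernel interface): rank the tag check fault modes by strictness and let the stricter of the process-wide and per-thread values win. Note that the strictness order differs from the numeric order of the PR_MTE_TCF_* encodings:

```c
#include <sys/prctl.h>

#ifndef PR_MTE_TCF_SHIFT
#define PR_MTE_TCF_SHIFT	1
#define PR_MTE_TCF_NONE		(0UL << PR_MTE_TCF_SHIFT)
#define PR_MTE_TCF_SYNC		(1UL << PR_MTE_TCF_SHIFT)
#define PR_MTE_TCF_ASYNC	(2UL << PR_MTE_TCF_SHIFT)
#endif

/* Strictness order: NONE < ASYNC < SYNC. */
static int strictness(unsigned long tcf)
{
	if (tcf == PR_MTE_TCF_SYNC)
		return 2;
	if (tcf == PR_MTE_TCF_ASYNC)
		return 1;
	return 0;	/* PR_MTE_TCF_NONE */
}

/* Hypothetical combining rule: the stricter setting wins. */
static unsigned long effective_tcf(unsigned long process, unsigned long thread)
{
	return strictness(process) >= strictness(thread) ? process : thread;
}
```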

Since MTE is a debugging feature, we might be able to be less aggressive
about synchronisation than in the SECCOMP case.

> we may need to document some memory ordering details when
> memory accesses in other threads are affected, but i think
> that can be something simple that leaves it unspecified
> what happens with memory accesses that are not synchronized
> with the prctl call.

Hmmm...

Cheers
---Dave
Szabolcs Nagy July 28, 2020, 2:53 p.m. UTC | #3
The 07/28/2020 12:08, Dave Martin wrote:
> On Mon, Jul 27, 2020 at 05:36:35PM +0100, Szabolcs Nagy wrote:
> > The 07/15/2020 18:08, Catalin Marinas wrote:
> > > +The user can select the above modes, per thread, using the
> > > +``prctl(PR_SET_TAGGED_ADDR_CTRL, flags, 0, 0, 0)`` system call where
> > > +``flags`` contain one of the following values in the ``PR_MTE_TCF_MASK``
> > > +bit-field:
> > > +
> > > +- ``PR_MTE_TCF_NONE``  - *Ignore* tag check faults
> > > +- ``PR_MTE_TCF_SYNC``  - *Synchronous* tag check fault mode
> > > +- ``PR_MTE_TCF_ASYNC`` - *Asynchronous* tag check fault mode
> > > +
> > > +The current tag check fault mode can be read using the
> > > +``prctl(PR_GET_TAGGED_ADDR_CTRL, 0, 0, 0, 0)`` system call.
> > 
> > we discussed the need for per process prctl off list, i will
> > try to summarize the requirement here:
> > 
> > - it cannot be guaranteed in general that a library initializer
> > or first call into a library happens when the process is still
> > single threaded.
> > 
> > - user code currently has no way to call prctl in all threads of
> > a process and even within the c runtime doing so is problematic
> > (it has to signal all threads, which requires a reserved signal
> > and dealing with exiting threads and signal masks, such mechanism
> > can break qemu user and various other userspace tooling).
> 
> When working on the SVE support, I came to the conclusion that this
> kind of thing would normally either be done by the runtime itself, or in
> close cooperation with the runtime.  However, for SVE it never makes
> sense for one thread to asynchronously change the vector length of
> another thread -- that's different from the MTE situation.

currently there is a libc mechanism to do some operation
in all threads (e.g. for set*id) but this is fragile
and not something that can be exposed to user code.
(on the kernel side it should be much simpler to do)

> > - we don't yet have defined contract in userspace about how user
> > code may enable mte (i.e. use the prctl call), but it seems that
> > there will be use cases for it: LD_PRELOADing malloc for heap
> > tagging is one such case, but any library or custom allocator
> > that wants to use mte will have this issue: when it enables mte
> > it wants to enable it for all threads in the process. (or at
> > least all threads managed by the c runtime).
> 
> What are the situations where we anticipate a need to twiddle MTE in
> multiple threads simultaneously, other than during process startup?
> 
> > - even if user code is not allowed to call the prctl directly,
> > i.e. the prctl settings are owned by the libc, there will be
> > cases when the settings have to be changed in a multithreaded
> > process (e.g. dlopening a library that requires a particular
> > mte state).
> 
> Could be avoided by refusing to dlopen a library that is incompatible
> with the current process.
> 
> dlopen()ing a library that doesn't support tagged addresses, in a
> process that does use tagged addresses, seems undesirable even if tag
> checking is currently turned off.

yes but it can go the other way too:

at startup the libc does not enable tag checks
for performance reasons, but at dlopen time a
library is detected to use mte (e.g. stack
tagging or custom allocator).

then libc or the dlopened library has to ensure
that checks are enabled in all threads. (in case
of stack tagging the libc has to mark existing
stacks with PROT_MTE too, there is mechanism for
this in glibc to deal with dlopened libraries
that require executable stack and only reject
the dlopen if this cannot be performed.)

another usecase is that the libc is mte-safe
(it accepts tagged pointers and memory in its
interfaces), but it does not enable mte (this
will be the case with glibc 2.32) and user
libraries have to enable mte to use it (custom
allocator or malloc interposition are examples).

and i think this is necessary if userspace wants
to turn async tag checks into sync tag checks at
runtime when a failure is detected.

> > a solution is to introduce a flag like SECCOMP_FILTER_FLAG_TSYNC
> > that means the prctl is for all threads in the process not just
> > for the current one. however the exact semantics is not obvious
> > if there are inconsistent settings in different threads or user
> > code tries to use the prctl concurrently: first checking then
> > setting the mte state via separate prctl calls is racy. but if
> > the userspace contract for enabling mte limits who and when can
> > call the prctl then i think the simple sync flag approach works.
> > 
> > (the sync flag should apply to all prctl settings: tagged addr
> > syscall abi, mte check fault mode, irg tag excludes. ideally it
> > would work for getting the process wide state and it would fail
> > in case of inconsistent settings.)
> 
> If going down this route, perhaps we could have sets of settings:
> so for each setting we have a process-wide value and a per-thread
> value, with defined rules about how they combine.
> 
> Since MTE is a debugging feature, we might be able to be less aggressive
> about synchronisation than in the SECCOMP case.

separate process-wide and per-thread value
works for me and i expect most uses will
be process wide settings.

i don't think mte is less of a security
feature than seccomp.

if linux does not want to add a per process
setting then only libc will be able to opt-in
to mte and only at very early in the startup
process (before executing any user code that
may start threads). this is not out of question,
but i think it limits the usage and deployment
options.

> > we may need to document some memory ordering details when
> > memory accesses in other threads are affected, but i think
> > that can be something simple that leaves it unspecified
> > what happens with memory accesses that are not synchronized
> > with the prctl call.
> 
> Hmmm...

e.g. it may be enough if the spec only works if
there is no PROT_MTE memory mapped yet, and no
tagged addresses are present in the multi-threaded
process when the prctl is called.
Catalin Marinas July 28, 2020, 7:59 p.m. UTC | #4
On Tue, Jul 28, 2020 at 03:53:51PM +0100, Szabolcs Nagy wrote:
> The 07/28/2020 12:08, Dave Martin wrote:
> > On Mon, Jul 27, 2020 at 05:36:35PM +0100, Szabolcs Nagy wrote:
> > > a solution is to introduce a flag like SECCOMP_FILTER_FLAG_TSYNC
> > > that means the prctl is for all threads in the process not just
> > > for the current one. however the exact semantics is not obvious
> > > if there are inconsistent settings in different threads or user
> > > code tries to use the prctl concurrently: first checking then
> > > setting the mte state via separate prctl calls is racy. but if
> > > the userspace contract for enabling mte limits who and when can
> > > call the prctl then i think the simple sync flag approach works.
> > > 
> > > (the sync flag should apply to all prctl settings: tagged addr
> > > syscall abi, mte check fault mode, irg tag excludes. ideally it
> > > would work for getting the process wide state and it would fail
> > > in case of inconsistent settings.)
> > 
> > If going down this route, perhaps we could have sets of settings:
> > so for each setting we have a process-wide value and a per-thread
> > value, with defined rules about how they combine.
> > 
> > Since MTE is a debugging feature, we might be able to be less aggressive
> > about synchronisation than in the SECCOMP case.
> 
> separate process-wide and per-thread value
> works for me and i expect most uses will
> be process wide settings.

The problem with the thread synchronisation is, unlike SECCOMP, that we
need to update the SCTLR_EL1.TCF0 field across all the CPUs that may run
threads of the current process. I haven't convinced myself that this is
race-free without heavy locking. If we go for some heavy mechanism like
stop_machine(), that opens the kernel to DoS attacks from userspace. Still
investigating if something like membarrier() would be sufficient.

SECCOMP gets away with this as it only needs to set some variable
without IPI'ing the other CPUs.

> i don't think mte is less of a security
> feature than seccomp.

Well, MTE is probabilistic, SECCOMP seems to be more precise ;).

> if linux does not want to add a per process
> setting then only libc will be able to opt-in
> to mte and only at very early in the startup
> process (before executing any user code that
> may start threads). this is not out of question,
> but i think it limits the usage and deployment
> options.

There is also the risk that we try to be too flexible at this stage
without a real use-case.
Szabolcs Nagy Aug. 3, 2020, 12:43 p.m. UTC | #5
The 07/28/2020 20:59, Catalin Marinas wrote:
> On Tue, Jul 28, 2020 at 03:53:51PM +0100, Szabolcs Nagy wrote:
> > if linux does not want to add a per process
> > setting then only libc will be able to opt-in
> > to mte and only at very early in the startup
> > process (before executing any user code that
> > may start threads). this is not out of question,
> > but i think it limits the usage and deployment
> > options.
> 
> There is also the risk that we try to be too flexible at this stage
> without a real use-case.

i don't know how mte will be turned on in libc.

if we can always turn sync tag checks on early
whenever mte is available then i think there is
no issue.

but if we have to make the decision later for
compatibility or performance reasons then per
thread setting is problematic.

use of the prctl outside of libc is very limited
if it's per thread only:

- application code may use it in an (elf specific)
  pre-initialization function, but that's a bit
  obscure (not exposed in c) and it is reasonable
  for an application to enable mte checks after
  it registered a signal handler for mte faults.
  (and at that point it may be multi-threaded).

- library code normally initializes per thread
  state on the first call into the library from a
  given thread, but with mte, as soon as memory /
  pointers are tagged in one thread, all threads
  are affected: not performing checks in other
  threads is less secure (may be ok) and it means
  incompatible syscall abi (not ok). so at least
  PR_TAGGED_ADDR_ENABLE should have process wide
  setting for this usage.

but i guess it is fine to design the mechanism
for these in a later linux version, until then
such usage will be unreliable (will depend on
how early threads are created).
Catalin Marinas Aug. 7, 2020, 3:19 p.m. UTC | #6
Hi Szabolcs,

On Mon, Aug 03, 2020 at 01:43:10PM +0100, Szabolcs Nagy wrote:
> The 07/28/2020 20:59, Catalin Marinas wrote:
> > On Tue, Jul 28, 2020 at 03:53:51PM +0100, Szabolcs Nagy wrote:
> > > if linux does not want to add a per process setting then only libc
> > > will be able to opt-in to mte and only at very early in the
> > > startup process (before executing any user code that may start
> > > threads). this is not out of question, but i think it limits the
> > > usage and deployment options.
> > 
> > There is also the risk that we try to be too flexible at this stage
> > without a real use-case.
> 
> i don't know how mte will be turned on in libc.
> 
> if we can always turn sync tag checks on early whenever mte is
> available then i think there is no issue.
> 
> but if we have to make the decision later for compatibility or
> performance reasons then per thread setting is problematic.

At least for libc, I'm not sure how you could even turn MTE on at
run-time. The heap allocations would have to be mapped with PROT_MTE as
we can't easily change them (well, you could mprotect(), assuming the
user doesn't use tagged pointers on them).

There is a case to switch tag checking from asynchronous to synchronous
at run-time based on a signal but that's rather specific to Android
where zygote controls the signal handler. I don't think you can do this
with libc. Even on Android, since the async fault signal is delivered
per thread, it probably does this lazily (alternatively, it could issue
a SIGUSRx to the other threads for synchronisation).

> use of the prctl outside of libc is very limited if it's per thread
> only:

In the non-Android context, I think the prctl() for MTE control should
be restricted to the libc. You can control the mode prior to the process
being started using environment variables. I really don't see how the
libc could handle the changing of the MTE behaviour at run-time without
itself handling signals.

> - application code may use it in a (elf specific) pre-initialization
>   function, but that's a bit obscure (not exposed in c) and it is
>   reasonable for an application to enable mte checks after it
>   registered a signal handler for mte faults. (and at that point it
>   may be multi-threaded).

Since the app can install signal handlers, it can also deal with
notifying other threads with a SIGUSRx, assuming that it decided this
after multiple threads were created. If it does this while
single-threaded, subsequent threads would inherit the first one.

The only use-case I see for doing this in the kernel is if the code
requiring an MTE behaviour change cannot install signal handlers. More
on this below.

> - library code normally initializes per thread state on the first call
>   into the library from a given thread, but with mte, as soon as
>   memory / pointers are tagged in one thread, all threads are
>   affected: not performing checks in other threads is less secure (may
>   be ok) and it means incompatible syscall abi (not ok). so at least
>   PR_TAGGED_ADDR_ENABLE should have process wide setting for this
>   usage.

My assumption with MTE is that the libc will initialise it when the
library is loaded (something __attribute__((constructor))) and it's
still in single-threaded mode. Does it wait until the first malloc()
call? Also, is there such thing as a per-thread initialiser for a
dynamic library (not sure it can be implemented in practice though)?

The PR_TAGGED_ADDR_ENABLE synchronisation at least doesn't require IPIs
to other CPUs to change the hardware state. However, it can still race
with thread creation or a prctl() on another thread, not sure what we
can define here, especially as it depends on the kernel internals: e.g.
thread creation copies some data structures of the calling thread but at
the same time another thread wants to change such structures for all
threads of that process. The ordering of events here looks pretty
fragile.

Maybe another global status (per process) which takes priority over
the per-thread one would be easier. But such priority is not temporal
(i.e. whoever called prctl() last) but pretty strict: once a global
control was requested, it will remain global no matter what subsequent
threads request (or we can do it the other way around).

> but i guess it is fine to design the mechanism for these in a later
> linux version, until then such usage will be unreliable (will depend
> on how early threads are created).

Until we have a real use-case, I'd not complicate the matters further.
For example, I'm still not sure how realistic it is for an application
to load a new heap allocator after some threads were created. Even the
glibc support, I don't think it needs this.

Could an LD_PRELOADED library be initialised after threads were created
(I guess it could if another preloaded library created threads)? Even if
it does, do we have an example, or is it rather theoretical?

If this becomes an essential use-case, we can look at adding a new flag
for prctl() which would set the option globally, with the caveats
mentioned above. It doesn't need to be in the initial ABI (and the
PR_TAGGED_ADDR_ENABLE is already upstream).

Thanks.
Szabolcs Nagy Aug. 10, 2020, 2:13 p.m. UTC | #7
The 08/07/2020 16:19, Catalin Marinas wrote:
> On Mon, Aug 03, 2020 at 01:43:10PM +0100, Szabolcs Nagy wrote:
> > if we can always turn sync tag checks on early whenever mte is
> > available then i think there is no issue.
> > 
> > but if we have to make the decision later for compatibility or
> > performance reasons then per thread setting is problematic.
> 
> At least for libc, I'm not sure how you could even turn MTE on at
> run-time. The heap allocations would have to be mapped with PROT_MTE as
> we can't easily change them (well, you could mprotect(), assuming the
> user doesn't use tagged pointers on them).

e.g. dlopen of library with stack tagging.
(libc can mark stacks with PROT_MTE at that time)

or just turn on sync tag checks later when using
heap tagging.

> 
> There is a case to switch tag checking from asynchronous to synchronous
> at run-time based on a signal but that's rather specific to Android
> where zygote controls the signal handler. I don't think you can do this
> with libc. Even on Android, since the async fault signal is delivered
> per thread, it probably does this lazily (alternatively, it could issue
> a SIGUSRx to the other threads for synchronisation).

i think what the zygote is doing is a valid use-case but
in a normal linux setup the application owns the signal
handlers so the tag check switch has to be done by the
application. the libc can expose some api for it, so in
principle it's enough if the libc can do the runtime
switch, but we don't plan to add new libc apis for mte.

> > use of the prctl outside of libc is very limited if it's per thread
> > only:
> 
> In the non-Android context, I think the prctl() for MTE control should
> be restricted to the libc. You can control the mode prior to the process
> being started using environment variables. I really don't see how the
> libc could handle the changing of the MTE behaviour at run-time without
> itself handling signals.
> 
> > - application code may use it in a (elf specific) pre-initialization
> >   function, but that's a bit obscure (not exposed in c) and it is
> >   reasonable for an application to enable mte checks after it
> >   registered a signal handler for mte faults. (and at that point it
> >   may be multi-threaded).
> 
> Since the app can install signal handlers, it can also deal with
> notifying other threads with a SIGUSRx, assuming that it decided this
> after multiple threads were created. If it does this while
> single-threaded, subsequent threads would inherit the first one.

the application does not know what libraries create what
threads in the background, and i don't think there is a way
to send signals to each thread (e.g. /proc/self/task cannot
be read atomically with respect to thread creation/exit).

the libc controls thread creation and exit so it can have
a list of threads it can notify, but an application cannot
do this. (libc could provide an api so applications can
do some per thread operation, but a libc would not do this
happily: currently there are locks around thread creation
and exit that are only needed for this "signal all threads"
mechanism which makes it hard to expose to users)

one way applications sometimes work around this is to
self re-exec. but that's a big hammer and not entirely
reliable (e.g. the exe may not be available on the
filesystem any more, or the command line args may need
to be preserved but are clobbered, or some complex
application state needs to be recreated, etc.)

> 
> The only use-case I see for doing this in the kernel is if the code
> requiring an MTE behaviour change cannot install signal handlers. More
> on this below.
> 
> > - library code normally initializes per thread state on the first call
> >   into the library from a given thread, but with mte, as soon as
> >   memory / pointers are tagged in one thread, all threads are
> >   affected: not performing checks in other threads is less secure (may
> >   be ok) and it means incompatible syscall abi (not ok). so at least
> >   PR_TAGGED_ADDR_ENABLE should have process wide setting for this
> >   usage.
> 
> My assumption with MTE is that the libc will initialise it when the
> library is loaded (something __attribute__((constructor))) and it's
> still in single-threaded mode. Does it wait until the first malloc()
> call? Also, is there such thing as a per-thread initialiser for a
> dynamic library (not sure it can be implemented in practice though)?

there is no per thread initializer in an elf module.
(tls state is usually initialized lazily in threads
when necessary.)

malloc calls can happen before the ctors of an LD_PRELOAD
library and threads can be created before both.
glibc runs ldpreload ctors after other library ctors.

a custom allocator can of course be dlopened.
(i'd expect several language runtimes to have their
own allocator and to support dlopening the runtime)

> 
> The PR_TAGGED_ADDR_ENABLE synchronisation at least doesn't require IPIs
> to other CPUs to change the hardware state. However, it can still race
> with thread creation or a prctl() on another thread, not sure what we
> can define here, especially as it depends on the kernel internals: e.g.
> thread creation copies some data structures of the calling thread but at
> the same time another thread wants to change such structures for all
> threads of that process. The ordering of events here looks pretty
> fragile.
> 
> Maybe with another global status (per process) which takes priority over
> the per thread one would be easier. But such priority is not temporal
> (i.e. whoever called prctl() last) but pretty strict: once a global
> control was requested, it will remain global no matter what subsequent
> threads request (or we can do it the other way around).

i see.

> > but i guess it is fine to design the mechanism for these in a later
> > linux version, until then such usage will be unreliable (will depend
> > on how early threads are created).
> 
> Until we have a real use-case, I'd not complicate the matters further.
> For example, I'm still not sure how realistic it is for an application
> to load a new heap allocator after some threads were created. Even the
> glibc support, I don't think it needs this.
> 
> Could an LD_PRELOADED library be initialised after threads were created
> (I guess it could if another preloaded library created threads)? Even if
> it does, do we have an example or it's rather theoretical.

i believe this happens e.g. in applications built
with tsan. (the thread sanitizer creates a
background thread early which i think does not
call malloc itself but may want to access
malloced memory, but i don't have a setup with
tsan support to test this)

> 
> If this becomes an essential use-case, we can look at adding a new flag
> for prctl() which would set the option globally, with the caveats
> mentioned above. It doesn't need to be in the initial ABI (and the
> PR_TAGGED_ADDR_ENABLE is already upstream).
> 
> Thanks.
> 
> -- 
> Catalin
Catalin Marinas Aug. 11, 2020, 5:20 p.m. UTC | #8
On Mon, Aug 10, 2020 at 03:13:09PM +0100, Szabolcs Nagy wrote:
> The 08/07/2020 16:19, Catalin Marinas wrote:
> > On Mon, Aug 03, 2020 at 01:43:10PM +0100, Szabolcs Nagy wrote:
> > > if we can always turn sync tag checks on early whenever mte is
> > > available then i think there is no issue.
> > > 
> > > but if we have to make the decision later for compatibility or
> > > performance reasons then per thread setting is problematic.
> > 
> > At least for libc, I'm not sure how you could even turn MTE on at
> > run-time. The heap allocations would have to be mapped with PROT_MTE as
> > we can't easily change them (well, you could mprotect(), assuming the
> > user doesn't use tagged pointers on them).
> 
> e.g. dlopen of library with stack tagging. (libc can mark stacks with
> PROT_MTE at that time)

If we allow such mixed object support with stack tagging enabled at
dlopen, PROT_MTE would need to be turned on for each thread stack. This
wouldn't require synchronisation, only knowing where the thread stacks
are, but you'd need to make sure threads don't call into the new library
until the stacks have been mprotect'ed. Doing this midway through a
function execution may corrupt the tags.

So I'm not sure how safe any of this is without explicit user
synchronisation (i.e. don't call into the library until all threads have
been updated). Even changing options like GCR_EL1.Excl across multiple
threads may have unwanted effects. See this comment from Peter, the
difference being that instead of an explicit prctl() call on the current
stack, another thread would do it:

https://lore.kernel.org/linux-arch/CAMn1gO5rhOG1W+nVe103v=smvARcFFp_Ct9XqH2Ca4BUMfpDdg@mail.gmail.com/

> or just turn on sync tag checks later when using heap tagging.

I wonder whether setting the synchronous tag check mode by default would
improve this aspect. It would not have any effect until PROT_MTE is
used. If software wants better performance, it can explicitly opt in to
asynchronous mode, or disable tag checking after some SIGSEGV +
reporting (this shouldn't exclude the environment variables you
currently use for controlling the tag check mode).

Also, if there are saner defaults for the user GCR_EL1.Excl (currently
all masked), we should decide them now.

If stack tagging will come with some ELF information, we could make the
default tag checking and GCR_EL1.Excl choices based on that, otherwise
maybe we should revisit the default configuration the kernel sets for
the user in the absence of any other information.

> > There is a case to switch tag checking from asynchronous to synchronous
> > at run-time based on a signal but that's rather specific to Android
> > where zygote controls the signal handler. I don't think you can do this
> > with libc. Even on Android, since the async fault signal is delivered
> > per thread, it probably does this lazily (alternatively, it could issue
> > a SIGUSRx to the other threads for synchronisation).
> 
> i think what that zygote is doing is a valid use-case but
> in a normal linux setup the application owns the signal
> handlers so the tag check switch has to be done by the
> application. the libc can expose some api for it, so in
> principle it's enough if the libc can do the runtime
> switch, but we dont plan to add new libc apis for mte.

Due to the synchronisation aspect especially regarding the stack
tagging, I'm not sure the kernel alone can safely do this.

Changing the tagged address syscall ABI across multiple threads should
be safer (well, at least the relaxing part). But if we don't solve the
other aspects I mentioned above, I don't think there is much point in
only doing it for this.

> > > - library code normally initializes per thread state on the first call
> > >   into the library from a given thread, but with mte, as soon as
> > >   memory / pointers are tagged in one thread, all threads are
> > >   affected: not performing checks in other threads is less secure (may
> > >   be ok) and it means incompatible syscall abi (not ok). so at least
> > >   PR_TAGGED_ADDR_ENABLE should have process wide setting for this
> > >   usage.
> > 
> > My assumption with MTE is that the libc will initialise it when the
> > library is loaded (something __attribute__((constructor))) and it's
> > still in single-threaded mode. Does it wait until the first malloc()
> > call? Also, is there such thing as a per-thread initialiser for a
> > dynamic library (not sure it can be implemented in practice though)?
> 
> there is no per thread initializer in an elf module.
> (tls state is usually initialized lazily in threads
> when necessary.)
> 
> malloc calls can happen before the ctors of an LD_PRELOAD
> library and threads can be created before both.
> glibc runs ldpreload ctors after other library ctors.

In the presence of stack tagging, I think any subsequent MTE config
change across all threads is unsafe, irrespective of whether it's done
by the kernel or user via SIGUSRx. I think the best we can do here is
start with more appropriate defaults or enable them based on an ELF note
before the application is started. The dynamic loader would not have to
do anything extra here.

If we ignore stack tagging, the global configuration change may be
achievable. I think for the MTE bits, this could be done lazily by the
libc (e.g. on a malloc()/free() call). The tag checking won't happen
before such calls unless we change the kernel defaults. There is still
the tagged address ABI enabling; could this be done lazily on syscall by
the libc? If not, the kernel could synchronise (force) this on syscall
entry from each thread based on some global prctl() bit.
Szabolcs Nagy Aug. 12, 2020, 12:45 p.m. UTC | #9
The 08/11/2020 18:20, Catalin Marinas wrote:
> On Mon, Aug 10, 2020 at 03:13:09PM +0100, Szabolcs Nagy wrote:
> > The 08/07/2020 16:19, Catalin Marinas wrote:
> > > On Mon, Aug 03, 2020 at 01:43:10PM +0100, Szabolcs Nagy wrote:
> > > > if we can always turn sync tag checks on early whenever mte is
> > > > available then i think there is no issue.
> > > > 
> > > > but if we have to make the decision later for compatibility or
> > > > performance reasons then per thread setting is problematic.
> > > 
> > > At least for libc, I'm not sure how you could even turn MTE on at
> > > run-time. The heap allocations would have to be mapped with PROT_MTE as
> > > we can't easily change them (well, you could mprotect(), assuming the
> > > user doesn't use tagged pointers on them).
> > 
> > e.g. dlopen of library with stack tagging. (libc can mark stacks with
> > PROT_MTE at that time)
> 
> If we allow such mixed object support with stack tagging enabled at
> dlopen, PROT_MTE would need to be turned on for each thread stack. This
> wouldn't require synchronisation, only knowing where the thread stacks
> are, but you'd need to make sure threads don't call into the new library
> until the stacks have been mprotect'ed. Doing this midway through a
> function execution may corrupt the tags.
> 
> So I'm not sure how safe any of this is without explicit user
> synchronisation (i.e. don't call into the library until all threads have
> been updated). Even changing options like GCR_EL1.Excl across multiple
> threads may have unwanted effects. See this comment from Peter, the
> difference being that instead of an explicit prctl() call on the current
> stack, another thread would do it:
> 
> https://lore.kernel.org/linux-arch/CAMn1gO5rhOG1W+nVe103v=smvARcFFp_Ct9XqH2Ca4BUMfpDdg@mail.gmail.com/

there is no midway problem: the libc (ld.so) would do
the PROT_MTE at dlopen time based on some elf marking
(which can be handled before relocation processing,
so before any library code can run; the midway problem
happens when a library, e.g. libc, wants to turn on
stack tagging on itself).

the libc already does this when a library is loaded
that requires executable stack (it marks stacks as
PROT_EXEC at dlopen time, or fails the dlopen if that
is not possible. this does not require running code
in other threads, only synchronization with thread
creation and exit. but changing the check mode for
mte needs per-thread code execution.)

i'm not entirely sure if this is a good idea, but i
expect stack tagging not to be used in the libc
(because libc needs to run on all hw and we don't yet
have a backward compatible stack tagging solution),
so stack tagging should work when only some elf modules
in a process are built with it, which implies that
enabling it at dlopen time should work otherwise
it will not be very useful.

> > or just turn on sync tag checks later when using heap tagging.
> 
> I wonder whether setting the synchronous tag check mode by default would
> improve this aspect. This would not have any effect until PROT_MTE is
> used. If software wants some better performance they can explicitly opt
> in to asynchronous mode or disable tag checking after some SIGSEGV +
> reporting (this shouldn't exclude the environment variables you
> currently use for controlling the tag check mode).
> 
> Also, if there are saner defaults for the user GCR_EL1.Excl (currently
> all masked), we should decide them now.
> 
> If stack tagging will come with some ELF information, we could make the
> default tag checking and GCR_EL1.Excl choices based on that, otherwise
> maybe we should revisit the default configuration the kernel sets for
> the user in the absence of any other information.

do tag checks have overhead if PROT_MTE is not used?
i'd expect some checks are still done at memory access.
(and the tagged address syscall abi has to be in use.)

turning sync tag checks on early would enable most of the
interesting use-cases (only PROT_MTE has to be handled
at runtime, not the prctls. however i don't yet know how
userspace will deal with compat issues, i.e. it may not
be valid to unconditionally turn tag checks on early).

> > > > - library code normally initializes per thread state on the first call
> > > >   into the library from a given thread, but with mte, as soon as
> > > >   memory / pointers are tagged in one thread, all threads are
> > > >   affected: not performing checks in other threads is less secure (may
> > > >   be ok) and it means incompatible syscall abi (not ok). so at least
> > > >   PR_TAGGED_ADDR_ENABLE should have process wide setting for this
> > > >   usage.
> > > 
> > > My assumption with MTE is that the libc will initialise it when the
> > > library is loaded (something __attribute__((constructor))) and it's
> > > still in single-threaded mode. Does it wait until the first malloc()
> > > call? Also, is there such thing as a per-thread initialiser for a
> > > dynamic library (not sure it can be implemented in practice though)?
> > 
> > there is no per thread initializer in an elf module.
> > (tls state is usually initialized lazily in threads
> > when necessary.)
> > 
> > malloc calls can happen before the ctors of an LD_PRELOAD
> > library and threads can be created before both.
> > glibc runs ldpreload ctors after other library ctors.
> 
> In the presence of stack tagging, I think any subsequent MTE config
> change across all threads is unsafe, irrespective of whether it's done
> by the kernel or user via SIGUSRx. I think the best we can do here is
> start with more appropriate defaults or enable them based on an ELF note
> before the application is started. The dynamic loader would not have to
> do anything extra here.
> 
> If we ignore stack tagging, the global configuration change may be
> achievable. I think for the MTE bits, this could be done lazily by the
> libc (e.g. on malloc()/free() call). The tag checking won't happen
> before such calls unless we change the kernel defaults. There is still
> the tagged address ABI enabling, could this be done lazily on syscall by
> the libc? If not, the kernel could synchronise (force) this on syscall
> entry from each thread based on some global prctl() bit.

i think the interesting use-cases are all about
changing mte settings before mte is in use in any
way but after there are multiple threads.
(the async -> sync mode change on tag faults is
i think less interesting to the gnu linux world.)

i guess lazy syscall abi switch works, but it is
ugly: raw syscall usage will be problematic and
doing checks before calling into the vdso might
have unwanted overhead.

based on the discussion it seems we should design
the userspace abis so that a per-process prctl is
not required, and then see how far we get.
Catalin Marinas Aug. 19, 2020, 9:54 a.m. UTC | #10
On Wed, Aug 12, 2020 at 01:45:21PM +0100, Szabolcs Nagy wrote:
> On 08/11/2020 18:20, Catalin Marinas wrote:
> > If we allow such mixed object support with stack tagging enabled at
> > dlopen, PROT_MTE would need to be turned on for each thread stack. This
> > wouldn't require synchronisation, only knowing where the thread stacks
> > are, but you'd need to make sure threads don't call into the new library
> > until the stacks have been mprotect'ed. Doing this midway through a
> > function execution may corrupt the tags.
> > 
> > So I'm not sure how safe any of this is without explicit user
> > synchronisation (i.e. don't call into the library until all threads have
> > been updated). Even changing options like GCR_EL1.Excl across multiple
> > threads may have unwanted effects. See this comment from Peter, the
> > difference being that instead of an explicit prctl() call on the current
> > stack, another thread would do it:
> > 
> > https://lore.kernel.org/linux-arch/CAMn1gO5rhOG1W+nVe103v=smvARcFFp_Ct9XqH2Ca4BUMfpDdg@mail.gmail.com/
> 
> there is no midway problem: the libc (ld.so) would do the PROT_MTE at
> dlopen time based on some elf marking (which can be handled before
> relocation processing, so before library code can run, the midway
> problem happens when a library, e.g libc, wants to turn on stack
> tagging on itself).

OK, that makes sense, you can't call into the new object until the
relocations have been resolved.

> the libc already does this when a library is loaded that requires
> executable stack (it marks stacks as PROT_EXEC at dlopen time or fails
> the dlopen if that is not possible, this does not require running code
> in other threads, only synchronization with thread creation and exit.
> but changing the check mode for mte needs per thread code execution.).
> 
> i'm not entirely sure if this is a good idea, but i expect stack
> tagging not to be used in the libc (because libc needs to run on all
> hw and we don't yet have a backward compatible stack tagging
> solution),

In theory, you could have two libc deployed in your distro and ldd gets
smarter to pick the right one. I still hope we'd find a compromise with
stack tagging and single binary.

> so stack tagging should work when only some elf modules in a process
> are built with it, which implies that enabling it at dlopen time
> should work otherwise it will not be very useful.

There is still the small risk of an old object using tagged pointers to
the stack. Since the stack would be shared between such objects, turning
PROT_MTE on would cause issues. Hopefully such problems are minor and
not really a concern for the kernel.

> do tag checks have overhead if PROT_MTE is not used? i'd expect some
> checks are still done at memory access. (and the tagged address
> syscall abi has to be in use.)

My understanding from talking to hardware engineers is that there won't
be an overhead if PROT_MTE is not used, no tags being fetched or
checked. But I can't guarantee until we get real silicon.

> turning sync tag checks on early would enable the most of the
> interesting usecases (only PROT_MTE has to be handled at runtime not
> the prctls. however i don't yet know how userspace will deal with
> compat issues, i.e. it may not be valid to unconditionally turn tag
> checks on early).

If we change the defaults so that no prctl() is required for the
standard use-case, it would solve most of the common deployment issues:

1. Tagged address ABI default on when HWCAP2_MTE is present
2. Synchronous TCF by default
3. GCR_EL1.Excl allows all tags except 0 by default

Any other configuration diverging from the above is considered
specialist deployment and will have to issue the prctl() on a per-thread
basis.

Compat issues in user-space will be dealt with via environment
variables but pretty much on/off rather than fine-grained tag checking
mode. So for glibc, you'd have only _MTAG=0 or 1 and the only effect is
using PROT_MTE + tagged pointers or no-PROT_MTE + tag 0.

> > In the presence of stack tagging, I think any subsequent MTE config
> > change across all threads is unsafe, irrespective of whether it's done
> > by the kernel or user via SIGUSRx. I think the best we can do here is
> > start with more appropriate defaults or enable them based on an ELF note
> > before the application is started. The dynamic loader would not have to
> > do anything extra here.
> > 
> > If we ignore stack tagging, the global configuration change may be
> > achievable. I think for the MTE bits, this could be done lazily by the
> > libc (e.g. on malloc()/free() call). The tag checking won't happen
> > before such calls unless we change the kernel defaults. There is still
> > the tagged address ABI enabling, could this be done lazily on syscall by
> > the libc? If not, the kernel could synchronise (force) this on syscall
> > entry from each thread based on some global prctl() bit.
> 
> i think the interesting use-cases are all about changing mte settings
> before mte is in use in any way but after there are multiple threads.
> (the async -> sync mode change on tag faults is i think less
> interesting to the gnu linux world.)

So let's consider async/sync/no-check as specialist uses which glibc
would not have to handle. I don't think async mode is useful on its own
unless you have a way to turn on sync mode at run-time for more precise
error identification (well, hoping that the fault will happen again).

> i guess lazy syscall abi switch works, but it is ugly: raw syscall
> usage will be problematic and doing checks before calling into the
> vdso might have unwanted overhead.

This lazy ABI switch could be handled by the kernel, though I wonder
whether we should just relax it permanently when HWCAP2_MTE is present.
Szabolcs Nagy Aug. 20, 2020, 4:43 p.m. UTC | #11
adding libc-alpha on cc, they might have
additional input about compat issue handling:

the discussion is about enabling tag checks,
which cannot reasonably be done at runtime when
a process is already multi-threaded, so it has to
be decided early, either in the libc or in the
kernel. such an early decision might have backward
compat issues (we don't yet know if any allocator
wants to use tagging opportunistically later for
debugging, or if there will be code loaded later
that is incompatible with tag checks).

The 08/19/2020 10:54, Catalin Marinas wrote:
> On Wed, Aug 12, 2020 at 01:45:21PM +0100, Szabolcs Nagy wrote:
> > On 08/11/2020 18:20, Catalin Marinas wrote:
> > > If we allow such mixed object support with stack tagging enabled at
> > > dlopen, PROT_MTE would need to be turned on for each thread stack. This
> > > wouldn't require synchronisation, only knowing where the thread stacks
> > > are, but you'd need to make sure threads don't call into the new library
> > > until the stacks have been mprotect'ed. Doing this midway through a
> > > function execution may corrupt the tags.
...
> > there is no midway problem: the libc (ld.so) would do the PROT_MTE at
> > dlopen time based on some elf marking (which can be handled before
> > relocation processing, so before library code can run, the midway
> > problem happens when a library, e.g libc, wants to turn on stack
> > tagging on itself).
> 
> OK, that makes sense, you can't call into the new object until the
> relocations have been resolved.
> 
> > the libc already does this when a library is loaded that requires
> > executable stack (it marks stacks as PROT_EXEC at dlopen time or fails
> > the dlopen if that is not possible, this does not require running code
> > in other threads, only synchronization with thread creation and exit.
> > but changing the check mode for mte needs per thread code execution.).
> > 
> > i'm not entirely sure if this is a good idea, but i expect stack
> > tagging not to be used in the libc (because libc needs to run on all
> > hw and we don't yet have a backward compatible stack tagging
> > solution),
> 
> In theory, you could have two libc deployed in your distro and ldd gets
> smarter to pick the right one. I still hope we'd find a compromise with
> stack tagging and single binary.

distros don't like the two libc solution, i
think we will look at backward compat stack
tagging support, but we might end up not
having any stack tagging in libc, only in
user binaries (used for debugging).

> > so stack tagging should work when only some elf modules in a process
> > are built with it, which implies that enabling it at dlopen time
> > should work otherwise it will not be very useful.
> 
> There is still the small risk of an old object using tagged pointers to
> the stack. Since the stack would be shared between such objects, turning
> PROT_MTE on would cause issues. Hopefully such problems are minor and
> not really a concern for the kernel.
> 
> > do tag checks have overhead if PROT_MTE is not used? i'd expect some
> > checks are still done at memory access. (and the tagged address
> > syscall abi has to be in use.)
> 
> My understanding from talking to hardware engineers is that there won't
> be an overhead if PROT_MTE is not used, no tags being fetched or
> checked. But I can't guarantee until we get real silicon.
> 
> > turning sync tag checks on early would enable the most of the
> > interesting usecases (only PROT_MTE has to be handled at runtime not
> > the prctls. however i don't yet know how userspace will deal with
> > compat issues, i.e. it may not be valid to unconditionally turn tag
> > checks on early).
> 
> If we change the defaults so that no prctl() is required for the
> standard use-case, it would solve most of the common deployment issues:
> 
> 1. Tagged address ABI default on when HWCAP2_MTE is present
> 2. Synchronous TCF by default
> 3. GCR_EL1.Excl allows all tags except 0 by default
> 
> Any other configuration diverging from the above is considered
> specialist deployment and will have to issue the prctl() on a per-thread
> basis.
> 
> Compat issues in user-space will be dealt with via environment
> variables but pretty much on/off rather than fine-grained tag checking
> mode. So for glibc, you'd have only _MTAG=0 or 1 and the only effect is
> using PROT_MTE + tagged pointers or no-PROT_MTE + tag 0.

enabling mte checks by default would be
nice and simple (a libc can support tagging
allocators without any change, assuming its
code is mte safe, which is true e.g. for the
latest glibc release and for musl libc).

the compat issue with this is existing code
using pointer top bits which i assume faults
when dereferenced with the mte checks enabled.
(although this should be very rare since
top byte ignore on deref is aarch64 specific.)

i see two options:

- don't care about top bit compat issues:
  change the default in the kernel as you
  described (so checks are enabled and users
  only need PROT_MTE mapping if they want to
  use tagging).

- care about top bit issues:
  leave the kernel abi as in the patch set
  and do the mte setup early in the libc.
  add elf markings to new binaries that
  they are mte compatible and libc can use
  that marking for the mte setup. dlopening
  incompatible libraries will fail. the
  issue with this is that we have no idea
  how to add the marking and the marking
  prevents mte use with existing binaries
  (e.g. an LD_PRELOAD malloc would require
  an updated libc).

for me it's hard to figure out which is
the right direction for mte.

> > > In the presence of stack tagging, I think any subsequent MTE config
> > > change across all threads is unsafe, irrespective of whether it's done
> > > by the kernel or user via SIGUSRx. I think the best we can do here is
> > > start with more appropriate defaults or enable them based on an ELF note
> > > before the application is started. The dynamic loader would not have to
> > > do anything extra here.
> > > 
> > > If we ignore stack tagging, the global configuration change may be
> > > achievable. I think for the MTE bits, this could be done lazily by the
> > > libc (e.g. on malloc()/free() call). The tag checking won't happen
> > > before such calls unless we change the kernel defaults. There is still
> > > the tagged address ABI enabling, could this be done lazily on syscall by
> > > the libc? If not, the kernel could synchronise (force) this on syscall
> > > entry from each thread based on some global prctl() bit.
> > 
> > i think the interesting use-cases are all about changing mte settings
> > before mte is in use in any way but after there are multiple threads.
> > (the async -> sync mode change on tag faults is i think less
> > interesting to the gnu linux world.)
> 
> So let's consider async/sync/no-check specialist uses and glibc would
> not have to handle them. I don't think async mode is useful on its own
> unless you have a way to turn on sync mode at run-time for more precise
> error identification (well, hoping that it will happen again).
> 
> > i guess lazy syscall abi switch works, but it is ugly: raw syscall
> > usage will be problematic and doing checks before calling into the
> > vdso might have unwanted overhead.
> 
> This lazy ABI switch could be handled by the kernel, though I wonder
> whether we should just relax it permanently when HWCAP2_MTE is present.

yeah i don't immediately see a problem with that.
but ideally there would be an escape hatch
(a way to opt out from the change).

thanks
Paul Eggert Aug. 20, 2020, 5:27 p.m. UTC | #12
On 8/20/20 9:43 AM, Szabolcs Nagy wrote:
> the compat issue with this is existing code
> using pointer top bits which i assume faults
> when dereferenced with the mte checks enabled.
> (although this should be very rare since
> top byte ignore on deref is aarch64 specific.)

Does anyone know of significant aarch64-specific application code that depends 
on top byte ignore? I would think it's so rare (nonexistent?) as to not be worth 
worrying about.

Even in the bad old days when Emacs used pointer top bits for typechecking, it 
carefully removed those bits before dereferencing. Any other reasonably-portable 
application would have to do the same of course.

This whole thing reminds me of the ancient IBM S/360 mainframes that were 
documented to ignore the top 8 bits of 32-bit addresses merely because a single 
model (the IBM 360/30, circa 1965) was so underpowered that it couldn't quickly 
check that the top bits were zero. This has caused countless software hassles 
over the years. Even today, the IBM z-Series hardware and software still 
supports 24-bit addressing mode because of that early-1960s design mistake. See:

Mashey JR. The Long Road to 64 Bits. ACM Queue. 2006-10-10. 
https://queue.acm.org/detail.cfm?id=1165766
Catalin Marinas Aug. 22, 2020, 11:28 a.m. UTC | #13
On Thu, Aug 20, 2020 at 05:43:15PM +0100, Szabolcs Nagy wrote:
> The 08/19/2020 10:54, Catalin Marinas wrote:
> > On Wed, Aug 12, 2020 at 01:45:21PM +0100, Szabolcs Nagy wrote:
> > > On 08/11/2020 18:20, Catalin Marinas wrote:
> > > turning sync tag checks on early would enable the most of the
> > > interesting usecases (only PROT_MTE has to be handled at runtime not
> > > the prctls. however i don't yet know how userspace will deal with
> > > compat issues, i.e. it may not be valid to unconditionally turn tag
> > > checks on early).
> > 
> > If we change the defaults so that no prctl() is required for the
> > standard use-case, it would solve most of the common deployment issues:
> > 
> > 1. Tagged address ABI default on when HWCAP2_MTE is present
> > 2. Synchronous TCF by default
> > 3. GCR_EL1.Excl allows all tags except 0 by default
> > 
> > Any other configuration diverging from the above is considered
> > specialist deployment and will have to issue the prctl() on a per-thread
> > basis.
> > 
> > Compat issues in user-space will be dealt with via environment
> > variables but pretty much on/off rather than fine-grained tag checking
> > mode. So for glibc, you'd have only _MTAG=0 or 1 and the only effect is
> > using PROT_MTE + tagged pointers or no-PROT_MTE + tag 0.
> 
> enabling mte checks by default would be nice and simple (a libc can
> support tagging allocators without any change assuming its code is mte
> safe which is true e.g. for the latest glibc release and for musl
> libc).

While talking to the Android folk, it occurred to me that the default
tag checking mode doesn't even need to be decided by the kernel. The
dynamic loader can set the desired tag check mode and the tagged address
ABI based on environment variables (_MTAG_ENABLE=x) and do a prctl()
before any threads have been created. Subsequent malloc() calls or
dlopen() can mmap/mprotect different memory regions to PROT_MTE and all
threads will be affected equally.

The only configuration a heap allocator may want to change is the tag
exclude mask (GCR_EL1.Excl) but even this can, by convention, be
configured by the dynamic loader.

> the compat issue with this is existing code using pointer top bits
> which i assume faults when dereferenced with the mte checks enabled.
> (although this should be very rare since top byte ignore on deref is
> aarch64 specific.)

They'd fault only if they dereference PROT_MTE memory and the tag check
mode is async or sync.

> i see two options:
> 
> - don't care about top bit compat issues:
>   change the default in the kernel as you described (so checks are
>   enabled and users only need PROT_MTE mapping if they want to use
>   tagging).

As I said above, suggested by the Google guys, this default choice can
be left with the dynamic loader before any threads are started.

> - care about top bit issues:
>   leave the kernel abi as in the patch set and do the mte setup early
>   in the libc. add elf markings to new binaries that they are mte
>   compatible and libc can use that marking for the mte setup.
>   dlopening incompatible libraries will fail. the issue with this is
>   that we have no idea how to add the marking and the marking prevents
>   mte use with existing binaries (and eg. ldpreload malloc would
>   require an updated libc).

Maybe a third option (which leaves the kernel ABI as is):

If the ELF markings only control the PROT_MTE regions (stack or heap),
we can configure the tag checking mode and tagged address ABI early
through environment variables (_MTAG_ENABLE). If you have a problematic
binary, just set _MTAG_ENABLE=0 and a dlopen, even if loading an
MTE-capable object, would not map the stack with PROT_MTE. Heap
allocators could also ignore _MTAG_ENABLE since PROT_MTE doesn't have an
effect if no tag checking is in place. This way we can probably mix
objects as long as we have a control.

So, in summary, I think we can get away with only issuing the prctl() in
the dynamic loader before any threads start and using PROT_MTE later at
run-time, multi-threaded, as needed by malloc(), dlopen etc.
Catalin Marinas Aug. 22, 2020, 11:31 a.m. UTC | #14
On Thu, Aug 20, 2020 at 10:27:43AM -0700, Paul Eggert wrote:
> On 8/20/20 9:43 AM, Szabolcs Nagy wrote:
> > the compat issue with this is existing code
> > using pointer top bits which i assume faults
> > when dereferenced with the mte checks enabled.
> > (although this should be very rare since
> > top byte ignore on deref is aarch64 specific.)
> 
> Does anyone know of significant aarch64-specific application code that
> depends on top byte ignore? I would think it's so rare (nonexistent?) as to
> not be worth worrying about.

Apart from the LLVM hwasan feature, I'm not aware of code relying on the
top byte ignore. There were discussions in the past to use it with some
JITs but I'm not sure they ever materialised.

I think the Mozilla JS engine uses (used?) additional bits on top of a
pointer but they are masked out before the access.

> Even in the bad old days when Emacs used pointer top bits for typechecking,
> it carefully removed those bits before dereferencing. Any other
> reasonably-portable application would have to do the same of course.

I agree.

Patch

diff --git a/Documentation/arm64/cpu-feature-registers.rst b/Documentation/arm64/cpu-feature-registers.rst
index 314fa5bc2655..27d8559d565b 100644
--- a/Documentation/arm64/cpu-feature-registers.rst
+++ b/Documentation/arm64/cpu-feature-registers.rst
@@ -174,6 +174,8 @@  infrastructure:
      +------------------------------+---------+---------+
      | Name                         |  bits   | visible |
      +------------------------------+---------+---------+
+     | MTE                          | [11-8]  |    y    |
+     +------------------------------+---------+---------+
      | SSBS                         | [7-4]   |    y    |
      +------------------------------+---------+---------+
      | BT                           | [3-0]   |    y    |
diff --git a/Documentation/arm64/elf_hwcaps.rst b/Documentation/arm64/elf_hwcaps.rst
index 84a9fd2d41b4..bbd9cf54db6c 100644
--- a/Documentation/arm64/elf_hwcaps.rst
+++ b/Documentation/arm64/elf_hwcaps.rst
@@ -240,6 +240,10 @@  HWCAP2_BTI
 
     Functionality implied by ID_AA64PFR0_EL1.BT == 0b0001.
 
+HWCAP2_MTE
+
+    Functionality implied by ID_AA64PFR1_EL1.MTE == 0b0010, as described
+    by Documentation/arm64/memory-tagging-extension.rst.
 
 4. Unused AT_HWCAP bits
 -----------------------
diff --git a/Documentation/arm64/index.rst b/Documentation/arm64/index.rst
index 09cbb4ed2237..4cd0e696f064 100644
--- a/Documentation/arm64/index.rst
+++ b/Documentation/arm64/index.rst
@@ -14,6 +14,7 @@  ARM64 Architecture
     hugetlbpage
     legacy_instructions
     memory
+    memory-tagging-extension
     pointer-authentication
     silicon-errata
     sve
diff --git a/Documentation/arm64/memory-tagging-extension.rst b/Documentation/arm64/memory-tagging-extension.rst
new file mode 100644
index 000000000000..e3709b536b89
--- /dev/null
+++ b/Documentation/arm64/memory-tagging-extension.rst
@@ -0,0 +1,305 @@ 
+===============================================
+Memory Tagging Extension (MTE) in AArch64 Linux
+===============================================
+
+Authors: Vincenzo Frascino <vincenzo.frascino@arm.com>
+         Catalin Marinas <catalin.marinas@arm.com>
+
+Date: 2020-02-25
+
+This document describes the provision of the Memory Tagging Extension
+functionality in AArch64 Linux.
+
+Introduction
+============
+
+ARMv8.5 based processors introduce the Memory Tagging Extension (MTE)
+feature. MTE is built on top of the ARMv8.0 virtual address tagging TBI
+(Top Byte Ignore) feature and allows software to access a 4-bit
+allocation tag for each 16-byte granule in the physical address space.
+Such a memory range must be mapped with the Normal-Tagged memory
+attribute. A logical tag is derived from bits 59-56 of the virtual
+address used for the memory access. A CPU with MTE enabled will compare
+the logical tag against the allocation tag and potentially raise an
+exception on mismatch, subject to system register configuration.
+
+Userspace Support
+=================
+
+When ``CONFIG_ARM64_MTE`` is selected and Memory Tagging Extension is
+supported by the hardware, the kernel advertises the feature to
+userspace via ``HWCAP2_MTE``.
+
+PROT_MTE
+--------
+
+To access the allocation tags, a user process must enable the Tagged
+memory attribute on an address range using a new ``prot`` flag for
+``mmap()`` and ``mprotect()``:
+
+``PROT_MTE`` - Pages allow access to the MTE allocation tags.
+
+The allocation tag is set to 0 when such pages are first mapped in the
+user address space and preserved on copy-on-write. ``MAP_SHARED`` is
+supported and the allocation tags can be shared between processes.
+
+**Note**: ``PROT_MTE`` is only supported on ``MAP_ANONYMOUS`` and
+RAM-based file mappings (``tmpfs``, ``memfd``). Passing it to other
+types of mapping will result in ``-EINVAL`` returned by these system
+calls.
+
+**Note**: The ``PROT_MTE`` flag (and corresponding memory type) cannot
+be cleared by ``mprotect()``.
+
+**Note**: ``madvise()`` memory ranges with ``MADV_DONTNEED`` and
+``MADV_FREE`` may have the allocation tags cleared (set to 0) at any
+point after the system call.
+
+Tag Check Faults
+----------------
+
+When ``PROT_MTE`` is enabled on an address range and a mismatch between
+the logical and allocation tags occurs on access, there are three
+configurable behaviours:
+
+- *Ignore* - This is the default mode. The CPU (and kernel) ignores the
+  tag check fault.
+
+- *Synchronous* - The kernel raises a ``SIGSEGV`` synchronously, with
+  ``.si_code = SEGV_MTESERR`` and ``.si_addr = <fault-address>``. The
+  memory access is not performed. If ``SIGSEGV`` is ignored or blocked
+  by the offending thread, the containing process is terminated with a
+  ``coredump``.
+
+- *Asynchronous* - The kernel raises a ``SIGSEGV``, in the offending
+  thread, asynchronously following one or multiple tag check faults,
+  with ``.si_code = SEGV_MTEAERR`` and ``.si_addr = 0`` (the faulting
+  address is unknown).
+
+The user can select the above modes, per thread, using the
+``prctl(PR_SET_TAGGED_ADDR_CTRL, flags, 0, 0, 0)`` system call where
+``flags`` contain one of the following values in the ``PR_MTE_TCF_MASK``
+bit-field:
+
+- ``PR_MTE_TCF_NONE``  - *Ignore* tag check faults
+- ``PR_MTE_TCF_SYNC``  - *Synchronous* tag check fault mode
+- ``PR_MTE_TCF_ASYNC`` - *Asynchronous* tag check fault mode
+
+The current tag check fault mode can be read using the
+``prctl(PR_GET_TAGGED_ADDR_CTRL, 0, 0, 0, 0)`` system call.
+
+Tag checking can also be disabled for a user thread by setting the
+``PSTATE.TCO`` bit with ``MSR TCO, #1``.
+
+**Note**: Signal handlers are always invoked with ``PSTATE.TCO = 0``,
+irrespective of the interrupted context. ``PSTATE.TCO`` is restored on
+``sigreturn()``.
+
+**Note**: There are no *match-all* logical tags available for user
+applications.
+
+**Note**: Kernel accesses to the user address space (e.g. ``read()``
+system call) are not checked if the user thread tag checking mode is
+``PR_MTE_TCF_NONE`` or ``PR_MTE_TCF_ASYNC``. If the tag checking mode is
+``PR_MTE_TCF_SYNC``, the kernel makes a best effort to check its user
+address accesses; however, it cannot always guarantee this.
+
+Excluding Tags in the ``IRG``, ``ADDG`` and ``SUBG`` instructions
+-----------------------------------------------------------------
+
+The architecture allows certain tags to be excluded from being randomly
+generated via the ``GCR_EL1.Exclude`` register bit-field. By default, Linux
+excludes all tags other than 0. A user thread can enable specific tags
+in the randomly generated set using the ``prctl(PR_SET_TAGGED_ADDR_CTRL,
+flags, 0, 0, 0)`` system call where ``flags`` contains the tags bitmap
+in the ``PR_MTE_TAG_MASK`` bit-field.
+
+**Note**: The hardware uses an exclude mask but the ``prctl()``
+interface provides an include mask. An include mask of ``0`` (exclusion
+mask ``0xffff``) results in the CPU always generating tag ``0``.
+
+Initial process state
+---------------------
+
+On ``execve()``, the new process has the following configuration:
+
+- ``PR_TAGGED_ADDR_ENABLE`` set to 0 (disabled)
+- Tag checking mode set to ``PR_MTE_TCF_NONE``
+- ``PR_MTE_TAG_MASK`` set to 0 (all tags excluded)
+- ``PSTATE.TCO`` set to 0
+- ``PROT_MTE`` not set on any of the initial memory maps
+
+On ``fork()``, the new process inherits the parent's configuration and
+memory map attributes with the exception of the ``madvise()`` ranges
+with ``MADV_WIPEONFORK`` which will have the data and tags cleared (set
+to 0).
+
+The ``ptrace()`` interface
+--------------------------
+
+``PTRACE_PEEKMTETAGS`` and ``PTRACE_POKEMTETAGS`` allow a tracer to read
+the tags from or set the tags to a tracee's address space. The
+``ptrace()`` system call is invoked as ``ptrace(request, pid, addr,
+data)`` where:
+
+- ``request`` - one of ``PTRACE_PEEKMTETAGS`` or ``PTRACE_POKEMTETAGS``.
+- ``pid`` - the tracee's PID.
+- ``addr`` - address in the tracee's address space.
+- ``data`` - pointer to a ``struct iovec`` where ``iov_base`` points to
+  a buffer of ``iov_len`` length in the tracer's address space.
+
+The tags in the tracer's ``iov_base`` buffer are represented as one
+4-bit tag per byte and correspond to a 16-byte MTE tag granule in the
+tracee's address space.
+
+**Note**: If ``addr`` is not aligned to a 16-byte granule, the kernel
+will use the corresponding aligned address.
+
+``ptrace()`` return value:
+
+- 0 - tags were copied, the tracer's ``iov_len`` was updated to the
+  number of tags transferred. This may be smaller than the requested
+  ``iov_len`` if the requested address range in the tracee's or the
+  tracer's space cannot be accessed or does not have valid tags.
+- ``-EPERM`` - the specified process cannot be traced.
+- ``-EIO`` - the tracee's address range cannot be accessed (e.g. invalid
+  address) and no tags copied. ``iov_len`` not updated.
+- ``-EFAULT`` - fault on accessing the tracer's memory (``struct iovec``
+  or ``iov_base`` buffer) and no tags copied. ``iov_len`` not updated.
+- ``-EOPNOTSUPP`` - the tracee's address does not have valid tags (never
+  mapped with the ``PROT_MTE`` flag). ``iov_len`` not updated.
+
+**Note**: There are no transient errors for the requests above, so user
+programs should not retry in case of a non-zero system call return.
+
+``PTRACE_GETREGSET`` and ``PTRACE_SETREGSET`` with
+``addr == NT_ARM_TAGGED_ADDR_CTRL`` allow ``ptrace()`` access to the
+tagged address ABI control and MTE configuration of a process as per
+the ``prctl()`` options described in
+Documentation/arm64/tagged-address-abi.rst and above. The corresponding
+``regset`` is 1 element of 8 bytes (``sizeof(long)``).
+
+Example of correct usage
+========================
+
+*MTE Example code*
+
+.. code-block:: c
+
+    /*
+     * To be compiled with -march=armv8.5-a+memtag
+     */
+    #include <errno.h>
+    #include <stdint.h>
+    #include <stdio.h>
+    #include <stdlib.h>
+    #include <unistd.h>
+    #include <sys/auxv.h>
+    #include <sys/mman.h>
+    #include <sys/prctl.h>
+
+    /*
+     * From arch/arm64/include/uapi/asm/hwcap.h
+     */
+    #define HWCAP2_MTE              (1 << 18)
+
+    /*
+     * From arch/arm64/include/uapi/asm/mman.h
+     */
+    #define PROT_MTE                 0x20
+
+    /*
+     * From include/uapi/linux/prctl.h
+     */
+    #define PR_SET_TAGGED_ADDR_CTRL 55
+    #define PR_GET_TAGGED_ADDR_CTRL 56
+    # define PR_TAGGED_ADDR_ENABLE  (1UL << 0)
+    # define PR_MTE_TCF_SHIFT       1
+    # define PR_MTE_TCF_NONE        (0UL << PR_MTE_TCF_SHIFT)
+    # define PR_MTE_TCF_SYNC        (1UL << PR_MTE_TCF_SHIFT)
+    # define PR_MTE_TCF_ASYNC       (2UL << PR_MTE_TCF_SHIFT)
+    # define PR_MTE_TCF_MASK        (3UL << PR_MTE_TCF_SHIFT)
+    # define PR_MTE_TAG_SHIFT       3
+    # define PR_MTE_TAG_MASK        (0xffffUL << PR_MTE_TAG_SHIFT)
+
+    /*
+     * Insert a random logical tag into the given pointer.
+     */
+    #define insert_random_tag(ptr) ({                       \
+            uint64_t __val;                                 \
+            asm("irg %0, %1" : "=r" (__val) : "r" (ptr));   \
+            __val;                                          \
+    })
+
+    /*
+     * Set the allocation tag on the destination address.
+     */
+    #define set_tag(tagged_addr) do {                                      \
+            asm volatile("stg %0, [%0]" : : "r" (tagged_addr) : "memory"); \
+    } while (0)
+
+    int main()
+    {
+            unsigned char *a;
+            unsigned long page_sz = sysconf(_SC_PAGESIZE);
+            unsigned long hwcap2 = getauxval(AT_HWCAP2);
+
+            /* check if MTE is present */
+            if (!(hwcap2 & HWCAP2_MTE))
+                    return EXIT_FAILURE;
+
+            /*
+             * Enable the tagged address ABI, synchronous MTE tag check faults and
+             * allow all non-zero tags in the randomly generated set.
+             */
+            if (prctl(PR_SET_TAGGED_ADDR_CTRL,
+                      PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC | (0xfffe << PR_MTE_TAG_SHIFT),
+                      0, 0, 0)) {
+                    perror("prctl() failed");
+                    return EXIT_FAILURE;
+            }
+
+            a = mmap(0, page_sz, PROT_READ | PROT_WRITE,
+                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+            if (a == MAP_FAILED) {
+                    perror("mmap() failed");
+                    return EXIT_FAILURE;
+            }
+
+            /*
+             * Enable MTE on the above anonymous mmap. The flag could be passed
+             * directly to mmap() instead, skipping this step.
+             */
+            if (mprotect(a, page_sz, PROT_READ | PROT_WRITE | PROT_MTE)) {
+                    perror("mprotect() failed");
+                    return EXIT_FAILURE;
+            }
+
+            /* access with the default tag (0) */
+            a[0] = 1;
+            a[1] = 2;
+
+            printf("a[0] = %hhu a[1] = %hhu\n", a[0], a[1]);
+
+            /* set the logical and allocation tags */
+            a = (unsigned char *)insert_random_tag(a);
+            set_tag(a);
+
+            printf("%p\n", a);
+
+            /* non-zero tag access */
+            a[0] = 3;
+            printf("a[0] = %hhu a[1] = %hhu\n", a[0], a[1]);
+
+            /*
+             * If MTE is enabled correctly the next instruction will generate an
+             * exception.
+             */
+            printf("Expecting SIGSEGV...\n");
+            a[16] = 0xdd;
+
+            /* this should not be printed in the PR_MTE_TCF_SYNC mode */
+            printf("...haven't got one\n");
+
+            return EXIT_FAILURE;
+    }