[v7,29/29] arm64: mte: Add Memory Tagging Extension documentation

Message ID 20200715170844.30064-30-catalin.marinas@arm.com
State New
Series
  • arm64: Memory Tagging Extension user-space support

Commit Message

Catalin Marinas July 15, 2020, 5:08 p.m. UTC
From: Vincenzo Frascino <vincenzo.frascino@arm.com>

Memory Tagging Extension (part of the ARMv8.5 Extensions) provides
a mechanism to detect the sources of memory-related errors which
may be vulnerable to exploitation, including bounds violations,
use-after-free, use-after-return, use-out-of-scope and
use-before-initialization errors.

Add Memory Tagging Extension documentation for the arm64 Linux
kernel support.

Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Co-developed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
Cc: Will Deacon <will@kernel.org>
---

Notes:
    v7:
    - Add information on ptrace() regset access (NT_ARM_TAGGED_ADDR_CTRL).
    
    v4:
    - Document behaviour of madvise(MADV_DONTNEED/MADV_FREE).
    - Document the initial process state on fork/execve.
    - Clarify when the kernel uaccess checks the tags.
    - Minor updates to the example code.
    - A few other minor clean-ups following review.
    
    v3:
    - Modify the uaccess checking conditions: only when the sync mode is
      selected by the user. In async mode, the kernel uaccesses are not
      checked.
    - Clarify that an include mask of 0 (exclude mask 0xffff) results in
      always generating tag 0.
    - Document the ptrace() interface.
    
    v2:
    - Documented the uaccess kernel tag checking mode.
    - Removed the BTI definitions from cpu-feature-registers.rst.
    - Removed the paragraph stating that MTE depends on the tagged address
      ABI (while the Kconfig entry does, there is no requirement for the
      user to enable both).
    - Changed the GCR_EL1.Exclude handling description following the change
      in the prctl() interface (include vs exclude mask).
    - Updated the example code.

 Documentation/arm64/cpu-feature-registers.rst |   2 +
 Documentation/arm64/elf_hwcaps.rst            |   4 +
 Documentation/arm64/index.rst                 |   1 +
 .../arm64/memory-tagging-extension.rst        | 305 ++++++++++++++++++
 4 files changed, 312 insertions(+)
 create mode 100644 Documentation/arm64/memory-tagging-extension.rst

Comments

Szabolcs Nagy July 27, 2020, 4:36 p.m. UTC | #1
The 07/15/2020 18:08, Catalin Marinas wrote:
> From: Vincenzo Frascino <vincenzo.frascino@arm.com>
> 
> [...]
> diff --git a/Documentation/arm64/cpu-feature-registers.rst b/Documentation/arm64/cpu-feature-registers.rst
...
> +Tag Check Faults
> +----------------
> +
> +When ``PROT_MTE`` is enabled on an address range and a mismatch between
> +the logical and allocation tags occurs on access, there are three
> +configurable behaviours:
> +
> +- *Ignore* - This is the default mode. The CPU (and kernel) ignores the
> +  tag check fault.
> +
> +- *Synchronous* - The kernel raises a ``SIGSEGV`` synchronously, with
> +  ``.si_code = SEGV_MTESERR`` and ``.si_addr = <fault-address>``. The
> +  memory access is not performed. If ``SIGSEGV`` is ignored or blocked
> +  by the offending thread, the containing process is terminated with a
> +  ``coredump``.
> +
> +- *Asynchronous* - The kernel raises a ``SIGSEGV``, in the offending
> +  thread, asynchronously following one or multiple tag check faults,
> +  with ``.si_code = SEGV_MTEAERR`` and ``.si_addr = 0`` (the faulting
> +  address is unknown).
> +
> +The user can select the above modes, per thread, using the
> +``prctl(PR_SET_TAGGED_ADDR_CTRL, flags, 0, 0, 0)`` system call where
> +``flags`` contain one of the following values in the ``PR_MTE_TCF_MASK``
> +bit-field:
> +
> +- ``PR_MTE_TCF_NONE``  - *Ignore* tag check faults
> +- ``PR_MTE_TCF_SYNC``  - *Synchronous* tag check fault mode
> +- ``PR_MTE_TCF_ASYNC`` - *Asynchronous* tag check fault mode
> +
> +The current tag check fault mode can be read using the
> +``prctl(PR_GET_TAGGED_ADDR_CTRL, 0, 0, 0, 0)`` system call.

we discussed the need for per process prctl off list, i will
try to summarize the requirement here:

- it cannot be guaranteed in general that a library initializer
or first call into a library happens when the process is still
single threaded.

- user code currently has no way to call prctl in all threads of
a process and even within the c runtime doing so is problematic
(it has to signal all threads, which requires a reserved signal
and dealing with exiting threads and signal masks, such mechanism
can break qemu user and various other userspace tooling).

- we don't yet have a defined contract in userspace about how user
code may enable mte (i.e. use the prctl call), but it seems that
there will be use cases for it: LD_PRELOADing malloc for heap
tagging is one such case, but any library or custom allocator
that wants to use mte will have this issue: when it enables mte
it wants to enable it for all threads in the process. (or at
least all threads managed by the c runtime).

- even if user code is not allowed to call the prctl directly,
i.e. the prctl settings are owned by the libc, there will be
cases when the settings have to be changed in a multithreaded
process (e.g. dlopening a library that requires a particular
mte state).

a solution is to introduce a flag like SECCOMP_FILTER_FLAG_TSYNC
that means the prctl is for all threads in the process not just
for the current one. however the exact semantics is not obvious
if there are inconsistent settings in different threads or user
code tries to use the prctl concurrently: first checking then
setting the mte state via separate prctl calls is racy. but if
the userspace contract for enabling mte limits who and when can
call the prctl then i think the simple sync flag approach works.

(the sync flag should apply to all prctl settings: tagged addr
syscall abi, mte check fault mode, irg tag excludes. ideally it
would work for getting the process wide state and it would fail
in case of inconsistent settings.)

we may need to document some memory ordering details when
memory accesses in other threads are affected, but i think
that can be something simple that leaves it unspecified
what happens with memory accesses that are not synchronized
with the prctl call.
Dave Martin July 28, 2020, 11:08 a.m. UTC | #2
On Mon, Jul 27, 2020 at 05:36:35PM +0100, Szabolcs Nagy wrote:
> The 07/15/2020 18:08, Catalin Marinas wrote:
> > [...]
> 
> we discussed the need for per process prctl off list, i will
> try to summarize the requirement here:
> 
> - it cannot be guaranteed in general that a library initializer
> or first call into a library happens when the process is still
> single threaded.
> 
> - user code currently has no way to call prctl in all threads of
> a process and even within the c runtime doing so is problematic
> (it has to signal all threads, which requires a reserved signal
> and dealing with exiting threads and signal masks, such mechanism
> can break qemu user and various other userspace tooling).

When working on the SVE support, I came to the conclusion that this
kind of thing would normally either be done by the runtime itself, or in
close cooperation with the runtime.  However, for SVE it never makes
sense for one thread to asynchronously change the vector length of
another thread -- that's different from the MTE situation.

> - we don't yet have a defined contract in userspace about how user
> code may enable mte (i.e. use the prctl call), but it seems that
> there will be use cases for it: LD_PRELOADing malloc for heap
> tagging is one such case, but any library or custom allocator
> that wants to use mte will have this issue: when it enables mte
> it wants to enable it for all threads in the process. (or at
> least all threads managed by the c runtime).

What are the situations where we anticipate a need to twiddle MTE in
multiple threads simultaneously, other than during process startup?

> - even if user code is not allowed to call the prctl directly,
> i.e. the prctl settings are owned by the libc, there will be
> cases when the settings have to be changed in a multithreaded
> process (e.g. dlopening a library that requires a particular
> mte state).

Could be avoided by refusing to dlopen a library that is incompatible
with the current process.

dlopen()ing a library that doesn't support tagged addresses, in a
process that does use tagged addresses, seems undesirable even if tag
checking is currently turned off.


> a solution is to introduce a flag like SECCOMP_FILTER_FLAG_TSYNC
> that means the prctl is for all threads in the process not just
> for the current one. however the exact semantics is not obvious
> if there are inconsistent settings in different threads or user
> code tries to use the prctl concurrently: first checking then
> setting the mte state via separate prctl calls is racy. but if
> the userspace contract for enabling mte limits who and when can
> call the prctl then i think the simple sync flag approach works.
> 
> (the sync flag should apply to all prctl settings: tagged addr
> syscall abi, mte check fault mode, irg tag excludes. ideally it
> would work for getting the process wide state and it would fail
> in case of inconsistent settings.)

If going down this route, perhaps we could have sets of settings:
so for each setting we have a process-wide value and a per-thread
value, with defined rules about how they combine.

Since MTE is a debugging feature, we might be able to be less aggressive
about synchronisation than in the SECCOMP case.

> we may need to document some memory ordering details when
> memory accesses in other threads are affected, but i think
> that can be something simple that leaves it unspecified
> what happens with memory accesses that are not synchronized
> with the prctl call.

Hmmm...

Cheers
---Dave
Szabolcs Nagy July 28, 2020, 2:53 p.m. UTC | #3
The 07/28/2020 12:08, Dave Martin wrote:
> On Mon, Jul 27, 2020 at 05:36:35PM +0100, Szabolcs Nagy wrote:
> > The 07/15/2020 18:08, Catalin Marinas wrote:
> > > [...]
> > 
> > we discussed the need for per process prctl off list, i will
> > try to summarize the requirement here:
> > 
> > - it cannot be guaranteed in general that a library initializer
> > or first call into a library happens when the process is still
> > single threaded.
> > 
> > - user code currently has no way to call prctl in all threads of
> > a process and even within the c runtime doing so is problematic
> > (it has to signal all threads, which requires a reserved signal
> > and dealing with exiting threads and signal masks, such mechanism
> > can break qemu user and various other userspace tooling).
> 
> When working on the SVE support, I came to the conclusion that this
> kind of thing would normally either be done by the runtime itself, or in
> close cooperation with the runtime.  However, for SVE it never makes
> sense for one thread to asynchronously change the vector length of
> another thread -- that's different from the MTE situation.

currently there is a libc mechanism to do some operation
in all threads (e.g. for set*id) but this is fragile
and not something that can be exposed to user code.
(on the kernel side it should be much simpler to do)

> > - we don't yet have a defined contract in userspace about how user
> > code may enable mte (i.e. use the prctl call), but it seems that
> > there will be use cases for it: LD_PRELOADing malloc for heap
> > tagging is one such case, but any library or custom allocator
> > that wants to use mte will have this issue: when it enables mte
> > it wants to enable it for all threads in the process. (or at
> > least all threads managed by the c runtime).
> 
> What are the situations where we anticipate a need to twiddle MTE in
> multiple threads simultaneously, other than during process startup?
> 
> > - even if user code is not allowed to call the prctl directly,
> > i.e. the prctl settings are owned by the libc, there will be
> > cases when the settings have to be changed in a multithreaded
> > process (e.g. dlopening a library that requires a particular
> > mte state).
> 
> Could be avoided by refusing to dlopen a library that is incompatible
> with the current process.
> 
> dlopen()ing a library that doesn't support tagged addresses, in a
> process that does use tagged addresses, seems undesirable even if tag
> checking is currently turned off.

yes but it can go the other way too:

at startup the libc does not enable tag checks
for performance reasons, but at dlopen time a
library is detected to use mte (e.g. stack
tagging or custom allocator).

then libc or the dlopened library has to ensure
that checks are enabled in all threads. (in case
of stack tagging the libc has to mark existing
stacks with PROT_MTE too; there is a mechanism for
this in glibc to deal with dlopened libraries
that require executable stack and only reject
the dlopen if this cannot be performed.)

another usecase is that the libc is mte-safe
(it accepts tagged pointers and memory in its
interfaces), but it does not enable mte (this
will be the case with glibc 2.32) and user
libraries have to enable mte to use it (custom
allocator or malloc interposition are examples).

and i think this is necessary if userspace wants
to turn async tag check into sync tag check at
runtime when a failure is detected.

> > a solution is to introduce a flag like SECCOMP_FILTER_FLAG_TSYNC
> > that means the prctl is for all threads in the process not just
> > for the current one. however the exact semantics is not obvious
> > if there are inconsistent settings in different threads or user
> > code tries to use the prctl concurrently: first checking then
> > setting the mte state via separate prctl calls is racy. but if
> > the userspace contract for enabling mte limits who and when can
> > call the prctl then i think the simple sync flag approach works.
> > 
> > (the sync flag should apply to all prctl settings: tagged addr
> > syscall abi, mte check fault mode, irg tag excludes. ideally it
> > would work for getting the process wide state and it would fail
> > in case of inconsistent settings.)
> 
> If going down this route, perhaps we could have sets of settings:
> so for each setting we have a process-wide value and a per-thread
> value, with defined rules about how they combine.
> 
> Since MTE is a debugging feature, we might be able to be less aggressive
> about synchronisation than in the SECCOMP case.

separate process-wide and per-thread value
works for me and i expect most uses will
be process wide settings.

i don't think mte is less of a security
feature than seccomp.

if linux does not want to add a per process
setting then only libc will be able to opt-in
to mte and only at very early in the startup
process (before executing any user code that
may start threads). this is not out of question,
but i think it limits the usage and deployment
options.

> > we may need to document some memory ordering details when
> > memory accesses in other threads are affected, but i think
> > that can be something simple that leaves it unspecified
> > what happens with memory accesses that are not synchronized
> > with the prctl call.
> 
> Hmmm...

e.g. it may be enough if the spec only works if
there is no PROT_MTE memory mapped yet, and no
tagged addresses are present in the multi-threaded
process when the prctl is called.
Catalin Marinas July 28, 2020, 7:59 p.m. UTC | #4
On Tue, Jul 28, 2020 at 03:53:51PM +0100, Szabolcs Nagy wrote:
> The 07/28/2020 12:08, Dave Martin wrote:
> > On Mon, Jul 27, 2020 at 05:36:35PM +0100, Szabolcs Nagy wrote:
> > > a solution is to introduce a flag like SECCOMP_FILTER_FLAG_TSYNC
> > > that means the prctl is for all threads in the process not just
> > > for the current one. however the exact semantics is not obvious
> > > if there are inconsistent settings in different threads or user
> > > code tries to use the prctl concurrently: first checking then
> > > setting the mte state via separate prctl calls is racy. but if
> > > the userspace contract for enabling mte limits who and when can
> > > call the prctl then i think the simple sync flag approach works.
> > > 
> > > (the sync flag should apply to all prctl settings: tagged addr
> > > syscall abi, mte check fault mode, irg tag excludes. ideally it
> > > would work for getting the process wide state and it would fail
> > > in case of inconsistent settings.)
> > 
> > If going down this route, perhaps we could have sets of settings:
> > so for each setting we have a process-wide value and a per-thread
> > value, with defined rules about how they combine.
> > 
> > Since MTE is a debugging feature, we might be able to be less aggressive
> > about synchronisation than in the SECCOMP case.
> 
> separate process-wide and per-thread value
> works for me and i expect most uses will
> be process wide settings.

The problem with the thread synchronisation is, unlike SECCOMP, that we
need to update the SCTLR_EL1.TCF0 field across all the CPUs that may run
threads of the current process. I haven't convinced myself that this is
race-free without heavy locking. If we go for some heavy mechanism like
stop_machine(), that opens the kernel to DoS attacks from user. Still
investigating if something like membarrier() would be sufficient.

SECCOMP gets away with this as it only needs to set some variable
without IPI'ing the other CPUs.

> i don't think mte is less of a security
> feature than seccomp.

Well, MTE is probabilistic, SECCOMP seems to be more precise ;).

> if linux does not want to add a per process
> setting then only libc will be able to opt-in
> to mte and only at very early in the startup
> process (before executing any user code that
> may start threads). this is not out of question,
> but i think it limits the usage and deployment
> options.

There is also the risk that we try to be too flexible at this stage
without a real use-case.
Szabolcs Nagy Aug. 3, 2020, 12:43 p.m. UTC | #5
The 07/28/2020 20:59, Catalin Marinas wrote:
> On Tue, Jul 28, 2020 at 03:53:51PM +0100, Szabolcs Nagy wrote:
> > if linux does not want to add a per process
> > setting then only libc will be able to opt-in
> > to mte and only at very early in the startup
> > process (before executing any user code that
> > may start threads). this is not out of question,
> > but i think it limits the usage and deployment
> > options.
> 
> There is also the risk that we try to be too flexible at this stage
> without a real use-case.

i don't know how mte will be turned on in libc.

if we can always turn sync tag checks on early
whenever mte is available then i think there is
no issue.

but if we have to make the decision later for
compatibility or performance reasons then per
thread setting is problematic.

use of the prctl outside of libc is very limited
if it's per thread only:

- application code may use it in a (elf specific)
  pre-initialization function, but that's a bit
  obscure (not exposed in c) and it is reasonable
  for an application to enable mte checks after
  it registered a signal handler for mte faults.
  (and at that point it may be multi-threaded).

- library code normally initializes per thread
  state on the first call into the library from a
  given thread, but with mte, as soon as memory /
  pointers are tagged in one thread, all threads
  are affected: not performing checks in other
  threads is less secure (may be ok) and it means
  incompatible syscall abi (not ok). so at least
  PR_TAGGED_ADDR_ENABLE should have a process-wide
  setting for this usage.

but i guess it is fine to design the mechanism
for these in a later linux version, until then
such usage will be unreliable (will depend on
how early threads are created).
Catalin Marinas Aug. 7, 2020, 3:19 p.m. UTC | #6
Hi Szabolcs,

On Mon, Aug 03, 2020 at 01:43:10PM +0100, Szabolcs Nagy wrote:
> The 07/28/2020 20:59, Catalin Marinas wrote:
> > On Tue, Jul 28, 2020 at 03:53:51PM +0100, Szabolcs Nagy wrote:
> > > if linux does not want to add a per process setting then only libc
> > > will be able to opt-in to mte and only at very early in the
> > > startup process (before executing any user code that may start
> > > threads). this is not out of question, but i think it limits the
> > > usage and deployment options.
> > 
> > There is also the risk that we try to be too flexible at this stage
> > without a real use-case.
> 
> i don't know how mte will be turned on in libc.
> 
> if we can always turn sync tag checks on early whenever mte is
> available then i think there is no issue.
> 
> but if we have to make the decision later for compatibility or
> performance reasons then per thread setting is problematic.

At least for libc, I'm not sure how you could even turn MTE on at
run-time. The heap allocations would have to be mapped with PROT_MTE as
we can't easily change them (well, you could mprotect(), assuming the
user doesn't use tagged pointers on them).

There is a case to switch tag checking from asynchronous to synchronous
at run-time based on a signal but that's rather specific to Android
where zygote controls the signal handler. I don't think you can do this
with libc. Even on Android, since the async fault signal is delivered
per thread, it probably does this lazily (alternatively, it could issue
a SIGUSRx to the other threads for synchronisation).

> use of the prctl outside of libc is very limited if it's per thread
> only:

In the non-Android context, I think the prctl() for MTE control should
be restricted to the libc. You can control the mode prior to the process
being started using environment variables. I really don't see how the
libc could handle the changing of the MTE behaviour at run-time without
itself handling signals.

> - application code may use it in a (elf specific) pre-initialization
>   function, but that's a bit obscure (not exposed in c) and it is
>   reasonable for an application to enable mte checks after it
>   registered a signal handler for mte faults. (and at that point it
>   may be multi-threaded).

Since the app can install signal handlers, it can also deal with
notifying other threads with a SIGUSRx, assuming that it decided this
after multiple threads were created. If it does this while
single-threaded, subsequent threads would inherit the first thread's
setting.

The only use-case I see for doing this in the kernel is if the code
requiring an MTE behaviour change cannot install signal handlers. More
on this below.

> - library code normally initializes per thread state on the first call
>   into the library from a given thread, but with mte, as soon as
>   memory / pointers are tagged in one thread, all threads are
>   affected: not performing checks in other threads is less secure (may
>   be ok) and it means incompatible syscall abi (not ok). so at least
>   PR_TAGGED_ADDR_ENABLE should have process wide setting for this
>   usage.

My assumption with MTE is that the libc will initialise it when the
library is loaded (something like __attribute__((constructor))) and it's
still in single-threaded mode. Does it wait until the first malloc()
call? Also, is there such a thing as a per-thread initialiser for a
dynamic library (not sure it can be implemented in practice though)?

The PR_TAGGED_ADDR_ENABLE synchronisation at least doesn't require IPIs
to other CPUs to change the hardware state. However, it can still race
with thread creation or a prctl() on another thread, not sure what we
can define here, especially as it depends on the kernel internals: e.g.
thread creation copies some data structures of the calling thread but at
the same time another thread wants to change such structures for all
threads of that process. The ordering of events here looks pretty
fragile.

Maybe with another global status (per process) which takes priority over
the per thread one would be easier. But such priority is not temporal
(i.e. whoever called prctl() last) but pretty strict: once a global
control was requested, it will remain global no matter what subsequent
threads request (or we can do it the other way around).

> but i guess it is fine to design the mechanism for these in a later
> linux version, until then such usage will be unreliable (will depend
> on how early threads are created).

Until we have a real use-case, I'd not complicate the matters further.
For example, I'm still not sure how realistic it is for an application
to load a new heap allocator after some threads were created. Even the
glibc support, I don't think it needs this.

Could an LD_PRELOADED library be initialised after threads were created
(I guess it could if another preloaded library created threads)? Even if
it does, do we have an example, or is it rather theoretical?

If this becomes an essential use-case, we can look at adding a new flag
for prctl() which would set the option globally, with the caveats
mentioned above. It doesn't need to be in the initial ABI (and the
PR_TAGGED_ADDR_ENABLE is already upstream).

Thanks.
Szabolcs Nagy Aug. 10, 2020, 2:13 p.m. UTC | #7
The 08/07/2020 16:19, Catalin Marinas wrote:
> On Mon, Aug 03, 2020 at 01:43:10PM +0100, Szabolcs Nagy wrote:
> > if we can always turn sync tag checks on early whenever mte is
> > available then i think there is no issue.
> > 
> > but if we have to make the decision later for compatibility or
> > performance reasons then per thread setting is problematic.
> 
> At least for libc, I'm not sure how you could even turn MTE on at
> run-time. The heap allocations would have to be mapped with PROT_MTE as
> we can't easily change them (well, you could mprotect(), assuming the
> user doesn't use tagged pointers on them).

e.g. dlopen of library with stack tagging.
(libc can mark stacks with PROT_MTE at that time)

or just turn on sync tag checks later when using
heap tagging.

> 
> There is a case to switch tag checking from asynchronous to synchronous
> at run-time based on a signal but that's rather specific to Android
> where zygote controls the signal handler. I don't think you can do this
> with libc. Even on Android, since the async fault signal is delivered
> per thread, it probably does this lazily (alternatively, it could issue
> a SIGUSRx to the other threads for synchronisation).

i think what that zygote is doing is a valid use-case but
in a normal linux setup the application owns the signal
handlers so the tag check switch has to be done by the
application. the libc can expose some api for it, so in
principle it's enough if the libc can do the runtime
switch, but we dont plan to add new libc apis for mte.

> > use of the prctl outside of libc is very limited if it's per thread
> > only:
> 
> In the non-Android context, I think the prctl() for MTE control should
> be restricted to the libc. You can control the mode prior to the process
> being started using environment variables. I really don't see how the
> libc could handle the changing of the MTE behaviour at run-time without
> itself handling signals.
> 
> > - application code may use it in a (elf specific) pre-initialization
> >   function, but that's a bit obscure (not exposed in c) and it is
> >   reasonable for an application to enable mte checks after it
> >   registered a signal handler for mte faults. (and at that point it
> >   may be multi-threaded).
> 
> Since the app can install signal handlers, it can also deal with
> notifying other threads with a SIGUSRx, assuming that it decided this
> after multiple threads were created. If it does this while
> single-threaded, subsequent threads would inherit the first one.

the application does not know what libraries create what
threads in the background, i dont think there is a way to
send signals to each thread (e.g. /proc/self/task cannot
be read atomically with respect to thread creation/exit).

the libc controls thread creation and exit so it can have
a list of threads it can notify, but an application cannot
do this. (libc could provide an api so applications can
do some per thread operation, but a libc would not do this
happily: currently there are locks around thread creation
and exit that are only needed for this "signal all threads"
mechanism which makes it hard to expose to users)

one way applications sometimes work this around is to
self re-exec. but that's a big hammer and not entirely
reliable (e.g. the exe may not be available on the
filesystem any more or the commandline args may need
to be preserved but they are clobbered, or some complex
application state needs to be recreated etc)

> 
> The only use-case I see for doing this in the kernel is if the code
> requiring an MTE behaviour change cannot install signal handlers. More
> on this below.
> 
> > - library code normally initializes per thread state on the first call
> >   into the library from a given thread, but with mte, as soon as
> >   memory / pointers are tagged in one thread, all threads are
> >   affected: not performing checks in other threads is less secure (may
> >   be ok) and it means incompatible syscall abi (not ok). so at least
> >   PR_TAGGED_ADDR_ENABLE should have process wide setting for this
> >   usage.
> 
> My assumption with MTE is that the libc will initialise it when the
> library is loaded (something __attribute__((constructor))) and it's
> still in single-threaded mode. Does it wait until the first malloc()
> call? Also, is there such a thing as a per-thread initialiser for a
> dynamic library (not sure it can be implemented in practice though)?

there is no per thread initializer in an elf module.
(tls state is usually initialized lazily in threads
when necessary.)

malloc calls can happen before the ctors of an LD_PRELOAD
library and threads can be created before both.
glibc runs ldpreload ctors after other library ctors.

custom allocator can be of course dlopened.
(i'd expect several language runtimes to have their
own allocator and support dlopening the runtime)

> 
> The PR_TAGGED_ADDR_ENABLE synchronisation at least doesn't require IPIs
> to other CPUs to change the hardware state. However, it can still race
> with thread creation or a prctl() on another thread, not sure what we
> can define here, especially as it depends on the kernel internals: e.g.
> thread creation copies some data structures of the calling thread but at
> the same time another thread wants to change such structures for all
> threads of that process. The ordering of events here looks pretty
> fragile.
> 
> Maybe another global status (per process) which takes priority over
> the per-thread one would be easier. But such priority is not temporal
> (i.e. whoever called prctl() last) but pretty strict: once a global
> control was requested, it will remain global no matter what subsequent
> threads request (or we can do it the other way around).

i see.

> > but i guess it is fine to design the mechanism for these in a later
> > linux version, until then such usage will be unreliable (will depend
> > on how early threads are created).
> 
> Until we have a real use-case, I'd not complicate the matters further.
> For example, I'm still not sure how realistic it is for an application
> to load a new heap allocator after some threads were created. Even the
> glibc support doesn't seem to need this.
> 
> Could an LD_PRELOADED library be initialised after threads were created
> (I guess it could if another preloaded library created threads)? Even if
> it does, do we have an example, or is this rather theoretical?

i believe this happens e.g. in applications built
with tsan. (the thread sanitizer creates a
background thread early which i think does not
call malloc itself but may want to access
malloced memory, but i don't have a setup with
tsan support to test this)

> 
> If this becomes an essential use-case, we can look at adding a new flag
> for prctl() which would set the option globally, with the caveats
> mentioned above. It doesn't need to be in the initial ABI (and the
> PR_TAGGED_ADDR_ENABLE is already upstream).
> 
> Thanks.
> 
> -- 
> Catalin
Catalin Marinas Aug. 11, 2020, 5:20 p.m. UTC | #8
On Mon, Aug 10, 2020 at 03:13:09PM +0100, Szabolcs Nagy wrote:
> The 08/07/2020 16:19, Catalin Marinas wrote:
> > On Mon, Aug 03, 2020 at 01:43:10PM +0100, Szabolcs Nagy wrote:
> > > if we can always turn sync tag checks on early whenever mte is
> > > available then i think there is no issue.
> > > 
> > > but if we have to make the decision later for compatibility or
> > > performance reasons then per thread setting is problematic.
> > 
> > At least for libc, I'm not sure how you could even turn MTE on at
> > run-time. The heap allocations would have to be mapped with PROT_MTE as
> > we can't easily change them (well, you could mprotect(), assuming the
> > user doesn't use tagged pointers on them).
> 
> e.g. dlopen of library with stack tagging. (libc can mark stacks with
> PROT_MTE at that time)

If we allow such mixed object support with stack tagging enabled at
dlopen, PROT_MTE would need to be turned on for each thread stack. This
wouldn't require synchronisation, only knowing where the thread stacks
are, but you'd need to make sure threads don't call into the new library
until the stacks have been mprotect'ed. Doing this midway through a
function execution may corrupt the tags.

So I'm not sure how safe any of this is without explicit user
synchronisation (i.e. don't call into the library until all threads have
been updated). Even changing options like GCR_EL1.Excl across multiple
threads may have unwanted effects. See this comment from Peter, the
difference being that instead of an explicit prctl() call on the current
stack, another thread would do it:

https://lore.kernel.org/linux-arch/CAMn1gO5rhOG1W+nVe103v=smvARcFFp_Ct9XqH2Ca4BUMfpDdg@mail.gmail.com/

> or just turn on sync tag checks later when using heap tagging.

I wonder whether setting the synchronous tag check mode by default would
improve this aspect. This would not have any effect until PROT_MTE is
used. If software wants some better performance they can explicitly opt
in to asynchronous mode or disable tag checking after some SIGSEGV +
reporting (this shouldn't exclude the environment variables you
currently use for controlling the tag check mode).

Also, if there are saner defaults for the user GCR_EL1.Excl (currently
all masked), we should decide them now.

If stack tagging will come with some ELF information, we could make the
default tag checking and GCR_EL1.Excl choices based on that, otherwise
maybe we should revisit the default configuration the kernel sets for
the user in the absence of any other information.

> > There is a case to switch tag checking from asynchronous to synchronous
> > at run-time based on a signal but that's rather specific to Android
> > where zygote controls the signal handler. I don't think you can do this
> > with libc. Even on Android, since the async fault signal is delivered
> > per thread, it probably does this lazily (alternatively, it could issue
> > a SIGUSRx to the other threads for synchronisation).
> 
> i think what that zygote is doing is a valid use-case but
> in a normal linux setup the application owns the signal
> handlers so the tag check switch has to be done by the
> application. the libc can expose some api for it, so in
> principle it's enough if the libc can do the runtime
> switch, but we dont plan to add new libc apis for mte.

Due to the synchronisation aspect especially regarding the stack
tagging, I'm not sure the kernel alone can safely do this.

Changing the tagged address syscall ABI across multiple threads should
be safer (well, at least the relaxing part). But if we don't solve the
other aspects I mentioned above, I don't think there is much point in
only doing it for this.

> > > - library code normally initializes per thread state on the first call
> > >   into the library from a given thread, but with mte, as soon as
> > >   memory / pointers are tagged in one thread, all threads are
> > >   affected: not performing checks in other threads is less secure (may
> > >   be ok) and it means incompatible syscall abi (not ok). so at least
> > >   PR_TAGGED_ADDR_ENABLE should have process wide setting for this
> > >   usage.
> > 
> > My assumption with MTE is that the libc will initialise it when the
> > library is loaded (something __attribute__((constructor))) and it's
> > still in single-threaded mode. Does it wait until the first malloc()
> > call? Also, is there such a thing as a per-thread initialiser for a
> > dynamic library (not sure it can be implemented in practice though)?
> 
> there is no per thread initializer in an elf module.
> (tls state is usually initialized lazily in threads
> when necessary.)
> 
> malloc calls can happen before the ctors of an LD_PRELOAD
> library and threads can be created before both.
> glibc runs ldpreload ctors after other library ctors.

In the presence of stack tagging, I think any subsequent MTE config
change across all threads is unsafe, irrespective of whether it's done
by the kernel or user via SIGUSRx. I think the best we can do here is
start with more appropriate defaults or enable them based on an ELF note
before the application is started. The dynamic loader would not have to
do anything extra here.

If we ignore stack tagging, the global configuration change may be
achievable. I think for the MTE bits, this could be done lazily by the
libc (e.g. on malloc()/free() call). The tag checking won't happen
before such calls unless we change the kernel defaults. There is still
the tagged address ABI enabling, could this be done lazily on syscall by
the libc? If not, the kernel could synchronise (force) this on syscall
entry from each thread based on some global prctl() bit.

Patch
diff mbox series

diff --git a/Documentation/arm64/cpu-feature-registers.rst b/Documentation/arm64/cpu-feature-registers.rst
index 314fa5bc2655..27d8559d565b 100644
--- a/Documentation/arm64/cpu-feature-registers.rst
+++ b/Documentation/arm64/cpu-feature-registers.rst
@@ -174,6 +174,8 @@  infrastructure:
      +------------------------------+---------+---------+
      | Name                         |  bits   | visible |
      +------------------------------+---------+---------+
+     | MTE                          | [11-8]  |    y    |
+     +------------------------------+---------+---------+
      | SSBS                         | [7-4]   |    y    |
      +------------------------------+---------+---------+
      | BT                           | [3-0]   |    y    |
diff --git a/Documentation/arm64/elf_hwcaps.rst b/Documentation/arm64/elf_hwcaps.rst
index 84a9fd2d41b4..bbd9cf54db6c 100644
--- a/Documentation/arm64/elf_hwcaps.rst
+++ b/Documentation/arm64/elf_hwcaps.rst
@@ -240,6 +240,10 @@  HWCAP2_BTI
 
     Functionality implied by ID_AA64PFR0_EL1.BT == 0b0001.
 
+HWCAP2_MTE
+
+    Functionality implied by ID_AA64PFR1_EL1.MTE == 0b0010, as described
+    by Documentation/arm64/memory-tagging-extension.rst.
 
 4. Unused AT_HWCAP bits
 -----------------------
diff --git a/Documentation/arm64/index.rst b/Documentation/arm64/index.rst
index 09cbb4ed2237..4cd0e696f064 100644
--- a/Documentation/arm64/index.rst
+++ b/Documentation/arm64/index.rst
@@ -14,6 +14,7 @@  ARM64 Architecture
     hugetlbpage
     legacy_instructions
     memory
+    memory-tagging-extension
     pointer-authentication
     silicon-errata
     sve
diff --git a/Documentation/arm64/memory-tagging-extension.rst b/Documentation/arm64/memory-tagging-extension.rst
new file mode 100644
index 000000000000..e3709b536b89
--- /dev/null
+++ b/Documentation/arm64/memory-tagging-extension.rst
@@ -0,0 +1,305 @@ 
+===============================================
+Memory Tagging Extension (MTE) in AArch64 Linux
+===============================================
+
+Authors: Vincenzo Frascino <vincenzo.frascino@arm.com>
+         Catalin Marinas <catalin.marinas@arm.com>
+
+Date: 2020-02-25
+
+This document describes the provision of the Memory Tagging Extension
+functionality in AArch64 Linux.
+
+Introduction
+============
+
+ARMv8.5 based processors introduce the Memory Tagging Extension (MTE)
+feature. MTE is built on top of the ARMv8.0 virtual address tagging TBI
+(Top Byte Ignore) feature and allows software to access a 4-bit
+allocation tag for each 16-byte granule in the physical address space.
+Such a memory range must be mapped with the Normal-Tagged memory
+attribute. A logical tag is derived from bits 59-56 of the virtual
+address used for the memory access. A CPU with MTE enabled will compare
+the logical tag against the allocation tag and potentially raise an
+exception on mismatch, subject to system register configuration.
+
+Userspace Support
+=================
+
+When ``CONFIG_ARM64_MTE`` is selected and Memory Tagging Extension is
+supported by the hardware, the kernel advertises the feature to
+userspace via ``HWCAP2_MTE``.
+
+PROT_MTE
+--------
+
+To access the allocation tags, a user process must enable the Tagged
+memory attribute on an address range using a new ``prot`` flag for
+``mmap()`` and ``mprotect()``:
+
+``PROT_MTE`` - Pages allow access to the MTE allocation tags.
+
+The allocation tag is set to 0 when such pages are first mapped in the
+user address space and preserved on copy-on-write. ``MAP_SHARED`` is
+supported and the allocation tags can be shared between processes.
+
+**Note**: ``PROT_MTE`` is only supported on ``MAP_ANONYMOUS`` and
+RAM-based file mappings (``tmpfs``, ``memfd``). Passing it to other
+types of mapping will result in ``-EINVAL`` returned by these system
+calls.
+
+**Note**: The ``PROT_MTE`` flag (and corresponding memory type) cannot
+be cleared by ``mprotect()``.
+
+**Note**: ``madvise()`` memory ranges with ``MADV_DONTNEED`` and
+``MADV_FREE`` may have the allocation tags cleared (set to 0) at any
+point after the system call.
+
+Tag Check Faults
+----------------
+
+When ``PROT_MTE`` is enabled on an address range and a mismatch between
+the logical and allocation tags occurs on access, there are three
+configurable behaviours:
+
+- *Ignore* - This is the default mode. The CPU (and kernel) ignores the
+  tag check fault.
+
+- *Synchronous* - The kernel raises a ``SIGSEGV`` synchronously, with
+  ``.si_code = SEGV_MTESERR`` and ``.si_addr = <fault-address>``. The
+  memory access is not performed. If ``SIGSEGV`` is ignored or blocked
+  by the offending thread, the containing process is terminated with a
+  ``coredump``.
+
+- *Asynchronous* - The kernel raises a ``SIGSEGV``, in the offending
+  thread, asynchronously following one or multiple tag check faults,
+  with ``.si_code = SEGV_MTEAERR`` and ``.si_addr = 0`` (the faulting
+  address is unknown).
+
+The user can select the above modes, per thread, using the
+``prctl(PR_SET_TAGGED_ADDR_CTRL, flags, 0, 0, 0)`` system call where
+``flags`` contain one of the following values in the ``PR_MTE_TCF_MASK``
+bit-field:
+
+- ``PR_MTE_TCF_NONE``  - *Ignore* tag check faults
+- ``PR_MTE_TCF_SYNC``  - *Synchronous* tag check fault mode
+- ``PR_MTE_TCF_ASYNC`` - *Asynchronous* tag check fault mode
+
+The current tag check fault mode can be read using the
+``prctl(PR_GET_TAGGED_ADDR_CTRL, 0, 0, 0, 0)`` system call.
+
+Tag checking can also be disabled for a user thread by setting the
+``PSTATE.TCO`` bit with ``MSR TCO, #1``.
+
+**Note**: Signal handlers are always invoked with ``PSTATE.TCO = 0``,
+irrespective of the interrupted context. ``PSTATE.TCO`` is restored on
+``sigreturn()``.
+
+**Note**: There are no *match-all* logical tags available for user
+applications.
+
+**Note**: Kernel accesses to the user address space (e.g. ``read()``
+system call) are not checked if the user thread tag checking mode is
+``PR_MTE_TCF_NONE`` or ``PR_MTE_TCF_ASYNC``. If the tag checking mode is
+``PR_MTE_TCF_SYNC``, the kernel makes a best effort to check its user
+address accesses, however it cannot always guarantee it.
+
+Excluding Tags in the ``IRG``, ``ADDG`` and ``SUBG`` instructions
+-----------------------------------------------------------------
+
+The architecture allows certain tags to be excluded from the randomly
+generated set via the ``GCR_EL1.Exclude`` register bit-field. By default,
+excludes all tags other than 0. A user thread can enable specific tags
+in the randomly generated set using the ``prctl(PR_SET_TAGGED_ADDR_CTRL,
+flags, 0, 0, 0)`` system call where ``flags`` contains the tags bitmap
+in the ``PR_MTE_TAG_MASK`` bit-field.
+
+**Note**: The hardware uses an exclude mask but the ``prctl()``
+interface provides an include mask. An include mask of ``0`` (exclusion
+mask ``0xffff``) results in the CPU always generating tag ``0``.
+
+Initial process state
+---------------------
+
+On ``execve()``, the new process has the following configuration:
+
+- ``PR_TAGGED_ADDR_ENABLE`` set to 0 (disabled)
+- Tag checking mode set to ``PR_MTE_TCF_NONE``
+- ``PR_MTE_TAG_MASK`` set to 0 (all tags excluded)
+- ``PSTATE.TCO`` set to 0
+- ``PROT_MTE`` not set on any of the initial memory maps
+
+On ``fork()``, the new process inherits the parent's configuration and
+memory map attributes with the exception of the ``madvise()`` ranges
+with ``MADV_WIPEONFORK`` which will have the data and tags cleared (set
+to 0).
+
+The ``ptrace()`` interface
+--------------------------
+
+``PTRACE_PEEKMTETAGS`` and ``PTRACE_POKEMTETAGS`` allow a tracer to read
+the tags from or set the tags to a tracee's address space. The
+``ptrace()`` system call is invoked as ``ptrace(request, pid, addr,
+data)`` where:
+
+- ``request`` - one of ``PTRACE_PEEKMTETAGS`` or ``PTRACE_POKEMTETAGS``.
+- ``pid`` - the tracee's PID.
+- ``addr`` - address in the tracee's address space.
+- ``data`` - pointer to a ``struct iovec`` where ``iov_base`` points to
+  a buffer of ``iov_len`` length in the tracer's address space.
+
+The tags in the tracer's ``iov_base`` buffer are represented as one
+4-bit tag per byte and correspond to a 16-byte MTE tag granule in the
+tracee's address space.
+
+**Note**: If ``addr`` is not aligned to a 16-byte granule, the kernel
+will use the corresponding aligned address.
+
+``ptrace()`` return value:
+
+- 0 - tags were copied, the tracer's ``iov_len`` was updated to the
+  number of tags transferred. This may be smaller than the requested
+  ``iov_len`` if the requested address range in the tracee's or the
+  tracer's space cannot be accessed or does not have valid tags.
+- ``-EPERM`` - the specified process cannot be traced.
+- ``-EIO`` - the tracee's address range cannot be accessed (e.g. invalid
+  address) and no tags copied. ``iov_len`` not updated.
+- ``-EFAULT`` - fault on accessing the tracer's memory (``struct iovec``
+  or ``iov_base`` buffer) and no tags copied. ``iov_len`` not updated.
+- ``-EOPNOTSUPP`` - the tracee's address does not have valid tags (never
+  mapped with the ``PROT_MTE`` flag). ``iov_len`` not updated.
+
+**Note**: There are no transient errors for the requests above, so user
+programs should not retry in case of a non-zero system call return.
+
+``PTRACE_GETREGSET`` and ``PTRACE_SETREGSET`` with ``addr ==
+NT_ARM_TAGGED_ADDR_CTRL`` allow ``ptrace()`` access to the tagged
+address ABI control and MTE configuration of a process as per the
+``prctl()`` options described in
+Documentation/arm64/tagged-address-abi.rst and above. The corresponding
+``regset`` is 1 element of 8 bytes (``sizeof(long)``).
+
+Example of correct usage
+========================
+
+*MTE Example code*
+
+.. code-block:: c
+
+    /*
+     * To be compiled with -march=armv8.5-a+memtag
+     */
+    #include <errno.h>
+    #include <stdint.h>
+    #include <stdio.h>
+    #include <stdlib.h>
+    #include <unistd.h>
+    #include <sys/auxv.h>
+    #include <sys/mman.h>
+    #include <sys/prctl.h>
+
+    /*
+     * From arch/arm64/include/uapi/asm/hwcap.h
+     */
+    #define HWCAP2_MTE              (1 << 18)
+
+    /*
+     * From arch/arm64/include/uapi/asm/mman.h
+     */
+    #define PROT_MTE                 0x20
+
+    /*
+     * From include/uapi/linux/prctl.h
+     */
+    #define PR_SET_TAGGED_ADDR_CTRL 55
+    #define PR_GET_TAGGED_ADDR_CTRL 56
+    # define PR_TAGGED_ADDR_ENABLE  (1UL << 0)
+    # define PR_MTE_TCF_SHIFT       1
+    # define PR_MTE_TCF_NONE        (0UL << PR_MTE_TCF_SHIFT)
+    # define PR_MTE_TCF_SYNC        (1UL << PR_MTE_TCF_SHIFT)
+    # define PR_MTE_TCF_ASYNC       (2UL << PR_MTE_TCF_SHIFT)
+    # define PR_MTE_TCF_MASK        (3UL << PR_MTE_TCF_SHIFT)
+    # define PR_MTE_TAG_SHIFT       3
+    # define PR_MTE_TAG_MASK        (0xffffUL << PR_MTE_TAG_SHIFT)
+
+    /*
+     * Insert a random logical tag into the given pointer.
+     */
+    #define insert_random_tag(ptr) ({                       \
+            uint64_t __val;                                 \
+            asm("irg %0, %1" : "=r" (__val) : "r" (ptr));   \
+            __val;                                          \
+    })
+
+    /*
+     * Set the allocation tag on the destination address.
+     */
+    #define set_tag(tagged_addr) do {                                      \
+            asm volatile("stg %0, [%0]" : : "r" (tagged_addr) : "memory"); \
+    } while (0)
+
+    int main()
+    {
+            unsigned char *a;
+            unsigned long page_sz = sysconf(_SC_PAGESIZE);
+            unsigned long hwcap2 = getauxval(AT_HWCAP2);
+
+            /* check if MTE is present */
+            if (!(hwcap2 & HWCAP2_MTE))
+                    return EXIT_FAILURE;
+
+            /*
+             * Enable the tagged address ABI, synchronous MTE tag check faults and
+             * allow all non-zero tags in the randomly generated set.
+             */
+            if (prctl(PR_SET_TAGGED_ADDR_CTRL,
+                      PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC | (0xfffe << PR_MTE_TAG_SHIFT),
+                      0, 0, 0)) {
+                    perror("prctl() failed");
+                    return EXIT_FAILURE;
+            }
+
+            a = mmap(0, page_sz, PROT_READ | PROT_WRITE,
+                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+            if (a == MAP_FAILED) {
+                    perror("mmap() failed");
+                    return EXIT_FAILURE;
+            }
+
+            /*
+             * Enable MTE on the above anonymous mmap. The flag could be passed
+             * directly to mmap() and skip this step.
+             */
+            if (mprotect(a, page_sz, PROT_READ | PROT_WRITE | PROT_MTE)) {
+                    perror("mprotect() failed");
+                    return EXIT_FAILURE;
+            }
+
+            /* access with the default tag (0) */
+            a[0] = 1;
+            a[1] = 2;
+
+            printf("a[0] = %hhu a[1] = %hhu\n", a[0], a[1]);
+
+            /* set the logical and allocation tags */
+            a = (unsigned char *)insert_random_tag(a);
+            set_tag(a);
+
+            printf("%p\n", a);
+
+            /* non-zero tag access */
+            a[0] = 3;
+            printf("a[0] = %hhu a[1] = %hhu\n", a[0], a[1]);
+
+            /*
+             * If MTE is enabled correctly the next instruction will generate an
+             * exception.
+             */
+            printf("Expecting SIGSEGV...\n");
+            a[16] = 0xdd;
+
+            /* this should not be printed in the PR_MTE_TCF_SYNC mode */
+            printf("...haven't got one\n");
+
+            return EXIT_FAILURE;
+    }