
[v3,23/23] arm64: mte: Add Memory Tagging Extension documentation

Message ID 20200421142603.3894-24-catalin.marinas@arm.com (mailing list archive)
State New, archived
Series arm64: Memory Tagging Extension user-space support

Commit Message

Catalin Marinas April 21, 2020, 2:26 p.m. UTC
From: Vincenzo Frascino <vincenzo.frascino@arm.com>

Memory Tagging Extension (part of the ARMv8.5 Extensions) provides
a mechanism to detect the sources of memory-related errors which
may be vulnerable to exploitation, including bounds violations,
use-after-free, use-after-return, use-out-of-scope and
use-before-initialization errors.

Add Memory Tagging Extension documentation for the arm64 Linux
kernel support.

Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Co-developed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
---

Notes:
    v3:
    - Modify the uaccess checking conditions: only when the sync mode is
      selected by the user. In async mode, the kernel uaccesses are not
      checked.
    - Clarify that an include mask of 0 (exclude mask 0xffff) results in
      always generating tag 0.
    - Document the ptrace() interface.
    
    v2:
    - Documented the uaccess kernel tag checking mode.
    - Removed the BTI definitions from cpu-feature-registers.rst.
    - Removed the paragraph stating that MTE depends on the tagged address
      ABI (while the Kconfig entry does, there is no requirement for the
      user to enable both).
    - Changed the GCR_EL1.Exclude handling description following the change
      in the prctl() interface (include vs exclude mask).
    - Updated the example code.

 Documentation/arm64/cpu-feature-registers.rst |   2 +
 Documentation/arm64/elf_hwcaps.rst            |   5 +
 Documentation/arm64/index.rst                 |   1 +
 .../arm64/memory-tagging-extension.rst        | 260 ++++++++++++++++++
 4 files changed, 268 insertions(+)
 create mode 100644 Documentation/arm64/memory-tagging-extension.rst

Comments

Dave Martin April 29, 2020, 4:47 p.m. UTC | #1
On Tue, Apr 21, 2020 at 03:26:03PM +0100, Catalin Marinas wrote:
> From: Vincenzo Frascino <vincenzo.frascino@arm.com>
> 
> Memory Tagging Extension (part of the ARMv8.5 Extensions) provides
> a mechanism to detect the sources of memory related errors which
> may be vulnerable to exploitation, including bounds violations,
> use-after-free, use-after-return, use-out-of-scope and use before
> initialization errors.
> 
> Add Memory Tagging Extension documentation for the arm64 linux
> kernel support.
> 
> Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
> Co-developed-by: Catalin Marinas <catalin.marinas@arm.com>
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> ---
> 
> Notes:
>     v3:
>     - Modify the uaccess checking conditions: only when the sync mode is
>       selected by the user. In async mode, the kernel uaccesses are not
>       checked.
>     - Clarify that an include mask of 0 (exclude mask 0xffff) results in
>       always generating tag 0.
>     - Document the ptrace() interface.
>     
>     v2:
>     - Documented the uaccess kernel tag checking mode.
>     - Removed the BTI definitions from cpu-feature-registers.rst.
>     - Removed the paragraph stating that MTE depends on the tagged address
>       ABI (while the Kconfig entry does, there is no requirement for the
>       user to enable both).
>     - Changed the GCR_EL1.Exclude handling description following the change
>       in the prctl() interface (include vs exclude mask).
>     - Updated the example code.
> 
>  Documentation/arm64/cpu-feature-registers.rst |   2 +
>  Documentation/arm64/elf_hwcaps.rst            |   5 +
>  Documentation/arm64/index.rst                 |   1 +
>  .../arm64/memory-tagging-extension.rst        | 260 ++++++++++++++++++
>  4 files changed, 268 insertions(+)
>  create mode 100644 Documentation/arm64/memory-tagging-extension.rst
> 
> diff --git a/Documentation/arm64/cpu-feature-registers.rst b/Documentation/arm64/cpu-feature-registers.rst
> index 41937a8091aa..b5679fa85ad9 100644
> --- a/Documentation/arm64/cpu-feature-registers.rst
> +++ b/Documentation/arm64/cpu-feature-registers.rst
> @@ -174,6 +174,8 @@ infrastructure:
>       +------------------------------+---------+---------+
>       | Name                         |  bits   | visible |
>       +------------------------------+---------+---------+
> +     | MTE                          | [11-8]  |    y    |
> +     +------------------------------+---------+---------+
>       | SSBS                         | [7-4]   |    y    |
>       +------------------------------+---------+---------+
>  
> diff --git a/Documentation/arm64/elf_hwcaps.rst b/Documentation/arm64/elf_hwcaps.rst
> index 7dfb97dfe416..ca7f90e99e3a 100644
> --- a/Documentation/arm64/elf_hwcaps.rst
> +++ b/Documentation/arm64/elf_hwcaps.rst
> @@ -236,6 +236,11 @@ HWCAP2_RNG
>  
>      Functionality implied by ID_AA64ISAR0_EL1.RNDR == 0b0001.
>  
> +HWCAP2_MTE
> +
> +    Functionality implied by ID_AA64PFR1_EL1.MTE == 0b0010, as described
> +    by Documentation/arm64/memory-tagging-extension.rst.
> +
>  4. Unused AT_HWCAP bits
>  -----------------------
>  
> diff --git a/Documentation/arm64/index.rst b/Documentation/arm64/index.rst
> index 09cbb4ed2237..4cd0e696f064 100644
> --- a/Documentation/arm64/index.rst
> +++ b/Documentation/arm64/index.rst
> @@ -14,6 +14,7 @@ ARM64 Architecture
>      hugetlbpage
>      legacy_instructions
>      memory
> +    memory-tagging-extension
>      pointer-authentication
>      silicon-errata
>      sve
> diff --git a/Documentation/arm64/memory-tagging-extension.rst b/Documentation/arm64/memory-tagging-extension.rst
> new file mode 100644
> index 000000000000..f82dfbd70061
> --- /dev/null
> +++ b/Documentation/arm64/memory-tagging-extension.rst
> @@ -0,0 +1,260 @@
> +===============================================
> +Memory Tagging Extension (MTE) in AArch64 Linux
> +===============================================
> +
> +Authors: Vincenzo Frascino <vincenzo.frascino@arm.com>
> +         Catalin Marinas <catalin.marinas@arm.com>
> +
> +Date: 2020-02-25
> +
> +This document describes the provision of the Memory Tagging Extension
> +functionality in AArch64 Linux.
> +
> +Introduction
> +============
> +
> +ARMv8.5 based processors introduce the Memory Tagging Extension (MTE)
> +feature. MTE is built on top of the ARMv8.0 virtual address tagging TBI
> +(Top Byte Ignore) feature and allows software to access a 4-bit
> +allocation tag for each 16-byte granule in the physical address space.
> +Such memory range must be mapped with the Normal-Tagged memory
> +attribute. A logical tag is derived from bits 59-56 of the virtual
> +address used for the memory access. A CPU with MTE enabled will compare
> +the logical tag against the allocation tag and potentially raise an
> +exception on mismatch, subject to system registers configuration.
> +
> +Userspace Support
> +=================
> +
> +When ``CONFIG_ARM64_MTE`` is selected and Memory Tagging Extension is
> +supported by the hardware, the kernel advertises the feature to
> +userspace via ``HWCAP2_MTE``.
> +
> +PROT_MTE
> +--------
> +
> +To access the allocation tags, a user process must enable the Tagged
> +memory attribute on an address range using a new ``prot`` flag for
> +``mmap()`` and ``mprotect()``:
> +
> +``PROT_MTE`` - Pages allow access to the MTE allocation tags.
> +
> +The allocation tag is set to 0 when such pages are first mapped in the
> +user address space and preserved on copy-on-write. ``MAP_SHARED`` is
> +supported and the allocation tags can be shared between processes.
> +
> +**Note**: ``PROT_MTE`` is only supported on ``MAP_ANONYMOUS`` and
> +RAM-based file mappings (``tmpfs``, ``memfd``). Passing it to other
> +types of mapping will result in ``-EINVAL`` returned by these system
> +calls.
> +
> +**Note**: The ``PROT_MTE`` flag (and corresponding memory type) cannot
> +be cleared by ``mprotect()``.

What enforces this?  I don't have my head fully around the code yet.

I'm wondering whether attempting to clear PROT_MTE should be reported as
an error.  Is there any rationale for not doing so?


> +
> +Tag Check Faults
> +----------------
> +
> +When ``PROT_MTE`` is enabled on an address range and a mismatch between
> +the logical and allocation tags occurs on access, there are three
> +configurable behaviours:
> +
> +- *Ignore* - This is the default mode. The CPU (and kernel) ignores the
> +  tag check fault.
> +
> +- *Synchronous* - The kernel raises a ``SIGSEGV`` synchronously, with
> +  ``.si_code = SEGV_MTESERR`` and ``.si_addr = <fault-address>``. The
> +  memory access is not performed.

Also say that in this case, if SIGSEGV is ignored or blocked by the
offending thread, then the containing process is terminated with a
coredump (at least, that's what ought to happen).

> +
> +- *Asynchronous* - The kernel raises a ``SIGSEGV``, in the current
> +  thread, asynchronously following one or multiple tag check faults,
> +  with ``.si_code = SEGV_MTEAERR`` and ``.si_addr = 0``.

For "current thread": that's a kernel concept.  For user-facing
documentation, can we say "the offending thread" or similar?

For clarity, it's worth saying that the faulting address is not
reported.  Or, we could be optimistic that someday this information will
be available and say that si_addr is the faulting address if available,
with 0 meaning the address is not available.

Maybe (void *)-1 would be a better duff address, but I can't see it
mattering much.  If there's already precedent for si_addr==0 elsewhere,
it makes sense to follow it.
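
For illustration, a userspace handler distinguishing the two cases above
(SEGV_MTESERR with a fault address, SEGV_MTEAERR without one) might look
like the sketch below. The fallback si_code values are only an
assumption in case the libc headers do not define them yet, and printing
from a signal handler is not async-signal-safe; this is purely
illustrative:

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Fallback definitions; the authoritative values are in the uapi headers. */
    #ifndef SEGV_MTEAERR
    #define SEGV_MTEAERR    8
    #endif
    #ifndef SEGV_MTESERR
    #define SEGV_MTESERR    9
    #endif

    static void segv_handler(int sig, siginfo_t *si, void *uc)
    {
            if (si->si_code == SEGV_MTESERR)
                    fprintf(stderr, "sync tag check fault at %p\n", si->si_addr);
            else if (si->si_code == SEGV_MTEAERR)
                    fprintf(stderr, "async tag check fault, address unknown\n");
            exit(EXIT_FAILURE);
    }

    static void install_segv_handler(void)
    {
            struct sigaction sa = { 0 };

            sa.sa_sigaction = segv_handler;
            sa.sa_flags = SA_SIGINFO;
            sigaction(SIGSEGV, &sa, NULL);
    }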

> +
> +**Note**: There are no *match-all* logical tags available for user
> +applications.

This note seems misplaced.

> +
> +The user can select the above modes, per thread, using the
> +``prctl(PR_SET_TAGGED_ADDR_CTRL, flags, 0, 0, 0)`` system call where

PR_GET_TAGGED_ADDR_CTRL seems to be missing here.

> +``flags`` contain one of the following values in the ``PR_MTE_TCF_MASK``
> +bit-field:
> +
> +- ``PR_MTE_TCF_NONE``  - *Ignore* tag check faults
> +- ``PR_MTE_TCF_SYNC``  - *Synchronous* tag check fault mode
> +- ``PR_MTE_TCF_ASYNC`` - *Asynchronous* tag check fault mode

Done naively, this will destroy the PR_MTE_TAG_MASK field.  Is there a
preferred way to change only parts of this control word?  If the answer
is "cache the value in userspace if you care about performance, or
otherwise use PR_GET_TAGGED_ADDR_CTRL as part of a read-modify-write,"
so be it.

If we think this might be an issue for software, it might be worth
splitting out separate prctls for each field.
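
If the read-modify-write route is deemed acceptable, it would look
roughly like the sketch below (reusing the PR_* constants defined in the
example program in the patch; set_tcf_mode() is a hypothetical helper,
not part of any proposed API):

    #include <sys/prctl.h>

    /*
     * Change only the tag check fault mode, preserving the tag include
     * mask and the tagged address ABI enable bit.
     */
    static int set_tcf_mode(unsigned long tcf)
    {
            unsigned long ctrl;
            int ret;

            ret = prctl(PR_GET_TAGGED_ADDR_CTRL, 0, 0, 0, 0);
            if (ret < 0)
                    return ret;

            ctrl = ((unsigned long)ret & ~PR_MTE_TCF_MASK) |
                   (tcf & PR_MTE_TCF_MASK);
            return prctl(PR_SET_TAGGED_ADDR_CTRL, ctrl, 0, 0, 0);
    }

e.g. an async SIGSEGV handler could call set_tcf_mode(PR_MTE_TCF_SYNC)
to switch the offending thread to the synchronous mode.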

> +
> +Tag checking can also be disabled for a user thread by setting the
> +``PSTATE.TCO`` bit with ``MSR TCO, #1``.

Users should probably not touch this unless they know what they're
doing -- should this flag ever be left set across function boundaries
etc.?

What's it for?  Temporarily masking MTE faults in critical sections?
Is this self-synchronising... what happens to pending asynchronous
faults?  Are faults occurring while the flag is set pended or discarded?

(Deliberately not reading the spec here -- if the explanation is not
straightforward, then it may be sufficient to tell people to go read
it.)

> +
> +**Note**: Signal handlers are always invoked with ``PSTATE.TCO = 0``,
> +irrespective of the interrupted context.

Rationale?  Do we have advice on what signal handlers should do?

Is PSTATE.TC0 restored by sigreturn?

> +
> +**Note**: Kernel accesses to user memory (e.g. ``read()`` system call)
> +are only checked if the current thread tag checking mode is
> +PR_MTE_TCF_SYNC.

Vague?  Can we make a precise statement about when the kernel will and
won't check such accesses?  And aren't there limitations (like use of
get_user_pages() etc.)?

> +
> +Excluding Tags in the ``IRG``, ``ADDG`` and ``SUBG`` instructions
> +-----------------------------------------------------------------
> +
> +The architecture allows excluding certain tags to be randomly generated
> +via the ``GCR_EL1.Exclude`` register bit-field. By default, Linux

Can we have a separate section on what execve() and fork()/clone() do
to the MTE controls and PSTATE.TCO?  "By default" could mean a variety
of things, and I'm not sure we cover everything.

Is PROT_MTE ever set on the initial pages mapped by execve()?

> +excludes all tags other than 0. A user thread can enable specific tags
> +in the randomly generated set using the ``prctl(PR_SET_TAGGED_ADDR_CTRL,
> +flags, 0, 0, 0)`` system call where ``flags`` contains the tags bitmap
> +in the ``PR_MTE_TAG_MASK`` bit-field.
> +
> +**Note**: The hardware uses an exclude mask but the ``prctl()``
> +interface provides an include mask. An include mask of ``0`` (exclusion
> +mask ``0xffff``) results in the CPU always generating tag ``0``.

Is there no way to make this default to 1 rather than having a magic
meaning for 0?

> +
> +The ``ptrace()`` interface
> +--------------------------
> +
> +``PTRACE_PEEKMTETAGS`` and ``PTRACE_POKEMTETAGS`` allow a tracer to read
> +the tags from or set the tags to a tracee's address space. The
> +``ptrace()`` syscall is invoked as ``ptrace(request, pid, addr, data)``
> +where:
> +
> +- ``request`` - one of ``PTRACE_PEEKMTETAGS`` or ``PTRACE_POKEMTETAGS``.
> +- ``pid`` - the tracee's PID.
> +- ``addr`` - address in the tracee's address space.

What if addr is not 16-byte aligned?  Is this considered valid use?

> +- ``data`` - pointer to a ``struct iovec`` where ``iov_base`` points to
> +  a buffer of ``iov_len`` length in the tracer's address space.

What's the data format for the copied tags?

> +
> +The tags in the tracer's ``iov_base`` buffer are represented as one tag
> +per byte and correspond to a 16-byte MTE tag granule in the tracee's
> +address space.

We could say that the whole operation accesses the tags for 16 * iov_len
bytes of the tracee's address space.  Maybe superfluous though.
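
For concreteness, a tracer-side sketch (assuming the PTRACE_PEEKMTETAGS
request constant from this series' uapi headers, a fallback value that
is only a guess, and a tracee already stopped under ptrace; peek_tags()
is a hypothetical helper):

    #include <stdio.h>
    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    #ifndef PTRACE_PEEKMTETAGS
    #define PTRACE_PEEKMTETAGS      33      /* check the uapi headers */
    #endif

    /* Read the tags covering 16 * len bytes starting at addr in the tracee. */
    static long peek_tags(pid_t pid, void *addr, unsigned char *tags, size_t len)
    {
            struct iovec iov = {
                    .iov_base = tags,       /* one tag per byte on return */
                    .iov_len  = len,        /* updated to the tags actually copied */
            };
            long ret;

            ret = ptrace(PTRACE_PEEKMTETAGS, pid, addr, &iov);
            if (!ret)
                    printf("%zu tags copied\n", iov.iov_len);
            return ret;
    }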

> +
> +``ptrace()`` return value:
> +
> +- 0 - success, the tracer's ``iov_len`` was updated to the number of
> +  tags copied (it may be smaller than the requested ``iov_len`` if the
> +  requested address range in the tracee's or the tracer's space cannot
> +  be fully accessed).

I'd replace "success" with something like "some tags were copied:
``iov_len`` is updated to indicate the actual number of tags
transferred.  This may be fewer than requested: [...]"

Can we get a short PEEKTAGS/POKETAGS for transient reasons (like minor
page faults)?  i.e., should the caller attempt to retry, or is that a
stupid thing to do?

> +- ``-EPERM`` - the specified process cannot be traced.
> +- ``-EIO`` - the tracee's address range cannot be accessed (e.g. invalid
> +  address) and no tags copied. ``iov_len`` not updated.
> +- ``-EFAULT`` - fault on accessing the tracer's memory (``struct iovec``
> +  or ``iov_base`` buffer) and no tags copied. ``iov_len`` not updated.
> +
> +Example of correct usage
> +========================
> +
> +*MTE Example code*
> +
> +.. code-block:: c
> +
> +    /*
> +     * To be compiled with -march=armv8.5-a+memtag
> +     */
> +    #include <errno.h>
> +    #include <stdio.h>
> +    #include <stdlib.h>
> +    #include <unistd.h>
> +    #include <sys/auxv.h>
> +    #include <sys/mman.h>
> +    #include <sys/prctl.h>
> +
> +    /*
> +     * From arch/arm64/include/uapi/asm/hwcap.h
> +     */
> +    #define HWCAP2_MTE              (1 << 18)
> +
> +    /*
> +     * From arch/arm64/include/uapi/asm/mman.h
> +     */
> +    #define PROT_MTE                 0x20
> +
> +    /*
> +     * From include/uapi/linux/prctl.h
> +     */
> +    #define PR_SET_TAGGED_ADDR_CTRL 55
> +    #define PR_GET_TAGGED_ADDR_CTRL 56
> +    # define PR_TAGGED_ADDR_ENABLE  (1UL << 0)
> +    # define PR_MTE_TCF_SHIFT       1
> +    # define PR_MTE_TCF_NONE        (0UL << PR_MTE_TCF_SHIFT)
> +    # define PR_MTE_TCF_SYNC        (1UL << PR_MTE_TCF_SHIFT)
> +    # define PR_MTE_TCF_ASYNC       (2UL << PR_MTE_TCF_SHIFT)
> +    # define PR_MTE_TCF_MASK        (3UL << PR_MTE_TCF_SHIFT)
> +    # define PR_MTE_TAG_SHIFT       3
> +    # define PR_MTE_TAG_MASK        (0xffffUL << PR_MTE_TAG_SHIFT)
> +
> +    /*
> +     * Insert a random logical tag into the given pointer.
> +     */
> +    #define insert_random_tag(ptr) ({                       \
> +            __u64 __val;                                    \
> +            asm("irg %0, %1" : "=r" (__val) : "r" (ptr));   \
> +            __val;                                          \
> +    })
> +
> +    /*
> +     * Set the allocation tag on the destination address.
> +     */
> +    #define set_tag(tagged_addr) do {                                      \
> +            asm volatile("stg %0, [%0]" : : "r" (tagged_addr) : "memory"); \
> +    } while (0)
> +
> +    int main()
> +    {
> +            unsigned long *a;
> +            unsigned long page_sz = getpagesize();

Nit: obsolete in POSIX.  Prefer sysconf(_SC_PAGESIZE).

> +            unsigned long hwcap2 = getauxval(AT_HWCAP2);
> +
> +            /* check if MTE is present */
> +            if (!(hwcap2 & HWCAP2_MTE))
> +                    return -1;

Nit: -1 isn't a valid exit code, so it's preferable to return 1 or
EXIT_FAILURE.

> +
> +            /*
> +             * Enable the tagged address ABI, synchronous MTE tag check faults and
> +             * allow all non-zero tags in the randomly generated set.
> +             */
> +            if (prctl(PR_SET_TAGGED_ADDR_CTRL,
> +                      PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC | (0xfffe << PR_MTE_TAG_SHIFT),
> +                      0, 0, 0)) {
> +                    perror("prctl() failed");
> +                    return -1;
> +            }
> +
> +            a = mmap(0, page_sz, PROT_READ | PROT_WRITE,
> +                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

Is this a valid assignment?

I can't remember whether C's "pointer values must be correctly aligned"
rule applies only to dereferences, or whether it applies to conversions
too.  From memory I have a feeling that it does.

If so, the compiler could legitimately optimise the failure check away,
since MAP_FAILED is not correctly aligned for unsigned long.

> +            if (a == MAP_FAILED) {
> +                    perror("mmap() failed");
> +                    return -1;
> +            }
> +
> +            /*
> +             * Enable MTE on the above anonymous mmap. The flag could be passed
> +             * directly to mmap() and skip this step.
> +             */
> +            if (mprotect(a, page_sz, PROT_READ | PROT_WRITE | PROT_MTE)) {
> +                    perror("mprotect() failed");
> +                    return -1;
> +            }
> +
> +            /* access with the default tag (0) */
> +            a[0] = 1;
> +            a[1] = 2;
> +
> +            printf("a[0] = %lu a[1] = %lu\n", a[0], a[1]);
> +
> +            /* set the logical and allocation tags */
> +            a = (unsigned long *)insert_random_tag(a);
> +            set_tag(a);
> +
> +            printf("%p\n", a);
> +
> +            /* non-zero tag access */
> +            a[0] = 3;
> +            printf("a[0] = %lu a[1] = %lu\n", a[0], a[1]);
> +
> +            /*
> +             * If MTE is enabled correctly the next instruction will generate an
> +             * exception.
> +             */
> +            printf("Expecting SIGSEGV...\n");
> +            a[2] = 0xdead;
> +
> +            /* this should not be printed in the PR_MTE_TCF_SYNC mode */
> +            printf("...done\n");
> +
> +            return 0;
> +    }

Since this shouldn't happen, can we print an error and return nonzero?

[...]

Cheers
---Dave
Catalin Marinas April 30, 2020, 4:23 p.m. UTC | #2
On Wed, Apr 29, 2020 at 05:47:05PM +0100, Dave P Martin wrote:
> On Tue, Apr 21, 2020 at 03:26:03PM +0100, Catalin Marinas wrote:
> > +Userspace Support
> > +=================
> > +
> > +When ``CONFIG_ARM64_MTE`` is selected and Memory Tagging Extension is
> > +supported by the hardware, the kernel advertises the feature to
> > +userspace via ``HWCAP2_MTE``.
> > +
> > +PROT_MTE
> > +--------
> > +
> > +To access the allocation tags, a user process must enable the Tagged
> > +memory attribute on an address range using a new ``prot`` flag for
> > +``mmap()`` and ``mprotect()``:
> > +
> > +``PROT_MTE`` - Pages allow access to the MTE allocation tags.
> > +
> > +The allocation tag is set to 0 when such pages are first mapped in the
> > +user address space and preserved on copy-on-write. ``MAP_SHARED`` is
> > +supported and the allocation tags can be shared between processes.
> > +
> > +**Note**: ``PROT_MTE`` is only supported on ``MAP_ANONYMOUS`` and
> > +RAM-based file mappings (``tmpfs``, ``memfd``). Passing it to other
> > +types of mapping will result in ``-EINVAL`` returned by these system
> > +calls.
> > +
> > +**Note**: The ``PROT_MTE`` flag (and corresponding memory type) cannot
> > +be cleared by ``mprotect()``.
> 
> What enforces this?  I don't have my head fully around the code yet.
> 
> I'm wondering whether attempting to clear PROT_MTE should be reported as
> an error.  Is there any rationale for not doing so?

A use case is a JIT compiler where the memory is allocated by some
malloc() code with PROT_MTE set and passed down to a code generator
library which may not be MTE aware (and doesn't need to be, only tagged
pointer aware). Such a library, once it has generated the code, may do
an mprotect(PROT_READ|PROT_EXEC) without PROT_MTE. We didn't want to
inadvertently clear PROT_MTE, especially if the memory will be given
back to the original allocator (free) at some point.

Basically mprotect() may be done outside the heap allocator but it
should not interfere with the allocator's decision to use MTE. For this
reason, I wouldn't report an error but silently ignore the lack of
PROT_MTE.

The way we handle this is by not including VM_MTE in VM_ARCH_CLEAR
(VM_MPX isn't either; VM_SPARC_ADI is, but when that was added the
syscall ABI didn't even accept tagged pointers).

> > +Tag Check Faults
> > +----------------
> > +
> > +When ``PROT_MTE`` is enabled on an address range and a mismatch between
> > +the logical and allocation tags occurs on access, there are three
> > +configurable behaviours:
> > +
> > +- *Ignore* - This is the default mode. The CPU (and kernel) ignores the
> > +  tag check fault.
> > +
> > +- *Synchronous* - The kernel raises a ``SIGSEGV`` synchronously, with
> > +  ``.si_code = SEGV_MTESERR`` and ``.si_addr = <fault-address>``. The
> > +  memory access is not performed.
> 
> Also say that if in this case, if SIGSEGV is ignored or blocked by the
> offending thread then containing processes is terminated with a coredump
> (at least, that's what ought to happen).

Makes sense.

> > +
> > +- *Asynchronous* - The kernel raises a ``SIGSEGV``, in the current
> > +  thread, asynchronously following one or multiple tag check faults,
> > +  with ``.si_code = SEGV_MTEAERR`` and ``.si_addr = 0``.
> 
> For "current thread": that's a kernel concept.  For user-facing
> documentation, can we say "the offending thread" or similar?
> 
> For clarity, it's worth saying that the faulting address is not
> reported.  Or, we could be optimistic that someday this information will
> be available and say that si_addr is the faulting address if available,
> with 0 meaning the address is not available.
> 
> Maybe (void *)-1 would be better duff address, but I can't see it
> mattering much.  If there's already precedent for si_addr==0 elsewhere,
> it makes sense to follow it.

At a quick grep, I can see a few instances on other architectures where
si_addr==0. I'll add a comment here.

If the hardware gives us something in the future, it will likely be in a
separate register and we can present it as a new sigcontext structure.
In the meantime I'll add some text that the faulting address is
unknown.

> > +**Note**: There are no *match-all* logical tags available for user
> > +applications.
> 
> This note seems misplaced.

This was in the context of tag checking. I'll move it further down when
talking about PSTATE.TCO.

> > +
> > +The user can select the above modes, per thread, using the
> > +``prctl(PR_SET_TAGGED_ADDR_CTRL, flags, 0, 0, 0)`` system call where
> 
> PR_GET_TAGGED_ADDR_CTRL seems to be missing here.

Added.

> > +``flags`` contain one of the following values in the ``PR_MTE_TCF_MASK``
> > +bit-field:
> > +
> > +- ``PR_MTE_TCF_NONE``  - *Ignore* tag check faults
> > +- ``PR_MTE_TCF_SYNC``  - *Synchronous* tag check fault mode
> > +- ``PR_MTE_TCF_ASYNC`` - *Asynchronous* tag check fault mode
> 
> Done naively, this will destroy the PR_MTE_TAG_MASK field.  Is there a
> preferred way to change only parts of this control word?  If the answer
> is "cache the value in userspace if you care about performance, or
> otherwise use PR_GET_TAGGED_ADDR_CTRL as part of a read-modify-write,"
> so be it.
> 
> If we think this might be an issue for software, it might be worth
> splitting out separate prctls for each field.)

We lack some feedback from user space people on how this prctl is going
to be used. I worked on the assumption that it is a one-off event during
libc setup, potentially driven by some environment variable (but that's
the user's problem).

There were some suggestions that on an async SIGSEGV, the handler may
switch to synchronous mode. Since that's a rare event, a get/set
approach would be fine.

Anyway, with an additional argument to prctl (we have 3 spare), we could
do a set/clear mask approach. The current behaviour could be emulated
as:

  prctl(PR_SET_TAGGED_ADDR_CTRL, PR_MTE_bits, -1UL, 0, 0);

where -1 is the clear mask. The mask can be 0 for the initial prctl() or
we can say that if the mask is non-zero, only the bits in the mask will
be set.

If you want to only set the TCF bits:

  prctl(PR_SET_TAGGED_ADDR_CTRL, PR_MTE_TCF_SYNC, PR_MTE_TCF_MASK, 0, 0);

> > +Tag checking can also be disabled for a user thread by setting the
> > +``PSTATE.TCO`` bit with ``MSR TCO, #1``.
> 
> Users should probably not touch this unless they know what they're
> doing -- should this flag ever be left set across function boundaries
> etc.?

We can't control function boundaries from the kernel anyway.

> What's it for?  Temporarily masking MTE faults in critical sections?
> Is this self-synchronising... what happens to pending asynchronous
> faults?  Are faults occurring while the flag is set pended or discarded?

Something like a garbage collector scanning the memory. Since we do not
allow tag 0 as a match-all, it needs a cheaper option than prctl().
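
For that kind of use, the sequence would be roughly the sketch below
(built with -march=armv8.5-a+memtag; tag_checks_off()/tag_checks_on()
and scan_words() are made-up names, and whether suppressing checks for
the whole scan is acceptable is the application's call):

    #include <stddef.h>

    /* PSTATE.TCO = 1: tag check faults suppressed for this thread. */
    #define tag_checks_off()        asm volatile("msr tco, #1" ::: "memory")
    /* PSTATE.TCO = 0: tag checking takes place again. */
    #define tag_checks_on()         asm volatile("msr tco, #0" ::: "memory")

    /* Scan a range regardless of how its granules are currently tagged. */
    static unsigned long scan_words(const unsigned long *p, size_t nwords)
    {
            unsigned long sum = 0;
            size_t i;

            tag_checks_off();
            for (i = 0; i < nwords; i++)
                    sum += p[i];
            tag_checks_on();

            return sum;
    }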

> > +**Note**: Signal handlers are always invoked with ``PSTATE.TCO = 0``,
> > +irrespective of the interrupted context.
> 
> Rationale?  Do we have advice on what signal handlers should do?

Well, that's the default mode: tag check override = 0 means that tag
checking takes place.

> Is PSTATE.TC0 restored by sigreturn?

s/TC0/TCO/

Yes, it is restored on sigreturn.

> > +**Note**: Kernel accesses to user memory (e.g. ``read()`` system call)
> > +are only checked if the current thread tag checking mode is
> > +PR_MTE_TCF_SYNC.
> 
> Vague?  Can we make a precise statement about when the kernel will and
> won't check such accesses?  And aren't there limitations (like use of
> get_user_pages() etc.)?

We could make it slightly clearer by saying "kernel accesses to the user
address space".

> > +Excluding Tags in the ``IRG``, ``ADDG`` and ``SUBG`` instructions
> > +-----------------------------------------------------------------
> > +
> > +The architecture allows excluding certain tags to be randomly generated
> > +via the ``GCR_EL1.Exclude`` register bit-field. By default, Linux
> 
> Can we have a separate section on what execve() and fork()/clone() do
> to the MTE controls and PSTATE.TCO?  "By default" could mean a variety
> of things, and I'm not sure we cover everything.

Good point. I'll add a note on initial state for processes and threads.

> Is PROT_MTE ever set on the initial pages mapped by execve()?

No. There were discussions about mapping the initial stack with PROT_MTE
based on some ELF note but it can also be done in userspace with
mprotect(). I think we concluded that the .data/.bss sections will be
untagged.

> > +excludes all tags other than 0. A user thread can enable specific tags
> > +in the randomly generated set using the ``prctl(PR_SET_TAGGED_ADDR_CTRL,
> > +flags, 0, 0, 0)`` system call where ``flags`` contains the tags bitmap
> > +in the ``PR_MTE_TAG_MASK`` bit-field.
> > +
> > +**Note**: The hardware uses an exclude mask but the ``prctl()``
> > +interface provides an include mask. An include mask of ``0`` (exclusion
> > +mask ``0xffff``) results in the CPU always generating tag ``0``.
> 
> Is there no way to make this default to 1 rather than having a magic
> meaning for 0?

We follow the hardware behaviour where 0xffff and 0xfffe give the same
result.

> > +The ``ptrace()`` interface
> > +--------------------------
> > +
> > +``PTRACE_PEEKMTETAGS`` and ``PTRACE_POKEMTETAGS`` allow a tracer to read
> > +the tags from or set the tags to a tracee's address space. The
> > +``ptrace()`` syscall is invoked as ``ptrace(request, pid, addr, data)``
> > +where:
> > +
> > +- ``request`` - one of ``PTRACE_PEEKMTETAGS`` or ``PTRACE_POKEMTETAGS``.
> > +- ``pid`` - the tracee's PID.
> > +- ``addr`` - address in the tracee's address space.
> 
> What if addr is not 16-byte aligned?  Is this considered valid use?

Yes, I don't think we should impose a restriction here. Each address in
a 16-byte range has the same (shared) tag.

> > +- ``data`` - pointer to a ``struct iovec`` where ``iov_base`` points to
> > +  a buffer of ``iov_len`` length in the tracer's address space.
> 
> What's the data format for the copied tags?

I could state that the tags are placed in the lower 4 bits of each byte
with the upper 4 bits set to 0.

> > +The tags in the tracer's ``iov_base`` buffer are represented as one tag
> > +per byte and correspond to a 16-byte MTE tag granule in the tracee's
> > +address space.
> 
> We could say that the whole operation accesses the tags for 16 * iov_len
> bytes of the tracee's address space.  Maybe superfluous though.
> 
> > +
> > +``ptrace()`` return value:
> > +
> > +- 0 - success, the tracer's ``iov_len`` was updated to the number of
> > +  tags copied (it may be smaller than the requested ``iov_len`` if the
> > +  requested address range in the tracee's or the tracer's space cannot
> > +  be fully accessed).
> 
> I'd replace "success" with something like "some tags were copied:
> ``iov_len`` is updated to indicate the actual number of tags
> transferred.  This may be fewer than requested: [...]"
> 
> Can we get a short PEEKTAGS/POKETAGS for transient reasons (like minor
> page faults)?  i.e., should the caller attempt to retry, or is that a
> a stupid thing to do?

I initially thought it should retry but managed to get the interface so
that no retries are needed. If fewer tags were transferred, it's for a
good reason (e.g. permission fault).

[...]

> > +            a = mmap(0, page_sz, PROT_READ | PROT_WRITE,
> > +                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 
> Is this a vaild assignment?
> 
> I can't remember whether C's "pointer values must be correctly aligned"
> rule applies only to dereferences, or whether it applies to conversions
> too.  From memory I have a feeling that it does.
> 
> If so, the compiler could legimitately optimise the failure check away,
> since MAP_FAILED is not correctly aligned for unsigned long.

I'm not going to dig into standards ;). I can change this to an unsigned
char *.

> > +            printf("Expecting SIGSEGV...\n");
> > +            a[2] = 0xdead;
> > +
> > +            /* this should not be printed in the PR_MTE_TCF_SYNC mode */
> > +            printf("...done\n");
> > +
> > +            return 0;
> > +    }
> 
> Since this shouldn't happen, can we print an error and return nonzero?

Fair enough. I also agree with the other points you raised but to which
I haven't explicitly commented.

Thanks for the review, really useful.
Dave Martin May 4, 2020, 4:46 p.m. UTC | #3
On Thu, Apr 30, 2020 at 05:23:17PM +0100, Catalin Marinas wrote:
> On Wed, Apr 29, 2020 at 05:47:05PM +0100, Dave P Martin wrote:
> > On Tue, Apr 21, 2020 at 03:26:03PM +0100, Catalin Marinas wrote:
> > > +Userspace Support
> > > +=================
> > > +
> > > +When ``CONFIG_ARM64_MTE`` is selected and Memory Tagging Extension is
> > > +supported by the hardware, the kernel advertises the feature to
> > > +userspace via ``HWCAP2_MTE``.
> > > +
> > > +PROT_MTE
> > > +--------
> > > +
> > > +To access the allocation tags, a user process must enable the Tagged
> > > +memory attribute on an address range using a new ``prot`` flag for
> > > +``mmap()`` and ``mprotect()``:
> > > +
> > > +``PROT_MTE`` - Pages allow access to the MTE allocation tags.
> > > +
> > > +The allocation tag is set to 0 when such pages are first mapped in the
> > > +user address space and preserved on copy-on-write. ``MAP_SHARED`` is
> > > +supported and the allocation tags can be shared between processes.
> > > +
> > > +**Note**: ``PROT_MTE`` is only supported on ``MAP_ANONYMOUS`` and
> > > +RAM-based file mappings (``tmpfs``, ``memfd``). Passing it to other
> > > +types of mapping will result in ``-EINVAL`` returned by these system
> > > +calls.
> > > +
> > > +**Note**: The ``PROT_MTE`` flag (and corresponding memory type) cannot
> > > +be cleared by ``mprotect()``.
> > 
> > What enforces this?  I don't have my head fully around the code yet.
> > 
> > I'm wondering whether attempting to clear PROT_MTE should be reported as
> > an error.  Is there any rationale for not doing so?
> 
> A use-case is a JIT compiler where the memory is allocated by some
> malloc() code with PROT_MTE set and passed down to a code generator
> library which may not be MTE aware (and doesn't need to be, only tagged
> ptr aware). Such library, once it generated the code, may do an
> mprotect(PROT_READ|PROT_EXEC) without PROT_MTE. We didn't want to
> inadvertently clear PROT_MTE, especially if the memory will be given
> back to the original allocator (free) at some point.
> 
> Basically mprotect() may be done outside the heap allocator but it
> should not interfere with allocator's decision to use MTE. For this
> reason, I wouldn't report an error but silently ignore the lack of
> PROT_MTE.
> 
> The way we handle this is by not including VM_MTE in VM_ARCH_CLEAR
> (VM_MPX isn't either, though VM_SPARC_ADI is but when they added it, the
> syscall ABI didn't even accept tagged pointers).

OK, I think this makes sense.

For BTI, I think mprotect() will clear PROT_BTI unless it's included in
prot, but that's a bit different: PROT_BTI relates to the memory
contents (i.e., it's BTI-aware code), whereas PROT_MTE is a property of
the memory itself.

> > > +Tag Check Faults
> > > +----------------
> > > +
> > > +When ``PROT_MTE`` is enabled on an address range and a mismatch between
> > > +the logical and allocation tags occurs on access, there are three
> > > +configurable behaviours:
> > > +
> > > +- *Ignore* - This is the default mode. The CPU (and kernel) ignores the
> > > +  tag check fault.
> > > +
> > > +- *Synchronous* - The kernel raises a ``SIGSEGV`` synchronously, with
> > > +  ``.si_code = SEGV_MTESERR`` and ``.si_addr = <fault-address>``. The
> > > +  memory access is not performed.
> > 
> > Also say that if in this case, if SIGSEGV is ignored or blocked by the
> > offending thread then containing processes is terminated with a coredump
> > (at least, that's what ought to happen).
> 
> Makes sense.
> 
> > > +
> > > +- *Asynchronous* - The kernel raises a ``SIGSEGV``, in the current
> > > +  thread, asynchronously following one or multiple tag check faults,
> > > +  with ``.si_code = SEGV_MTEAERR`` and ``.si_addr = 0``.
> > 
> > For "current thread": that's a kernel concept.  For user-facing
> > documentation, can we say "the offending thread" or similar?
> > 
> > For clarity, it's worth saying that the faulting address is not
> > reported.  Or, we could be optimistic that someday this information will
> > be available and say that si_addr is the faulting address if available,
> > with 0 meaning the address is not available.
> > 
> > Maybe (void *)-1 would be better duff address, but I can't see it
> > mattering much.  If there's already precedent for si_addr==0 elsewhere,
> > it makes sense to follow it.
> 
> At a quick grep, I can see a few instances on other architectures where
> si_addr==0. I'll add a comment here.

OK, cool

Except: what if we're in PR_MTE_TCF_ASYNC mode?  If the SIGSEGV handler
triggers an asynchronous MTE fault itself, we could then get into a
spin.  Hmm.

I take it we drain any pending MTE faults when crossing EL boundaries?
In that case, an asynchronous MTE fault pending at sigreturn must have
been caused by the signal handler.  We could make that particular case
of MTE_AERR a force_sig.

> If the hardware gives us something in the future, it will likely be in a
> separate register and we can present it as a new sigcontext structure.
> In the meantime I'll add a some text that the faulting address is
> unknown.

I guess we can decide that later.  I think that if we can put something
sensible in si_addr we should do so, but that doesn't stop us also
putting more detailed info somewhere else.

> 
> > > +**Note**: There are no *match-all* logical tags available for user
> > > +applications.
> > 
> > This note seems misplaced.
> 
> This was in the context of tag checking. I'll move it further down when
> talking about PSTATE.TCO.

OK

> > > +
> > > +The user can select the above modes, per thread, using the
> > > +``prctl(PR_SET_TAGGED_ADDR_CTRL, flags, 0, 0, 0)`` system call where
> > 
> > PR_GET_TAGGED_ADDR_CTRL seems to be missing here.
> 
> Added.
> 
> > > +``flags`` contain one of the following values in the ``PR_MTE_TCF_MASK``
> > > +bit-field:
> > > +
> > > +- ``PR_MTE_TCF_NONE``  - *Ignore* tag check faults
> > > +- ``PR_MTE_TCF_SYNC``  - *Synchronous* tag check fault mode
> > > +- ``PR_MTE_TCF_ASYNC`` - *Asynchronous* tag check fault mode
> > 
> > Done naively, this will destroy the PR_MTE_TAG_MASK field.  Is there a
> > preferred way to change only parts of this control word?  If the answer
> > is "cache the value in userspace if you care about performance, or
> > otherwise use PR_GET_TAGGED_ADDR_CTRL as part of a read-modify-write,"
> > so be it.
> > 
> > If we think this might be an issue for software, it might be worth
> > splitting out separate prctls for each field.)
> 
> We lack some feedback from user space people on how this prctl is going
> to be used. I worked on the assumption that it is a one-off event during
> libc setup, potentially driven by some environment variable (but that's
> user's problem).
> 
> There were some suggestions that on an async SIGSEGV, the handler may
> switch to synchronous mode. Since that's a rare event, a get/set
> approach would be fine.
> 
> Anyway, with an additional argument to prctl (we have 3 spare), we could
> do a set/clear mask approach. The current behaviour could be emulated
> as:
> 
>   prctl(PR_SET_TAGGED_ADDR_CTRL, PR_MTE_bits, -1UL, 0, 0);
> 
> where -1 is the clear mask. The mask can be 0 for the initial prctl() or
> we can say that if the mask is non-zero, only the bits in the mask will
> be set.
> 
> If you want to only set the TCF bits:
> 
>   prctl(PR_SET_TAGGED_ADDR_CTRL, PR_MTE_TCF_SYNC, PR_MTE_TCF_MASK, 0, 0);

If this isn't critical path, I guess it's not a big deal either way.

If we make that mask argument a mask of bits _not_ to change then we
can add it as a backwards-compatible extension later on without having
to define it now.  As you suggest, it may never matter.

So, I don't object to this staying as-is.

> > > +Tag checking can also be disabled for a user thread by setting the
> > > +``PSTATE.TCO`` bit with ``MSR TCO, #1``.
> > 
> > Users should probably not touch this unless they know what they're
> > doing -- should this flag ever be left set across function boundaries
> > etc.?
> 
> We can't control function boundaries from the kernel anyway.
> 
> > What's it for?  Temporarily masking MTE faults in critical sections?
> > Is this self-synchronising... what happens to pending asynchronous
> > faults?  Are faults occurring while the flag is set pended or discarded?
> 
> Something like a garbage collector scanning the memory. Since we do not
> allow tag 0 as a match-all, it needs a cheaper option than prctl().
> 
> > > +**Note**: Signal handlers are always invoked with ``PSTATE.TCO = 0``,
> > > +irrespective of the interrupted context.
> > 
> > Rationale?  Do we have advice on what signal handlers should do?
> 
> Well, that's the default mode - tag check override = 0, it means that
> tag checking takes place.

Sort of implies that a SIGSEGV handler must be careful not to trigger
any more faults.  But I guess that's nothing new.

> 
> > Is PSTATE.TC0 restored by sigreturn?
> 
> s/TC0/TCO/
> 
> Yes, it is restored on sigreturn.

OK.  I think it's worth mentioning (does no harm, anyway).

> 
> > > +**Note**: Kernel accesses to user memory (e.g. ``read()`` system call)
> > > +are only checked if the current thread tag checking mode is
> > > +PR_MTE_TCF_SYNC.
> > 
> > Vague?  Can we make a precise statement about when the kernel will and
> > won't check such accesses?  And aren't there limitations (like use of
> > get_user_pages() etc.)?
> 
> We could make it slightly clearer by say "kernel accesses to the user
> address space".

That's not the ambiguity.

My question is

1) Does the kernel guarantee not to check tags on kernel accesses to user memory without PR_MTE_TCF_SYNC?

2) Does the kernel guarantee to check tags on kernel accesses to user memory with PR_MTE_TCF_SYNC?


In practice, this note sounds to be more like a kernel implementation
detail rather than advice to userspace.

Would it make sense to say something like:

 * PR_MTE_TCF_NONE: the kernel does not check tags for kernel accesses
   to user memory done by syscalls in the thread.

 * PR_MTE_TCF_ASYNC: the kernel may check some tags for kernel accesses
   to user memory done by syscalls.  (Should we guarantee that such
   faults are reported synchronously on syscall exit?  In practice I
   think they are.  Should we use SEGV_MTESERR in this case?  Perhaps
   it's not worth making this a special case.)
   
 * PR_MTE_TCF_SYNC: the kernel makes best efforts to check tags for
   kernel accesses to user memory done by the syscalls, but does not
   guarantee to check everything (or does it?  I thought we can't really
   do that for some odd cases...)

> > > +Excluding Tags in the ``IRG``, ``ADDG`` and ``SUBG`` instructions
> > > +-----------------------------------------------------------------
> > > +
> > > +The architecture allows excluding certain tags to be randomly generated
> > > +via the ``GCR_EL1.Exclude`` register bit-field. By default, Linux
> > 
> > Can we have a separate section on what execve() and fork()/clone() do
> > to the MTE controls and PSTATE.TCO?  "By default" could mean a variety
> > of things, and I'm not sure we cover everything.
> 
> Good point. I'll add a note on initial state for processes and threads.
> 
> > Is PROT_MTE ever set on the initial pages mapped by execve()?
> 
> No. There were discussions about mapping the initial stack with PROT_MTE
> based on some ELF note but it can also be done in userspace with
> mprotect(). I think we concluded that the .data/.bss sections will be
> untagged.

Yes, I recall.  Sounds fine: probably worth mentioning here that
PROT_MTE is never set on the exec mappings for now.

> > > +excludes all tags other than 0. A user thread can enable specific tags
> > > +in the randomly generated set using the ``prctl(PR_SET_TAGGED_ADDR_CTRL,
> > > +flags, 0, 0, 0)`` system call where ``flags`` contains the tags bitmap
> > > +in the ``PR_MTE_TAG_MASK`` bit-field.
> > > +
> > > +**Note**: The hardware uses an exclude mask but the ``prctl()``
> > > +interface provides an include mask. An include mask of ``0`` (exclusion
> > > +mask ``0xffff``) results in the CPU always generating tag ``0``.
> > 
> > Is there no way to make this default to 1 rather than having a magic
> > meaning for 0?
> 
> We follow the hardware behaviour where 0xffff and 0xfffe give the same
> result.

Exposing this through a purely software interface seems a bit odd:
because the exclude mask is privileged-access-only, the architecture
could amend it to assign a different meaning to 0xffff, providing this
was an opt-in change.  Then we'd have to make a mess here.

Can't we just forbid the nonsense value 0 here, or are there other
reasons why that's problematic?

I presume the architecture defines a meaning for 0 to avoid making
it UNPREDICTABLE etc., not because this is deemed useful.

> > > +The ``ptrace()`` interface
> > > +--------------------------
> > > +
> > > +``PTRACE_PEEKMTETAGS`` and ``PTRACE_POKEMTETAGS`` allow a tracer to read
> > > +the tags from or set the tags to a tracee's address space. The
> > > +``ptrace()`` syscall is invoked as ``ptrace(request, pid, addr, data)``
> > > +where:
> > > +
> > > +- ``request`` - one of ``PTRACE_PEEKMTETAGS`` or ``PTRACE_POKEMTETAGS``.
> > > +- ``pid`` - the tracee's PID.
> > > +- ``addr`` - address in the tracee's address space.
> > 
> > What if addr is not 16-byte aligned?  Is this considered valid use?
> 
> Yes, I don't think we should impose a restriction here. Each address in
> a 16-byte range has the same (shared) tag.

OK.  We might want to clarify what this means when addr is misaligned:
we do not colour the 16 bytes starting at addr, but the reader might
assume that's what happens.

> > > +- ``data`` - pointer to a ``struct iovec`` where ``iov_base`` points to
> > > +  a buffer of ``iov_len`` length in the tracer's address space.
> > 
> > What's the data format for the copied tags?
> 
> I could state that the tag are placed in the lower 4-bit of the byte
> with the upper 4-bit set to 0.

What if it's not?  I didn't find this in the architecture spec, but I
didn't look very hard so far...

> > > +The tags in the tracer's ``iov_base`` buffer are represented as one tag
> > > +per byte and correspond to a 16-byte MTE tag granule in the tracee's
> > > +address space.
> > 
> > We could say that the whole operation accesses the tags for 16 * iov_len
> > bytes of the tracee's address space.  Maybe superfluous though.
> > 
> > > +
> > > +``ptrace()`` return value:
> > > +
> > > +- 0 - success, the tracer's ``iov_len`` was updated to the number of
> > > +  tags copied (it may be smaller than the requested ``iov_len`` if the
> > > +  requested address range in the tracee's or the tracer's space cannot
> > > +  be fully accessed).
> > 
> > I'd replace "success" with something like "some tags were copied:
> > ``iov_len`` is updated to indicate the actual number of tags
> > transferred.  This may be fewer than requested: [...]"
> > 
> > Can we get a short PEEKTAGS/POKETAGS for transient reasons (like minor
> > page faults)?  i.e., should the caller attempt to retry, or is that a
> > a stupid thing to do?
> 
> I initially thought it should retry but managed to get the interface so
> that no retries are needed. If fewer tags were transferred, it's for a
> good reason (e.g. permission fault).

OK, we should mention that here then.  Software that retries things that
can't make progress can get stuck in a loop (or at least waste cycles).

> > > +            a = mmap(0, page_sz, PROT_READ | PROT_WRITE,
> > > +                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > 
> > Is this a vaild assignment?
> > 
> > I can't remember whether C's "pointer values must be correctly aligned"
> > rule applies only to dereferences, or whether it applies to conversions
> > too.  From memory I have a feeling that it does.
> > 
> > If so, the compiler could legimitately optimise the failure check away,
> > since MAP_FAILED is not correctly aligned for unsigned long.
> 
> I'm not going to dig into standards ;). I can change this to an unsigned
> char *.

Sure, I guess that solves the problem.

Something like

	void *p;
	unsigned long *a;

	p = mmap( ... );
	if (p == MAP_FAILED) {
		/* barf */
	}

	a = p;

might provide a clue that care is needed, but it's not essential.

> 
> > > +            printf("Expecting SIGSEGV...\n");
> > > +            a[2] = 0xdead;
> > > +
> > > +            /* this should not be printed in the PR_MTE_TCF_SYNC mode */
> > > +            printf("...done\n");
> > > +
> > > +            return 0;
> > > +    }
> > 
> > Since this shouldn't happen, can we print an error and return nonzero?
> 
> Fair enough. I also agree with the other points you raised but to which
> I haven't explicitly commented.
> 
> Thanks for the review, really useful.

Np

Cheers
---Dave
Szabolcs Nagy May 5, 2020, 10:32 a.m. UTC | #4
The 04/21/2020 15:26, Catalin Marinas wrote:
> diff --git a/Documentation/arm64/memory-tagging-extension.rst b/Documentation/arm64/memory-tagging-extension.rst
> new file mode 100644
> index 000000000000..f82dfbd70061
> --- /dev/null
> +++ b/Documentation/arm64/memory-tagging-extension.rst
> @@ -0,0 +1,260 @@
> +===============================================
> +Memory Tagging Extension (MTE) in AArch64 Linux
> +===============================================
> +
> +Authors: Vincenzo Frascino <vincenzo.frascino@arm.com>
> +         Catalin Marinas <catalin.marinas@arm.com>
> +
> +Date: 2020-02-25
> +
> +This document describes the provision of the Memory Tagging Extension
> +functionality in AArch64 Linux.
> +
> +Introduction
> +============
> +
> +ARMv8.5 based processors introduce the Memory Tagging Extension (MTE)
> +feature. MTE is built on top of the ARMv8.0 virtual address tagging TBI
> +(Top Byte Ignore) feature and allows software to access a 4-bit
> +allocation tag for each 16-byte granule in the physical address space.
> +Such memory range must be mapped with the Normal-Tagged memory
> +attribute. A logical tag is derived from bits 59-56 of the virtual
> +address used for the memory access. A CPU with MTE enabled will compare
> +the logical tag against the allocation tag and potentially raise an
> +exception on mismatch, subject to system registers configuration.
> +
> +Userspace Support
> +=================
> +
> +When ``CONFIG_ARM64_MTE`` is selected and Memory Tagging Extension is
> +supported by the hardware, the kernel advertises the feature to
> +userspace via ``HWCAP2_MTE``.
> +
> +PROT_MTE
> +--------
> +
> +To access the allocation tags, a user process must enable the Tagged
> +memory attribute on an address range using a new ``prot`` flag for
> +``mmap()`` and ``mprotect()``:
> +
> +``PROT_MTE`` - Pages allow access to the MTE allocation tags.
> +
> +The allocation tag is set to 0 when such pages are first mapped in the
> +user address space and preserved on copy-on-write. ``MAP_SHARED`` is
> +supported and the allocation tags can be shared between processes.
> +
> +**Note**: ``PROT_MTE`` is only supported on ``MAP_ANONYMOUS`` and
> +RAM-based file mappings (``tmpfs``, ``memfd``). Passing it to other
> +types of mapping will result in ``-EINVAL`` returned by these system
> +calls.
> +
> +**Note**: The ``PROT_MTE`` flag (and corresponding memory type) cannot
> +be cleared by ``mprotect()``.

i think there are some non-obvious madvise operations that may
be worth documenting too for mte specific semantics.

e.g. MADV_DONTNEED or MADV_FREE can presumably drop tags which
means that existing pointers can no longer write to the memory
which is a change of behaviour compared to the non-mte case.
(affects most malloc implementations that will have to deal
with this when implementing heap coloring) there might be other
similar problems like MADV_WIPEONFORK that wont work as
currently expected when mte is enabled.

if such behaviour changes cause serious problems to existing
software there may be a need to have a way to opt out from
these changes (e.g. MADV_ flag variant that only affects the
memory content but not the tags) or to make that the default
behaviour. (but i can't tell how widely these are used in
ways that can be expected to work with PROT_MTE)


> +Tag Check Faults
> +----------------
> +
> +When ``PROT_MTE`` is enabled on an address range and a mismatch between
> +the logical and allocation tags occurs on access, there are three
> +configurable behaviours:
> +
> +- *Ignore* - This is the default mode. The CPU (and kernel) ignores the
> +  tag check fault.
> +
> +- *Synchronous* - The kernel raises a ``SIGSEGV`` synchronously, with
> +  ``.si_code = SEGV_MTESERR`` and ``.si_addr = <fault-address>``. The
> +  memory access is not performed.
> +
> +- *Asynchronous* - The kernel raises a ``SIGSEGV``, in the current
> +  thread, asynchronously following one or multiple tag check faults,
> +  with ``.si_code = SEGV_MTEAERR`` and ``.si_addr = 0``.
> +
> +**Note**: There are no *match-all* logical tags available for user
> +applications.
> +
> +The user can select the above modes, per thread, using the
> +``prctl(PR_SET_TAGGED_ADDR_CTRL, flags, 0, 0, 0)`` system call where
> +``flags`` contain one of the following values in the ``PR_MTE_TCF_MASK``
> +bit-field:
> +
> +- ``PR_MTE_TCF_NONE``  - *Ignore* tag check faults
> +- ``PR_MTE_TCF_SYNC``  - *Synchronous* tag check fault mode
> +- ``PR_MTE_TCF_ASYNC`` - *Asynchronous* tag check fault mode
> +
> +Tag checking can also be disabled for a user thread by setting the
> +``PSTATE.TCO`` bit with ``MSR TCO, #1``.
> +
> +**Note**: Signal handlers are always invoked with ``PSTATE.TCO = 0``,
> +irrespective of the interrupted context.
> +
> +**Note**: Kernel accesses to user memory (e.g. ``read()`` system call)
> +are only checked if the current thread tag checking mode is
> +PR_MTE_TCF_SYNC.
Catalin Marinas May 5, 2020, 5:30 p.m. UTC | #5
On Tue, May 05, 2020 at 11:32:33AM +0100, Szabolcs Nagy wrote:
> The 04/21/2020 15:26, Catalin Marinas wrote:
> > +PROT_MTE
> > +--------
> > +
> > +To access the allocation tags, a user process must enable the Tagged
> > +memory attribute on an address range using a new ``prot`` flag for
> > +``mmap()`` and ``mprotect()``:
> > +
> > +``PROT_MTE`` - Pages allow access to the MTE allocation tags.
> > +
> > +The allocation tag is set to 0 when such pages are first mapped in the
> > +user address space and preserved on copy-on-write. ``MAP_SHARED`` is
> > +supported and the allocation tags can be shared between processes.
> > +
> > +**Note**: ``PROT_MTE`` is only supported on ``MAP_ANONYMOUS`` and
> > +RAM-based file mappings (``tmpfs``, ``memfd``). Passing it to other
> > +types of mapping will result in ``-EINVAL`` returned by these system
> > +calls.
> > +
> > +**Note**: The ``PROT_MTE`` flag (and corresponding memory type) cannot
> > +be cleared by ``mprotect()``.
> 
> i think there are some non-obvious madvise operations that may
> be worth documenting too for mte specific semantics.
> 
> e.g. MADV_DONTNEED or MADV_FREE can presumably drop tags which
> means that existing pointers can no longer write to the memory
> which is a change of behaviour compared to the non-mte case.
> (affects most malloc implementations that will have to deal
> with this when implementing heap coloring) there might be other
> similar problems like MADV_WIPEONFORK that wont work as
> currently expected when mte is enabled.
> 
> if such behaviour changes cause serious problems to existing
> software there may be a need to have a way to opt out from
> these changes (e.g. MADV_ flag variant that only affects the
> memory content but not the tags) or to make that the default
> behaviour. (but i can't tell how widely these are used in
> ways that can be expected to work with PROT_MTE)

Thanks. I'll document this behaviour as it may not be obvious.

For the record (as we discussed this internally), I think the kernel
behaviour is entirely expected. On mmap(PROT_MTE), the kernel would
return pages with tags set to 0. On madvise(MADV_DONTNEED), the kernel
may free the pages but map them back on access under the same conditions
as when they were previously given to the user, i.e. tags set to 0. There
is no expectation for the kernel to preserve the tags of
MADV_DONTNEED/FREE pages (which defeats the point of dontneed/free).
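
To make the behaviour concrete, a minimal sketch (reusing PROT_MTE,
insert_random_tag() and set_tag() from the example program in the patch,
and assuming the thread is in the PR_MTE_TCF_SYNC mode with page mapped
PROT_MTE):

    #include <stddef.h>
    #include <sys/mman.h>

    static void dontneed_drops_tags(unsigned char *page, size_t page_sz)
    {
            unsigned char *tagged;

            /* colour the first granule and access it via a matching pointer */
            tagged = (unsigned char *)insert_random_tag(page);
            set_tag(tagged);
            tagged[0] = 1;                          /* tags match, no fault */

            madvise(page, page_sz, MADV_DONTNEED);  /* drops the data and the tags */

            /*
             * The page is faulted back in with all tags set to 0, so this
             * access through the previously coloured pointer is expected
             * to raise a synchronous tag check fault.
             */
            tagged[0] = 1;
    }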
Catalin Marinas May 11, 2020, 4:40 p.m. UTC | #6
On Mon, May 04, 2020 at 05:46:17PM +0100, Dave P Martin wrote:
> On Thu, Apr 30, 2020 at 05:23:17PM +0100, Catalin Marinas wrote:
> > On Wed, Apr 29, 2020 at 05:47:05PM +0100, Dave P Martin wrote:
> > > On Tue, Apr 21, 2020 at 03:26:03PM +0100, Catalin Marinas wrote:
> > > > +- *Asynchronous* - The kernel raises a ``SIGSEGV``, in the current
> > > > +  thread, asynchronously following one or multiple tag check faults,
> > > > +  with ``.si_code = SEGV_MTEAERR`` and ``.si_addr = 0``.
> > > 
> > > For "current thread": that's a kernel concept.  For user-facing
> > > documentation, can we say "the offending thread" or similar?
> > > 
> > > For clarity, it's worth saying that the faulting address is not
> > > reported.  Or, we could be optimistic that someday this information will
> > > be available and say that si_addr is the faulting address if available,
> > > with 0 meaning the address is not available.
> > > 
> > > Maybe (void *)-1 would be better duff address, but I can't see it
> > > mattering much.  If there's already precedent for si_addr==0 elsewhere,
> > > it makes sense to follow it.
> > 
> > At a quick grep, I can see a few instances on other architectures where
> > si_addr==0. I'll add a comment here.
> 
> OK, cool
> 
> Except: what if we're in PR_MTE_TCF_ASYNC mode.  If the SIGSEGV handler
> triggers an asynchronous MTE fault itself, we could then get into a
> spin.  Hmm.

How do we handle standard segfaults here? Presumably a signal handler
can trigger a SIGSEGV itself.

> I take it we drain any pending MTE faults when crossing EL boundaries?

We clear the hardware bit on entry to EL1 from EL0 and set a TIF flag.

> In that case, an asynchronous MTE fault pending at sigreturn must have
> been caused by the signal handler.  We could make that particular case
> of MTE_AERR a force_sig.

We clear the TIF flag when delivering the signal. I don't think there is
a way for the kernel to detect when it is running in a signal handler.
sigreturn() is not mandatory either.

> > > > +**Note**: Kernel accesses to user memory (e.g. ``read()`` system call)
> > > > +are only checked if the current thread tag checking mode is
> > > > +PR_MTE_TCF_SYNC.
> > > 
> > > Vague?  Can we make a precise statement about when the kernel will and
> > > won't check such accesses?  And aren't there limitations (like use of
> > > get_user_pages() etc.)?
> > 
> > We could make it slightly clearer by say "kernel accesses to the user
> > address space".
> 
> That's not the ambiguity.
> 
> My question is
> 
> 1) Does the kernel guarantee not to check tags on kernel accesses to
> user memory without PR_MTE_TCF_SYNC?

For ASYNC and NONE, yes, we can guarantee this.

> 2) Does the kernel guarantee to check tags on kernel accesses to user
> memory with PR_MTE_TCF_SYNC?

I'd say yes but it depends on how much knowledge one has about the
syscall implementation. If it's an access via the user address directly,
it would be checked. If it goes via get_user_pages(), it won't. Since the
user doesn't need to have knowledge of the kernel internals, you are
right that we don't guarantee this.

> In practice, this note sounds to be more like a kernel implementation
> detail rather than advice to userspace.
> 
> Would it make sense to say something like:
> 
>  * PR_MTE_TCF_NONE: the kernel does not check tags for kernel accesses
>    to user memory done by syscalls in the thread.
> 
>  * PR_MTE_TCF_ASYNC: the kernel may check some tags for kernel accesses
>    to user memory done by syscalls.  (Should we guarantee that such
>    faults are reported synchronously on syscall exit?  In practice I
>    think they are.  Should we use SEGV_MTESERR in this case?  Perhaps
>    it's not worth making this a special case.)

Both NONE and ASYNC are now the same for kernel uaccess - not checked.

For background information, I decided against ASYNC uaccess checking
since (1) there are some cases where the kernel overreads
(strncpy_from_user) and (2) we don't normally generate SIGSEGV on
uaccess but rather return -EFAULT. The latter is not possible to contain
since we only learn about the fault asynchronously, usually after the
transfer.

>  * PR_MTE_TCF_SYNC: the kernel makes best efforts to check tags for
>    kernel accesses to user memory done by the syscalls, but does not
>    guarantee to check everything (or does it?  I thought we can't really
>    do that for some odd cases...)

It doesn't. I'll add some notes along the lines of your text above.

> > > > +excludes all tags other than 0. A user thread can enable specific tags
> > > > +in the randomly generated set using the ``prctl(PR_SET_TAGGED_ADDR_CTRL,
> > > > +flags, 0, 0, 0)`` system call where ``flags`` contains the tags bitmap
> > > > +in the ``PR_MTE_TAG_MASK`` bit-field.
> > > > +
> > > > +**Note**: The hardware uses an exclude mask but the ``prctl()``
> > > > +interface provides an include mask. An include mask of ``0`` (exclusion
> > > > +mask ``0xffff``) results in the CPU always generating tag ``0``.
> > > 
> > > Is there no way to make this default to 1 rather than having a magic
> > > meaning for 0?
> > 
> > We follow the hardware behaviour where 0xffff and 0xfffe give the same
> > result.
> 
> Exposing this through a purely software interface seems a bit odd:
> because the exclude mask is privileged-access-only, the architecture
> could amend it to assign a different meaning to 0xffff, providing this
> was an opt-in change.  Then we'd have to make a mess here.

You have a point. An include mask of 0 translates to an exclude mask of
0xffff as per the current patches. If the hardware gains support for one
more bit (32 colours), old software running on new hardware may run into
unexpected results with an exclude mask of 0xffff.

> Can't we just forbid the nonsense value 0 here, or are there other
> reasons why that's problematic?

It was just easier to start with a default. I wonder whether we should
actually switch back to the exclude mask, as per the hardware
definition. This way 0 would mean all tags allowed. We can still
disallow 0xffff as an exclude mask.
Dave Martin May 13, 2020, 3:48 p.m. UTC | #7
On Mon, May 11, 2020 at 05:40:19PM +0100, Catalin Marinas wrote:
> On Mon, May 04, 2020 at 05:46:17PM +0100, Dave P Martin wrote:
> > On Thu, Apr 30, 2020 at 05:23:17PM +0100, Catalin Marinas wrote:
> > > On Wed, Apr 29, 2020 at 05:47:05PM +0100, Dave P Martin wrote:
> > > > On Tue, Apr 21, 2020 at 03:26:03PM +0100, Catalin Marinas wrote:
> > > > > +- *Asynchronous* - The kernel raises a ``SIGSEGV``, in the current
> > > > > +  thread, asynchronously following one or multiple tag check faults,
> > > > > +  with ``.si_code = SEGV_MTEAERR`` and ``.si_addr = 0``.
> > > > 
> > > > For "current thread": that's a kernel concept.  For user-facing
> > > > documentation, can we say "the offending thread" or similar?
> > > > 
> > > > For clarity, it's worth saying that the faulting address is not
> > > > reported.  Or, we could be optimistic that someday this information will
> > > > be available and say that si_addr is the faulting address if available,
> > > > with 0 meaning the address is not available.
> > > > 
> > > > Maybe (void *)-1 would be better duff address, but I can't see it
> > > > mattering much.  If there's already precedent for si_addr==0 elsewhere,
> > > > it makes sense to follow it.
> > > 
> > > At a quick grep, I can see a few instances on other architectures where
> > > si_addr==0. I'll add a comment here.
> > 
> > OK, cool
> > 
> > Except: what if we're in PR_MTE_TCF_ASYNC mode.  If the SIGSEGV handler
> > triggers an asynchronous MTE fault itself, we could then get into a
> > spin.  Hmm.
> 
> How do we handle standard segfaults here? Presumably a signal handler
> can trigger a SIGSEGV itself.

This is similar to the problem of a data abort inside the data abort
handler.  It can of course happen, but if you don't want this to be
fatal then you code the handler carefully so this can't happen.

> > I take it we drain any pending MTE faults when crossing EL boundaries?
> 
> We clear the hardware bit on entry to EL1 from EL0 and set a TIF flag.
> 
> > In that case, an asynchronous MTE fault pending at sigreturn must have
> > been caused by the signal handler.  We could make that particular case
> > of MTE_AERR a force_sig.
> 
> We clear the TIF flag when delivering the signal. I don't think there is
> a way for the kernel to detect when it is running in a signal handler.
> sigreturn() is not mandatory either.

I guess we can put up with this signal not being fatal then.

If you have a SEGV handler at all, you're supposed to code it carefully.

This brings us back to force_sig for SERR and a normal signal for AERR.
That's probably OK.

> 
> > > > > +**Note**: Kernel accesses to user memory (e.g. ``read()`` system call)
> > > > > +are only checked if the current thread tag checking mode is
> > > > > +PR_MTE_TCF_SYNC.
> > > > 
> > > > Vague?  Can we make a precise statement about when the kernel will and
> > > > won't check such accesses?  And aren't there limitations (like use of
> > > > get_user_pages() etc.)?
> > > 
> > > We could make it slightly clearer by say "kernel accesses to the user
> > > address space".
> > 
> > That's not the ambiguity.
> > 
> > My question is
> > 
> > 1) Does the kernel guarantee not to check tags on kernel accesses to
> > user memory without PR_MTE_TCF_SYNC?
> 
> For ASYNC and NONE, yes, we can guarantee this.
> 
> > 2) Does the kernel guarantee to check tags on kernel accesses to user
> > memory with PR_MTE_TCF_SYNC?
> 
> I'd say yes but it depends on how much knowledge one has about the
> syscall implementation. If it's access to user address directly, it
> would be checked. If it goes via get_user_pages(), it won't. Since the
> user doesn't need to have knowledge of the kernel internals, you are
> right that we don't guarantee this.

So, from userspace it's not guaranteed.

This is what I'd describe as "making best efforts", but not a guarantee.

> > In practice, this note sounds to be more like a kernel implementation
> > detail rather than advice to userspace.
> > 
> > Would it make sense to say something like:
> > 
> >  * PR_MTE_TCF_NONE: the kernel does not check tags for kernel accesses
> >    to user memory done by syscalls in the thread.
> > 
> >  * PR_MTE_TCF_ASYNC: the kernel may check some tags for kernel accesses
> >    to user memory done by syscalls.  (Should we guarantee that such
> >    faults are reported synchronously on syscall exit?  In practice I
> >    think they are.  Should we use SEGV_MTESERR in this case?  Perhaps
> >    it's not worth making this a special case.)
> 
> Both NONE and ASYNC are now the same for kernel uaccess - not checked.
>
> For background information, I decided against ASYNC uaccess checking
> since (1) there are some cases where the kernel overreads
> (strncpy_from_user) and (2) we don't normally generate SIGSEGV on
> uaccess but rather return -EFAULT. The latter is not possible to contain
> since we only learn about the fault asynchronously, usually after the
> transfer.

I may be missing something here.  Do we still rely on the hardware to
detect tag mismatches in kernel accesses to user memory?  I was assuming
we do some kind of explicit checking, but now I think that's nonsense
(except for get_user_pages() etc.)


Since MTE is a new opt-in feature, I think we might have the option to
report failures with SIGSEGV instead of -EFAULT.  This seems exactly to
implement the concept of an asynchronous versus synchronous error. 

The kernel may not normally do this, but software usually doesn't use
raw syscalls.  In reality "syscalls" can trigger a SIGSEGV in the libc
wrapper anyway.  From the caller's point of view the whole thing is a
black box.

Probably needs discussion with the bionic / glibc folks though (though
likely this has been discussed already...)


My concern is that the spirit of asynchronous checking in the architecture
is that accesses _are_ checked, and we seem to be breaking that
principle here.

Although MTE's guarantees are statistical, based on small random numbers
not matching, this imperfection is quite different from systematically
not checking at all, ever, on certain major code paths.

> 
> >  * PR_MTE_TCF_SYNC: the kernel makes best efforts to check tags for
> >    kernel accesses to user memory done by the syscalls, but does not
> >    guarantee to check everything (or does it?  I thought we can't really
> >    do that for some odd cases...)
> 
> It doesn't. I'll add some notes along the lines of your text above.

OK

> > > > > +excludes all tags other than 0. A user thread can enable specific tags
> > > > > +in the randomly generated set using the ``prctl(PR_SET_TAGGED_ADDR_CTRL,
> > > > > +flags, 0, 0, 0)`` system call where ``flags`` contains the tags bitmap
> > > > > +in the ``PR_MTE_TAG_MASK`` bit-field.
> > > > > +
> > > > > +**Note**: The hardware uses an exclude mask but the ``prctl()``
> > > > > +interface provides an include mask. An include mask of ``0`` (exclusion
> > > > > +mask ``0xffff``) results in the CPU always generating tag ``0``.
> > > > 
> > > > Is there no way to make this default to 1 rather than having a magic
> > > > meaning for 0?
> > > 
> > > We follow the hardware behaviour where 0xffff and 0xfffe give the same
> > > result.
> > 
> > Exposing this through a purely software interface seems a bit odd:
> > because the exclude mask is privileged-access-only, the architecture
> > could amend it to assign a different meaning to 0xffff, providing this
> > was an opt-in change.  Then we'd have to make a mess here.
> 
> You have a point. An include mask of 0 translates to an exclude mask of
> 0xffff as per the current patches. If the hardware gains support for one
> more bit (32 colours), old software running on new hardware may run into
> unexpected results with an exclude mask of 0xffff.
> 
> > Can't we just forbid the nonsense value 0 here, or are there other
> > reasons why that's problematic?
> 
> It was just easier to start with a default. I wonder whether we should
> actually switch back to the exclude mask, as per the hardware
> definition. This way 0 would mean all tags allowed. We can still
> disallow 0xffff as an exclude mask.

If the number of bits might grow, I guess we can make the exclude mask
full-width.

For example, the hardware can trivially exclude tags 16 and up, because
they don't exist anyway.

Similarly, the hardware can trivially include tags 16 and up: inclusion
only means that the hardware is allowed to generate them, not that it
guarantees to.

The only configuration that doesn't make sense is "no tags allowed", so
I'd argue for explicitly blocking that, even if the architecture aliases
that encoding to something else.

If we prefer 0 as a default value so that init inherits the correct
value from the kernel without any special acrobatics, then we make it an
exclude mask, with the semantics that the hardware is allowed to
generate any of these tags, but does not have to be capable of
generating all of them.

Make sense?  This is bikeshedding from my end...


Cheers
---Dave
Catalin Marinas May 14, 2020, 11:37 a.m. UTC | #8
On Wed, May 13, 2020 at 04:48:46PM +0100, Dave P Martin wrote:
> On Mon, May 11, 2020 at 05:40:19PM +0100, Catalin Marinas wrote:
> > On Mon, May 04, 2020 at 05:46:17PM +0100, Dave P Martin wrote:
> > > On Thu, Apr 30, 2020 at 05:23:17PM +0100, Catalin Marinas wrote:
> > > > On Wed, Apr 29, 2020 at 05:47:05PM +0100, Dave P Martin wrote:
> > > > > On Tue, Apr 21, 2020 at 03:26:03PM +0100, Catalin Marinas wrote:
> > > > > > +- *Asynchronous* - The kernel raises a ``SIGSEGV``, in the current
> > > > > > +  thread, asynchronously following one or multiple tag check faults,
> > > > > > +  with ``.si_code = SEGV_MTEAERR`` and ``.si_addr = 0``.
> > > > > 
> > > > > For "current thread": that's a kernel concept.  For user-facing
> > > > > documentation, can we say "the offending thread" or similar?
> > > > > 
> > > > > For clarity, it's worth saying that the faulting address is not
> > > > > reported.  Or, we could be optimistic that someday this information will
> > > > > be available and say that si_addr is the faulting address if available,
> > > > > with 0 meaning the address is not available.
> > > > > 
> > > > > Maybe (void *)-1 would be better duff address, but I can't see it
> > > > > mattering much.  If there's already precedent for si_addr==0 elsewhere,
> > > > > it makes sense to follow it.
> > > > 
> > > > At a quick grep, I can see a few instances on other architectures where
> > > > si_addr==0. I'll add a comment here.
> > > 
> > > OK, cool
> > > 
> > > Except: what if we're in PR_MTE_TCF_ASYNC mode.  If the SIGSEGV handler
> > > triggers an asynchronous MTE fault itself, we could then get into a
> > > spin.  Hmm.
[...]
> > > In that case, an asynchronous MTE fault pending at sigreturn must have
> > > been caused by the signal handler.  We could make that particular case
> > > of MTE_AERR a force_sig.
> > 
> > We clear the TIF flag when delivering the signal. I don't think there is
> > a way for the kernel to detect when it is running in a signal handler.
> > sigreturn() is not mandatory either.
> 
> I guess we can put up with this signal not being fatal then.
> 
> If you have a SEGV handler at all, you're supposed to code it carefully.
> 
> This brings us back to force_sig for SERR and a normal signal for AERR.
> That's probably OK.

I think we are in agreement now but please check the patches when I post
the v4.

> > > > > > +**Note**: Kernel accesses to user memory (e.g. ``read()`` system call)
> > > > > > +are only checked if the current thread tag checking mode is
> > > > > > +PR_MTE_TCF_SYNC.
> > > > > 
> > > > > Vague?  Can we make a precise statement about when the kernel will and
> > > > > won't check such accesses?  And aren't there limitations (like use of
> > > > > get_user_pages() etc.)?
> > > > 
> > > > We could make it slightly clearer by say "kernel accesses to the user
> > > > address space".
> > > 
> > > That's not the ambiguity.
> > > 
> > > My question is
> > > 
> > > 1) Does the kernel guarantee not to check tags on kernel accesses to
> > > user memory without PR_MTE_TCF_SYNC?
[...]
> > > 2) Does the kernel guarantee to check tags on kernel accesses to user
> > > memory with PR_MTE_TCF_SYNC?
[...]
> > > In practice, this note sounds to be more like a kernel implementation
> > > detail rather than advice to userspace.
> > > 
> > > Would it make sense to say something like:
> > > 
> > >  * PR_MTE_TCF_NONE: the kernel does not check tags for kernel accesses
> > >    to user memory done by syscalls in the thread.
> > > 
> > >  * PR_MTE_TCF_ASYNC: the kernel may check some tags for kernel accesses
> > >    to user memory done by syscalls.  (Should we guarantee that such
> > >    faults are reported synchronously on syscall exit?  In practice I
> > >    think they are.  Should we use SEGV_MTESERR in this case?  Perhaps
> > >    it's not worth making this a special case.)
> > 
> > Both NONE and ASYNC are now the same for kernel uaccess - not checked.
> >
> > For background information, I decided against ASYNC uaccess checking
> > since (1) there are some cases where the kernel overreads
> > (strncpy_from_user) and (2) we don't normally generate SIGSEGV on
> > uaccess but rather return -EFAULT. The latter is not possible to contain
> > since we only learn about the fault asynchronously, usually after the
> > transfer.
> 
> I may be missing something here.  Do we still rely on the hardware to
> detect tag mismatches in kernel accesses to user memory?  I was assuming
> we do some kind of explicit checking, but now I think that's nonsense
> (except for get_user_pages() etc.)

For synchronous tag checking, we expect the uaccess (via the user
address, e.g. copy_from_user()) to be checked by the hardware. If the
access happens via a kernel mapping (get_user_pages()), the access is
unchecked. There is no point in an explicit tag access+check from the
kernel since the get_user_pages() accesses are not expected to generate
faults anyway (once the pages have been returned). We also most likely
lost the actual user address at the point of access, so it is not easy
to infer the original tag.

> Since MTE is a new opt-in feature, I think we might have the option to
> report failures with SIGSEGV instead of -EFAULT.  This seems exactly to
> implement the concept of an asynchronous versus synchronous error. 

With synchronous checking, we return -EFAULT, smaller number of bytes
etc. since no/less data was copied. With async, the uaccess would
perform all the accesses, only that the user may get a SIGSEGV delivered
on return from the syscall.
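
To make the user-visible difference concrete, a rough sketch of the
synchronous case (hypothetical helper, not from the patches):

    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>

    /* 'buf' should be a tagged pointer into a PROT_MTE mapping */
    ssize_t checked_read(int fd, void *buf, size_t len)
    {
            ssize_t n = read(fd, buf, len);

            /*
             * In PR_MTE_TCF_SYNC mode a tag mismatch on the kernel's
             * uaccess surfaces as -EFAULT (or a short read), not as a
             * SIGSEGV delivered to the thread.
             */
            if (n < 0 && errno == EFAULT)
                    fprintf(stderr, "uaccess fault (tag mismatch?)\n");

            return n;
    }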

> The kernel may not normally do this, but software usually doesn't use
> raw syscalls.  In reality "syscalls" can trigger a SIGSEGV in the libc
> wrapper anyway.  From the caller's point of view the whole thing is a
> black box.
> 
> Probably needs discussion with the bionic / glibc folks though (though
> likely this has been discussed already...)

The initial plan was to generate SIGSEGV on asynchronous faults for
uaccess (on syscall return). This changed when we noticed (in version 3
I think) that the kernel over-reads buffers in some cases
(strncpy_from_user(), copy_mount_options()) and triggers false
positives.

We could fix the above two cases, though in different ways:
strncpy_from_user() can align its source (user) address and would no
longer be expected to trigger a fault if the string is correctly tagged.
copy_mount_options(), OTOH, always reads 4K (not zero-terminated), so it
will trip over some tag mismatch. The workaround is to contain the async
tag check fault (with DSB before and after the access) and ignore it.

However, are these the only two cases where the kernel over-reads user
buffers? Without MTE, such faults on uaccess (page faults) were handled
by the kernel transparently. We may now start delivering SIGSEGV every
time some piece of uaccess kernel code changes and over-reads.

> My concern is that the spirit of asynchrous checking in the
> architecture is that accesses _are_ checked, and we seem to be
> breaking that principle here.

I agree with you on the principle but my concern is about the
practicality of chasing any future code changes and plugging potentially
fatal SIGSEGVs sent to the user.

Maybe we need a way to log this so that the user (admin) can do something
about it, like forcing synchronous. Or we could also toggle synchronous
uaccesses irrespective of the user mode or expose this option as a
prctl().

Also, do we want some big knob (sysctl) to force some of these modes for
all user processes: e.g. force-upgrade async to sync?

> > > > > > +excludes all tags other than 0. A user thread can enable specific tags
> > > > > > +in the randomly generated set using the ``prctl(PR_SET_TAGGED_ADDR_CTRL,
> > > > > > +flags, 0, 0, 0)`` system call where ``flags`` contains the tags bitmap
> > > > > > +in the ``PR_MTE_TAG_MASK`` bit-field.
> > > > > > +
> > > > > > +**Note**: The hardware uses an exclude mask but the ``prctl()``
> > > > > > +interface provides an include mask. An include mask of ``0`` (exclusion
> > > > > > +mask ``0xffff``) results in the CPU always generating tag ``0``.
> > > > > 
> > > > > Is there no way to make this default to 1 rather than having a magic
> > > > > meaning for 0?
> > > > 
> > > > We follow the hardware behaviour where 0xffff and 0xfffe give the same
> > > > result.
> > > 
> > > Exposing this through a purely software interface seems a bit odd:
> > > because the exclude mask is privileged-access-only, the architecture
> > > could amend it to assign a different meaning to 0xffff, providing this
> > > was an opt-in change.  Then we'd have to make a mess here.
> > 
> > You have a point. An include mask of 0 translates to an exclude mask of
> > 0xffff as per the current patches. If the hardware gains support for one
> > more bit (32 colours), old software running on new hardware may run into
> > unexpected results with an exclude mask of 0xffff.
> > 
> > > Can't we just forbid the nonsense value 0 here, or are there other
> > > reasons why that's problematic?
> > 
> > It was just easier to start with a default. I wonder whether we should
> > actually switch back to the exclude mask, as per the hardware
> > definition. This way 0 would mean all tags allowed. We can still
> > disallow 0xffff as an exclude mask.
[...]
> The only configuration that doesn't make sense is "no tags allowed", so
> I'd argue for explicity blocking that, even if the architeture aliases
> that encoding to something else.
> 
> If we prefer 0 as a default value so that init inherits the correct
> value from the kernel without any special acrobatics, then we make it an
> exclude mask, with the semantics that the hardware is allowed to
> generate any of these tags, but does not have to be capable of
> generating all of them.

That's more of a question to the libc people and their preference.
We have two options with suboptions:

1. prctl() gets an exclude mask with 0xffff illegal even though the
   hardware accepts it:
   a) default exclude mask 0, allowing all tags to be generated by IRG
   b) default exclude mask of 0xfffe so that only tag 0 is generated

2. prctl() gets an include mask with 0 illegal:
   a) default include mask is 0xffff, allowing all tags to be generated
   b) default include mask of 0x0001 so that only tag 0 is generated

We currently have (2) with mask 0 but could be changed to (2.b). If we
are to follow the hardware description (which makes more sense to me but
I don't write the C library), (1.a) is the most appropriate.
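
For reference, the include and exclude encodings above are just bitwise
complements over the 16 possible tags, e.g. (illustrative only):

    unsigned int include = 0xfffe;             /* allow tags 1-15 */
    unsigned int exclude = ~include & 0xffff;  /* 0x0001: exclude tag 0 */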
Catalin Marinas May 15, 2020, 10:38 a.m. UTC | #9
On Thu, May 14, 2020 at 12:37:22PM +0100, Catalin Marinas wrote:
> On Wed, May 13, 2020 at 04:48:46PM +0100, Dave P Martin wrote:
> > > > > On Wed, Apr 29, 2020 at 05:47:05PM +0100, Dave P Martin wrote:
> > > > > > On Tue, Apr 21, 2020 at 03:26:03PM +0100, Catalin Marinas wrote:
> > > > > > > +excludes all tags other than 0. A user thread can enable specific tags
> > > > > > > +in the randomly generated set using the ``prctl(PR_SET_TAGGED_ADDR_CTRL,
> > > > > > > +flags, 0, 0, 0)`` system call where ``flags`` contains the tags bitmap
> > > > > > > +in the ``PR_MTE_TAG_MASK`` bit-field.
> > > > > > > +
> > > > > > > +**Note**: The hardware uses an exclude mask but the ``prctl()``
> > > > > > > +interface provides an include mask. An include mask of ``0`` (exclusion
> > > > > > > +mask ``0xffff``) results in the CPU always generating tag ``0``.
> > > > > > 
> > > > > > Is there no way to make this default to 1 rather than having a magic
> > > > > > meaning for 0?
> [...]
> > The only configuration that doesn't make sense is "no tags allowed", so
> > I'd argue for explicity blocking that, even if the architeture aliases
> > that encoding to something else.
> > 
> > If we prefer 0 as a default value so that init inherits the correct
> > value from the kernel without any special acrobatics, then we make it an
> > exclude mask, with the semantics that the hardware is allowed to
> > generate any of these tags, but does not have to be capable of
> > generating all of them.
> 
> That's more of a question to the libc people and their preference.
> We have two options with suboptions:
> 
> 1. prctl() gets an exclude mask with 0xffff illegal even though the
>    hardware accepts it:
>    a) default exclude mask 0, allowing all tags to be generated by IRG
>    b) default exclude mask of 0xfffe so that only tag 0 is generated
> 
> 2. prctl() gets an include mask with 0 illegal:
>    a) default include mask is 0xffff, allowing all tags to be generated
>    b) default include mask 0f 0x0001 so that only tag 0 is generated
> 
> We currently have (2) with mask 0 but could be changed to (2.b). If we
> are to follow the hardware description (which makes more sense to me but
> I don't write the C library), (1.a) is the most appropriate.

Thinking some more about this, as we are to expose the GCR_EL1.Excl via
a ptrace interface as a regset, it makes more sense to move back to an
exclude mask here with default 0. That would be option 1.a above.
Szabolcs Nagy May 15, 2020, 11:14 a.m. UTC | #10
The 05/15/2020 11:38, Catalin Marinas wrote:
> On Thu, May 14, 2020 at 12:37:22PM +0100, Catalin Marinas wrote:
> > On Wed, May 13, 2020 at 04:48:46PM +0100, Dave P Martin wrote:
> > > > > > On Wed, Apr 29, 2020 at 05:47:05PM +0100, Dave P Martin wrote:
> > > > > > > On Tue, Apr 21, 2020 at 03:26:03PM +0100, Catalin Marinas wrote:
> > > > > > > > +excludes all tags other than 0. A user thread can enable specific tags
> > > > > > > > +in the randomly generated set using the ``prctl(PR_SET_TAGGED_ADDR_CTRL,
> > > > > > > > +flags, 0, 0, 0)`` system call where ``flags`` contains the tags bitmap
> > > > > > > > +in the ``PR_MTE_TAG_MASK`` bit-field.
> > > > > > > > +
> > > > > > > > +**Note**: The hardware uses an exclude mask but the ``prctl()``
> > > > > > > > +interface provides an include mask. An include mask of ``0`` (exclusion
> > > > > > > > +mask ``0xffff``) results in the CPU always generating tag ``0``.
> > > > > > > 
> > > > > > > Is there no way to make this default to 1 rather than having a magic
> > > > > > > meaning for 0?
> > [...]
> > > The only configuration that doesn't make sense is "no tags allowed", so
> > > I'd argue for explicity blocking that, even if the architeture aliases
> > > that encoding to something else.
> > > 
> > > If we prefer 0 as a default value so that init inherits the correct
> > > value from the kernel without any special acrobatics, then we make it an
> > > exclude mask, with the semantics that the hardware is allowed to
> > > generate any of these tags, but does not have to be capable of
> > > generating all of them.
> > 
> > That's more of a question to the libc people and their preference.
> > We have two options with suboptions:
> > 
> > 1. prctl() gets an exclude mask with 0xffff illegal even though the
> >    hardware accepts it:
> >    a) default exclude mask 0, allowing all tags to be generated by IRG
> >    b) default exclude mask of 0xfffe so that only tag 0 is generated
> > 
> > 2. prctl() gets an include mask with 0 illegal:
> >    a) default include mask is 0xffff, allowing all tags to be generated
> >    b) default include mask 0f 0x0001 so that only tag 0 is generated
> > 
> > We currently have (2) with mask 0 but could be changed to (2.b). If we
> > are to follow the hardware description (which makes more sense to me but
> > I don't write the C library), (1.a) is the most appropriate.
> 
> Thinking some more about this, as we are to expose the GCR_EL1.Excl via
> a ptrace interface as a regset, it makes more sense to move back to an
> exclude mask here with default 0. That would be option 1.a above.

i think the libc has to do a prctl call to set
mte up and at that point it will use whatever
arguments necessary, so 1.a should work (just
like the other options).

likely libc will disable 0 for irg and possibly
one or two other fixed colors (which will have
specific use).

the difference i see between 1 vs 2 is forward
compatibility if the architecture changes (e.g.
adding more tag bits) but then likely new prctl
flag will be needed for handling that so it's
probably not an issue.
Catalin Marinas May 15, 2020, 11:27 a.m. UTC | #11
On Fri, May 15, 2020 at 12:14:00PM +0100, Szabolcs Nagy wrote:
> The 05/15/2020 11:38, Catalin Marinas wrote:
> > On Thu, May 14, 2020 at 12:37:22PM +0100, Catalin Marinas wrote:
> > > We have two options with suboptions:
> > > 
> > > 1. prctl() gets an exclude mask with 0xffff illegal even though the
> > >    hardware accepts it:
> > >    a) default exclude mask 0, allowing all tags to be generated by IRG
> > >    b) default exclude mask of 0xfffe so that only tag 0 is generated
> > > 
> > > 2. prctl() gets an include mask with 0 illegal:
> > >    a) default include mask is 0xffff, allowing all tags to be generated
> > >    b) default include mask 0f 0x0001 so that only tag 0 is generated
> > > 
> > > We currently have (2) with mask 0 but could be changed to (2.b). If we
> > > are to follow the hardware description (which makes more sense to me but
> > > I don't write the C library), (1.a) is the most appropriate.
> > 
> > Thinking some more about this, as we are to expose the GCR_EL1.Excl via
> > a ptrace interface as a regset, it makes more sense to move back to an
> > exclude mask here with default 0. That would be option 1.a above.
> 
> i think the libc has to do a prctl call to set
> mte up and at that point it will use whatever
> arguments necessary, so 1.a should work (just
> like the other options).
> 
> likely libc will disable 0 for irg and possibly
> one or two other fixed colors (which will have
> specific use).
> 
> the difference i see between 1 vs 2 is forward
> compatibility if the architecture changes (e.g.
> adding more tag bits) but then likely new prctl
> flag will be needed for handling that so it's
> probably not an issue.

Thanks Szabolcs. While we are at this, no-one so far asked for the
GCR_EL1.RRND to be exposed to user (and this implies RGSR_EL1.SEED).
Since RRND=1 guarantees a distribution "no worse" than that of RRND=0, I
thought there isn't much point in exposing this configuration to the
user. The only advantage of RRND=0 I see is that the kernel can change
the seed randomly but, with only 4 bits per tag, it really doesn't
matter much.

Anyway, mentioning it here in case anyone is surprised later about the
lack of RRND configurability.
Szabolcs Nagy May 15, 2020, 12:04 p.m. UTC | #12
The 05/15/2020 12:27, Catalin Marinas wrote:
> Thanks Szabolcs. While we are at this, no-one so far asked for the
> GCR_EL1.RRND to be exposed to user (and this implies RGSR_EL1.SEED).
> Since RRND=1 guarantees a distribution "no worse" than that of RRND=0, I
> thought there isn't much point in exposing this configuration to the
> user. The only advantage of RRND=0 I see is that the kernel can change

it seems RRND=1 is the impl specific algorithm.

> the seed randomly but, with only 4 bits per tag, it really doesn't
> matter much.
> 
> Anyway, mentioning it here in case anyone is surprised later about the
> lack of RRND configurability.

i'm not familiar with how irg works.

is the seed per process state that's set
up at process startup in some way?
or shared (and thus effectively irg is
non-deterministic in userspace)?
Catalin Marinas May 15, 2020, 12:13 p.m. UTC | #13
On Fri, May 15, 2020 at 01:04:33PM +0100, Szabolcs Nagy wrote:
> The 05/15/2020 12:27, Catalin Marinas wrote:
> > Thanks Szabolcs. While we are at this, no-one so far asked for the
> > GCR_EL1.RRND to be exposed to user (and this implies RGSR_EL1.SEED).
> > Since RRND=1 guarantees a distribution "no worse" than that of RRND=0, I
> > thought there isn't much point in exposing this configuration to the
> > user. The only advantage of RRND=0 I see is that the kernel can change
> 
> it seems RRND=1 is the impl specific algorithm.

Yes, that's the implementation specific algorithm which shouldn't be
worse than the standard one.

> > the seed randomly but, with only 4 bits per tag, it really doesn't
> > matter much.
> > 
> > Anyway, mentioning it here in case anyone is surprised later about the
> > lack of RRND configurability.
> 
> i'm not familiar with how irg works.

It generates a random tag based on some algorithm.

> is the seed per process state that's set up at process startup in some
> way? or shared (and thus effectively irg is non-deterministic in
> userspace)?

The seed is only relevant if the standard algorithm is used (RRND=0).
Szabolcs Nagy May 15, 2020, 12:53 p.m. UTC | #14
The 05/15/2020 13:13, Catalin Marinas wrote:
> On Fri, May 15, 2020 at 01:04:33PM +0100, Szabolcs Nagy wrote:
> > The 05/15/2020 12:27, Catalin Marinas wrote:
> > > Thanks Szabolcs. While we are at this, no-one so far asked for the
> > > GCR_EL1.RRND to be exposed to user (and this implies RGSR_EL1.SEED).
> > > Since RRND=1 guarantees a distribution "no worse" than that of RRND=0, I
> > > thought there isn't much point in exposing this configuration to the
> > > user. The only advantage of RRND=0 I see is that the kernel can change
> > 
> > it seems RRND=1 is the impl specific algorithm.
> 
> Yes, that's the implementation specific algorithm which shouldn't be
> worse than the standard one.
> 
> > > the seed randomly but, with only 4 bits per tag, it really doesn't
> > > matter much.
> > > 
> > > Anyway, mentioning it here in case anyone is surprised later about the
> > > lack of RRND configurability.
> > 
> > i'm not familiar with how irg works.
> 
> It generates a random tag based on some algorithm.
> 
> > is the seed per process state that's set up at process startup in some
> > way? or shared (and thus effectively irg is non-deterministic in
> > userspace)?
> 
> The seed is only relevant if the standard algorithm is used (RRND=0).

i wanted to understand if we can get deterministic
irg behaviour in user space (which may be useful
for debugging to get reproducible tag failures).

i guess if no control is exposed that means non-
deterministic irg. i think this is fine.
Dave Martin May 18, 2020, 4:52 p.m. UTC | #15
On Fri, May 15, 2020 at 01:53:32PM +0100, Szabolcs Nagy wrote:
> The 05/15/2020 13:13, Catalin Marinas wrote:
> > On Fri, May 15, 2020 at 01:04:33PM +0100, Szabolcs Nagy wrote:
> > > The 05/15/2020 12:27, Catalin Marinas wrote:
> > > > Thanks Szabolcs. While we are at this, no-one so far asked for the
> > > > GCR_EL1.RRND to be exposed to user (and this implies RGSR_EL1.SEED).
> > > > Since RRND=1 guarantees a distribution "no worse" than that of RRND=0, I
> > > > thought there isn't much point in exposing this configuration to the
> > > > user. The only advantage of RRND=0 I see is that the kernel can change
> > > 
> > > it seems RRND=1 is the impl specific algorithm.
> > 
> > Yes, that's the implementation specific algorithm which shouldn't be
> > worse than the standard one.
> > 
> > > > the seed randomly but, with only 4 bits per tag, it really doesn't
> > > > matter much.
> > > > 
> > > > Anyway, mentioning it here in case anyone is surprised later about the
> > > > lack of RRND configurability.
> > > 
> > > i'm not familiar with how irg works.
> > 
> > It generates a random tag based on some algorithm.
> > 
> > > is the seed per process state that's set up at process startup in some
> > > way? or shared (and thus effectively irg is non-deterministic in
> > > userspace)?
> > 
> > The seed is only relevant if the standard algorithm is used (RRND=0).
> 
> i wanted to understand if we can get deterministic
> irg behaviour in user space (which may be useful
> for debugging to get reproducible tag failures).
> 
> i guess if no control is exposed that means non-
> deterministic irg. i think this is fine.

Hmmm, I guess this might eventually be wanted.  But it's probably OK not
to have it to begin with.

Things like CRIU restores won't be reproducible unless the seeds can be
saved/restored.

Doesn't seem essential from day 1 though.

Cheers
---Dave
Catalin Marinas May 18, 2020, 5:13 p.m. UTC | #16
On Thu, May 14, 2020 at 12:37:22PM +0100, Catalin Marinas wrote:
> On Wed, May 13, 2020 at 04:48:46PM +0100, Dave P Martin wrote:
> > On Mon, May 11, 2020 at 05:40:19PM +0100, Catalin Marinas wrote:
> > > On Mon, May 04, 2020 at 05:46:17PM +0100, Dave P Martin wrote:
> > > > On Thu, Apr 30, 2020 at 05:23:17PM +0100, Catalin Marinas wrote:
> > > > > On Wed, Apr 29, 2020 at 05:47:05PM +0100, Dave P Martin wrote:
> > > > > > On Tue, Apr 21, 2020 at 03:26:03PM +0100, Catalin Marinas wrote:
> > > > > > > +excludes all tags other than 0. A user thread can enable specific tags
> > > > > > > +in the randomly generated set using the ``prctl(PR_SET_TAGGED_ADDR_CTRL,
> > > > > > > +flags, 0, 0, 0)`` system call where ``flags`` contains the tags bitmap
> > > > > > > +in the ``PR_MTE_TAG_MASK`` bit-field.
> > > > > > > +
> > > > > > > +**Note**: The hardware uses an exclude mask but the ``prctl()``
> > > > > > > +interface provides an include mask. An include mask of ``0`` (exclusion
> > > > > > > +mask ``0xffff``) results in the CPU always generating tag ``0``.
> > > > > > 
> > > > > > Is there no way to make this default to 1 rather than having a magic
> > > > > > meaning for 0?
> > > > > 
> > > > > We follow the hardware behaviour where 0xffff and 0xfffe give the same
> > > > > result.
> > > > 
> > > > Exposing this through a purely software interface seems a bit odd:
> > > > because the exclude mask is privileged-access-only, the architecture
> > > > could amend it to assign a different meaning to 0xffff, providing this
> > > > was an opt-in change.  Then we'd have to make a mess here.
> > > 
> > > You have a point. An include mask of 0 translates to an exclude mask of
> > > 0xffff as per the current patches. If the hardware gains support for one
> > > more bit (32 colours), old software running on new hardware may run into
> > > unexpected results with an exclude mask of 0xffff.
> > > 
> > > > Can't we just forbid the nonsense value 0 here, or are there other
> > > > reasons why that's problematic?
> > > 
> > > It was just easier to start with a default. I wonder whether we should
> > > actually switch back to the exclude mask, as per the hardware
> > > definition. This way 0 would mean all tags allowed. We can still
> > > disallow 0xffff as an exclude mask.
> [...]
> > The only configuration that doesn't make sense is "no tags allowed", so
> > I'd argue for explicity blocking that, even if the architeture aliases
> > that encoding to something else.
> > 
> > If we prefer 0 as a default value so that init inherits the correct
> > value from the kernel without any special acrobatics, then we make it an
> > exclude mask, with the semantics that the hardware is allowed to
> > generate any of these tags, but does not have to be capable of
> > generating all of them.
> 
> That's more of a question to the libc people and their preference.
> We have two options with suboptions:
> 
> 1. prctl() gets an exclude mask with 0xffff illegal even though the
>    hardware accepts it:
>    a) default exclude mask 0, allowing all tags to be generated by IRG
>    b) default exclude mask of 0xfffe so that only tag 0 is generated
> 
> 2. prctl() gets an include mask with 0 illegal:
>    a) default include mask is 0xffff, allowing all tags to be generated
>    b) default include mask 0f 0x0001 so that only tag 0 is generated
> 
> We currently have (2) with mask 0 but could be changed to (2.b). If we
> are to follow the hardware description (which makes more sense to me but
> I don't write the C library), (1.a) is the most appropriate.

As Peter pointed out on Friday (call), 2.b doesn't work as it breaks the
existing prctl() for turning on the tagged address ABI. So we have to
accept 0 as the tag mask field.

Dave, if you feel strongly about avoiding the exclude mask confusion
with 0xffff equivalent to 0xfffe, I'll go for 1.a. I have not changed
this in the v4 series of the patches (no ABI change in there apart from
some minor ptrace tweaks).
diff mbox series

Patch

diff --git a/Documentation/arm64/cpu-feature-registers.rst b/Documentation/arm64/cpu-feature-registers.rst
index 41937a8091aa..b5679fa85ad9 100644
--- a/Documentation/arm64/cpu-feature-registers.rst
+++ b/Documentation/arm64/cpu-feature-registers.rst
@@ -174,6 +174,8 @@  infrastructure:
      +------------------------------+---------+---------+
      | Name                         |  bits   | visible |
      +------------------------------+---------+---------+
+     | MTE                          | [11-8]  |    y    |
+     +------------------------------+---------+---------+
      | SSBS                         | [7-4]   |    y    |
      +------------------------------+---------+---------+
 
diff --git a/Documentation/arm64/elf_hwcaps.rst b/Documentation/arm64/elf_hwcaps.rst
index 7dfb97dfe416..ca7f90e99e3a 100644
--- a/Documentation/arm64/elf_hwcaps.rst
+++ b/Documentation/arm64/elf_hwcaps.rst
@@ -236,6 +236,11 @@  HWCAP2_RNG
 
     Functionality implied by ID_AA64ISAR0_EL1.RNDR == 0b0001.
 
+HWCAP2_MTE
+
+    Functionality implied by ID_AA64PFR1_EL1.MTE == 0b0010, as described
+    by Documentation/arm64/memory-tagging-extension.rst.
+
 4. Unused AT_HWCAP bits
 -----------------------
 
diff --git a/Documentation/arm64/index.rst b/Documentation/arm64/index.rst
index 09cbb4ed2237..4cd0e696f064 100644
--- a/Documentation/arm64/index.rst
+++ b/Documentation/arm64/index.rst
@@ -14,6 +14,7 @@  ARM64 Architecture
     hugetlbpage
     legacy_instructions
     memory
+    memory-tagging-extension
     pointer-authentication
     silicon-errata
     sve
diff --git a/Documentation/arm64/memory-tagging-extension.rst b/Documentation/arm64/memory-tagging-extension.rst
new file mode 100644
index 000000000000..f82dfbd70061
--- /dev/null
+++ b/Documentation/arm64/memory-tagging-extension.rst
@@ -0,0 +1,260 @@ 
+===============================================
+Memory Tagging Extension (MTE) in AArch64 Linux
+===============================================
+
+Authors: Vincenzo Frascino <vincenzo.frascino@arm.com>
+         Catalin Marinas <catalin.marinas@arm.com>
+
+Date: 2020-02-25
+
+This document describes the provision of the Memory Tagging Extension
+functionality in AArch64 Linux.
+
+Introduction
+============
+
+ARMv8.5 based processors introduce the Memory Tagging Extension (MTE)
+feature. MTE is built on top of the ARMv8.0 virtual address tagging TBI
+(Top Byte Ignore) feature and allows software to access a 4-bit
+allocation tag for each 16-byte granule in the physical address space.
+Such a memory range must be mapped with the Normal-Tagged memory
+attribute. A logical tag is derived from bits 59-56 of the virtual
+address used for the memory access. A CPU with MTE enabled will compare
+the logical tag against the allocation tag and potentially raise an
+exception on mismatch, subject to system register configuration.
+
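+The logical tag sits in the (otherwise ignored) top byte of the
+pointer; a minimal illustration, not part of any kernel API, of placing
+a tag into bits 59-56:
+
+.. code-block:: c
+
+    /* return 'ptr' with logical tag 'tag' (0-15) in bits 59-56 */
+    static inline void *set_logical_tag(void *ptr, unsigned int tag)
+    {
+            unsigned long addr = (unsigned long)ptr;
+
+            addr &= ~(0xfUL << 56);
+            addr |= (unsigned long)(tag & 0xf) << 56;
+            return (void *)addr;
+    }
+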
+Userspace Support
+=================
+
+When ``CONFIG_ARM64_MTE`` is selected and Memory Tagging Extension is
+supported by the hardware, the kernel advertises the feature to
+userspace via ``HWCAP2_MTE``.
+
+PROT_MTE
+--------
+
+To access the allocation tags, a user process must enable the Tagged
+memory attribute on an address range using a new ``prot`` flag for
+``mmap()`` and ``mprotect()``:
+
+``PROT_MTE`` - Pages allow access to the MTE allocation tags.
+
+The allocation tag is set to 0 when such pages are first mapped in the
+user address space and preserved on copy-on-write. ``MAP_SHARED`` is
+supported and the allocation tags can be shared between processes.
+
+**Note**: ``PROT_MTE`` is only supported on ``MAP_ANONYMOUS`` and
+RAM-based file mappings (``tmpfs``, ``memfd``). Passing it to other
+types of mapping will result in ``-EINVAL`` returned by these system
+calls.
+
+**Note**: The ``PROT_MTE`` flag (and corresponding memory type) cannot
+be cleared by ``mprotect()``.
+
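+For example, a tagged anonymous mapping can be requested directly from
+``mmap()`` (a shortened variant of the full example at the end of this
+document):
+
+.. code-block:: c
+
+    /* PROT_MTE as defined in arch/arm64/include/uapi/asm/mman.h */
+    void *a = mmap(0, getpagesize(), PROT_READ | PROT_WRITE | PROT_MTE,
+                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+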
+Tag Check Faults
+----------------
+
+When ``PROT_MTE`` is enabled on an address range and a mismatch between
+the logical and allocation tags occurs on access, there are three
+configurable behaviours:
+
+- *Ignore* - This is the default mode. The CPU (and kernel) ignores the
+  tag check fault.
+
+- *Synchronous* - The kernel raises a ``SIGSEGV`` synchronously, with
+  ``.si_code = SEGV_MTESERR`` and ``.si_addr = <fault-address>``. The
+  memory access is not performed.
+
+- *Asynchronous* - The kernel raises a ``SIGSEGV``, in the offending
+  thread, asynchronously following one or multiple tag check faults,
+  with ``.si_code = SEGV_MTEAERR`` and ``.si_addr = 0`` (the faulting
+  address is not reported).
+
+**Note**: There are no *match-all* logical tags available for user
+applications.
+
+The user can select the above modes, per thread, using the
+``prctl(PR_SET_TAGGED_ADDR_CTRL, flags, 0, 0, 0)`` system call where
+``flags`` contains one of the following values in the ``PR_MTE_TCF_MASK``
+bit-field:
+
+- ``PR_MTE_TCF_NONE``  - *Ignore* tag check faults
+- ``PR_MTE_TCF_SYNC``  - *Synchronous* tag check fault mode
+- ``PR_MTE_TCF_ASYNC`` - *Asynchronous* tag check fault mode
+
+Tag checking can also be disabled for a user thread by setting the
+``PSTATE.TCO`` bit with ``MSR TCO, #1``.
+
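+For instance, a short sequence that must not be interrupted by tag check
+faults could be bracketed as below (an illustrative sketch; requires
+``-march=armv8.5-a+memtag``):
+
+.. code-block:: c
+
+    /* mask tag check faults around a possibly mismatching access */
+    asm volatile("msr tco, #1" : : : "memory");
+    /* ... accesses that must not be tag checked ... */
+    asm volatile("msr tco, #0" : : : "memory");
+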
+**Note**: Signal handlers are always invoked with ``PSTATE.TCO = 0``,
+irrespective of the interrupted context.
+
+**Note**: Kernel accesses to user memory (e.g. ``read()`` system call)
+are only checked if the current thread's tag checking mode is
+``PR_MTE_TCF_SYNC``, and even then only on a best-effort basis: accesses
+performed via a kernel mapping of the user pages (e.g. following
+``get_user_pages()``) are not tag checked.
+
+Excluding Tags in the ``IRG``, ``ADDG`` and ``SUBG`` instructions
+-----------------------------------------------------------------
+
+The architecture allows certain tags to be excluded from random
+generation via the ``GCR_EL1.Exclude`` register bit-field. By default, Linux
+excludes all tags other than 0. A user thread can enable specific tags
+in the randomly generated set using the ``prctl(PR_SET_TAGGED_ADDR_CTRL,
+flags, 0, 0, 0)`` system call where ``flags`` contains the tags bitmap
+in the ``PR_MTE_TAG_MASK`` bit-field.
+
+**Note**: The hardware uses an exclude mask but the ``prctl()``
+interface provides an include mask. An include mask of ``0`` (exclusion
+mask ``0xffff``) results in the CPU always generating tag ``0``.
+
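+The current setting can be read back with the corresponding
+``prctl(PR_GET_TAGGED_ADDR_CTRL, 0, 0, 0, 0)`` call; a brief sketch
+using the definitions from the example at the end of this document:
+
+.. code-block:: c
+
+    long ctrl = prctl(PR_GET_TAGGED_ADDR_CTRL, 0, 0, 0, 0);
+
+    if (ctrl >= 0)
+            /* bit N set: tag N may be generated by IRG */
+            printf("include mask: 0x%lx\n",
+                   (ctrl & PR_MTE_TAG_MASK) >> PR_MTE_TAG_SHIFT);
+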
+The ``ptrace()`` interface
+--------------------------
+
+``PTRACE_PEEKMTETAGS`` and ``PTRACE_POKEMTETAGS`` allow a tracer to read
+the tags from or set the tags to a tracee's address space. The
+``ptrace()`` syscall is invoked as ``ptrace(request, pid, addr, data)``
+where:
+
+- ``request`` - one of ``PTRACE_PEEKMTETAGS`` or ``PTRACE_POKEMTETAGS``.
+- ``pid`` - the tracee's PID.
+- ``addr`` - address in the tracee's address space.
+- ``data`` - pointer to a ``struct iovec`` where ``iov_base`` points to
+  a buffer of ``iov_len`` length in the tracer's address space.
+
+The tags in the tracer's ``iov_base`` buffer are represented as one tag
+per byte and correspond to a 16-byte MTE tag granule in the tracee's
+address space.
+
+``ptrace()`` return value:
+
+- 0 - success, the tracer's ``iov_len`` was updated to the number of
+  tags copied (it may be smaller than the requested ``iov_len`` if the
+  requested address range in the tracee's or the tracer's space cannot
+  be fully accessed).
+- ``-EPERM`` - the specified process cannot be traced.
+- ``-EIO`` - the tracee's address range cannot be accessed (e.g. invalid
+  address) and no tags copied. ``iov_len`` not updated.
+- ``-EFAULT`` - fault on accessing the tracer's memory (``struct iovec``
+  or ``iov_base`` buffer) and no tags copied. ``iov_len`` not updated.
+
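+A sketch of reading the tags covering ``len`` bytes at ``addr`` in a
+stopped tracee (error handling omitted; ``PTRACE_PEEKMTETAGS`` is
+defined by this series in the arm64 uapi headers):
+
+.. code-block:: c
+
+    #include <sys/ptrace.h>
+    #include <sys/types.h>
+    #include <sys/uio.h>
+
+    long peek_tags(pid_t pid, void *addr, unsigned char *tags, size_t len)
+    {
+            struct iovec iov = {
+                    .iov_base = tags,
+                    /* one tag byte per 16-byte MTE granule */
+                    .iov_len  = len / 16,
+            };
+
+            return ptrace(PTRACE_PEEKMTETAGS, pid, addr, &iov);
+    }
+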
+Example of correct usage
+========================
+
+*MTE Example code*
+
+.. code-block:: c
+
+    /*
+     * To be compiled with -march=armv8.5-a+memtag
+     */
+    #include <errno.h>
+    #include <stdio.h>
+    #include <stdlib.h>
+    #include <unistd.h>
+    #include <linux/types.h>        /* for __u64 used below */
+    #include <sys/auxv.h>
+    #include <sys/mman.h>
+    #include <sys/prctl.h>
+
+    /*
+     * From arch/arm64/include/uapi/asm/hwcap.h
+     */
+    #define HWCAP2_MTE              (1 << 18)
+
+    /*
+     * From arch/arm64/include/uapi/asm/mman.h
+     */
+    #define PROT_MTE                 0x20
+
+    /*
+     * From include/uapi/linux/prctl.h
+     */
+    #define PR_SET_TAGGED_ADDR_CTRL 55
+    #define PR_GET_TAGGED_ADDR_CTRL 56
+    # define PR_TAGGED_ADDR_ENABLE  (1UL << 0)
+    # define PR_MTE_TCF_SHIFT       1
+    # define PR_MTE_TCF_NONE        (0UL << PR_MTE_TCF_SHIFT)
+    # define PR_MTE_TCF_SYNC        (1UL << PR_MTE_TCF_SHIFT)
+    # define PR_MTE_TCF_ASYNC       (2UL << PR_MTE_TCF_SHIFT)
+    # define PR_MTE_TCF_MASK        (3UL << PR_MTE_TCF_SHIFT)
+    # define PR_MTE_TAG_SHIFT       3
+    # define PR_MTE_TAG_MASK        (0xffffUL << PR_MTE_TAG_SHIFT)
+
+    /*
+     * Insert a random logical tag into the given pointer.
+     */
+    #define insert_random_tag(ptr) ({                       \
+            __u64 __val;                                    \
+            asm("irg %0, %1" : "=r" (__val) : "r" (ptr));   \
+            __val;                                          \
+    })
+
+    /*
+     * Set the allocation tag on the destination address.
+     */
+    #define set_tag(tagged_addr) do {                                      \
+            asm volatile("stg %0, [%0]" : : "r" (tagged_addr) : "memory"); \
+    } while (0)
+
+    int main()
+    {
+            unsigned long *a;
+            unsigned long page_sz = getpagesize();
+            unsigned long hwcap2 = getauxval(AT_HWCAP2);
+
+            /* check if MTE is present */
+            if (!(hwcap2 & HWCAP2_MTE))
+                    return -1;
+
+            /*
+             * Enable the tagged address ABI, synchronous MTE tag check faults and
+             * allow all non-zero tags in the randomly generated set.
+             */
+            if (prctl(PR_SET_TAGGED_ADDR_CTRL,
+                      PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC | (0xfffe << PR_MTE_TAG_SHIFT),
+                      0, 0, 0)) {
+                    perror("prctl() failed");
+                    return -1;
+            }
+
+            a = mmap(0, page_sz, PROT_READ | PROT_WRITE,
+                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+            if (a == MAP_FAILED) {
+                    perror("mmap() failed");
+                    return -1;
+            }
+
+            /*
+             * Enable MTE on the above anonymous mmap. The flag could be passed
+             * directly to mmap() and skip this step.
+             */
+            if (mprotect(a, page_sz, PROT_READ | PROT_WRITE | PROT_MTE)) {
+                    perror("mprotect() failed");
+                    return -1;
+            }
+
+            /* access with the default tag (0) */
+            a[0] = 1;
+            a[1] = 2;
+
+            printf("a[0] = %lu a[1] = %lu\n", a[0], a[1]);
+
+            /* set the logical and allocation tags */
+            a = (unsigned long *)insert_random_tag(a);
+            set_tag(a);
+
+            printf("%p\n", a);
+
+            /* non-zero tag access */
+            a[0] = 3;
+            printf("a[0] = %lu a[1] = %lu\n", a[0], a[1]);
+
+            /*
+             * If MTE is enabled correctly the next instruction will generate an
+             * exception.
+             */
+            printf("Expecting SIGSEGV...\n");
+            a[2] = 0xdead;
+
+            /* this should not be printed in the PR_MTE_TCF_SYNC mode */
+            printf("...done\n");
+
+            return 0;
+    }