Message ID | 20200421142603.3894-24-catalin.marinas@arm.com (mailing list archive) |
---|---|
State | New, archived |
Series | arm64: Memory Tagging Extension user-space support |
On Tue, Apr 21, 2020 at 03:26:03PM +0100, Catalin Marinas wrote:
> From: Vincenzo Frascino <vincenzo.frascino@arm.com>
>
> Memory Tagging Extension (part of the ARMv8.5 Extensions) provides
> a mechanism to detect the sources of memory related errors which
> may be vulnerable to exploitation, including bounds violations,
> use-after-free, use-after-return, use-out-of-scope and use before
> initialization errors.
>
> Add Memory Tagging Extension documentation for the arm64 linux
> kernel support.
>
> Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
> Co-developed-by: Catalin Marinas <catalin.marinas@arm.com>
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> ---
>
> Notes:
>     v3:
>     - Modify the uaccess checking conditions: only when the sync mode is
>       selected by the user. In async mode, the kernel uaccesses are not
>       checked.
>     - Clarify that an include mask of 0 (exclude mask 0xffff) results in
>       always generating tag 0.
>     - Document the ptrace() interface.
>
>     v2:
>     - Documented the uaccess kernel tag checking mode.
>     - Removed the BTI definitions from cpu-feature-registers.rst.
>     - Removed the paragraph stating that MTE depends on the tagged address
>       ABI (while the Kconfig entry does, there is no requirement for the
>       user to enable both).
>     - Changed the GCR_EL1.Exclude handling description following the change
>       in the prctl() interface (include vs exclude mask).
>     - Updated the example code.
>
>  Documentation/arm64/cpu-feature-registers.rst |   2 +
>  Documentation/arm64/elf_hwcaps.rst            |   5 +
>  Documentation/arm64/index.rst                 |   1 +
>  .../arm64/memory-tagging-extension.rst        | 260 ++++++++++++++++++
>  4 files changed, 268 insertions(+)
>  create mode 100644 Documentation/arm64/memory-tagging-extension.rst
>
> diff --git a/Documentation/arm64/cpu-feature-registers.rst b/Documentation/arm64/cpu-feature-registers.rst
> index 41937a8091aa..b5679fa85ad9 100644
> --- a/Documentation/arm64/cpu-feature-registers.rst
> +++ b/Documentation/arm64/cpu-feature-registers.rst
> @@ -174,6 +174,8 @@ infrastructure:
>       +------------------------------+---------+---------+
>       | Name                         |  bits   | visible |
>       +------------------------------+---------+---------+
> +     | MTE                          | [11-8]  |    y    |
> +     +------------------------------+---------+---------+
>       | SSBS                         | [7-4]   |    y    |
>       +------------------------------+---------+---------+
>
> diff --git a/Documentation/arm64/elf_hwcaps.rst b/Documentation/arm64/elf_hwcaps.rst
> index 7dfb97dfe416..ca7f90e99e3a 100644
> --- a/Documentation/arm64/elf_hwcaps.rst
> +++ b/Documentation/arm64/elf_hwcaps.rst
> @@ -236,6 +236,11 @@ HWCAP2_RNG
>
>      Functionality implied by ID_AA64ISAR0_EL1.RNDR == 0b0001.
>
> +HWCAP2_MTE
> +
> +    Functionality implied by ID_AA64PFR1_EL1.MTE == 0b0010, as described
> +    by Documentation/arm64/memory-tagging-extension.rst.
> +
>  4. Unused AT_HWCAP bits
>  -----------------------
>
> diff --git a/Documentation/arm64/index.rst b/Documentation/arm64/index.rst
> index 09cbb4ed2237..4cd0e696f064 100644
> --- a/Documentation/arm64/index.rst
> +++ b/Documentation/arm64/index.rst
> @@ -14,6 +14,7 @@ ARM64 Architecture
>      hugetlbpage
>      legacy_instructions
>      memory
> +    memory-tagging-extension
>      pointer-authentication
>      silicon-errata
>      sve
> diff --git a/Documentation/arm64/memory-tagging-extension.rst b/Documentation/arm64/memory-tagging-extension.rst
> new file mode 100644
> index 000000000000..f82dfbd70061
> --- /dev/null
> +++ b/Documentation/arm64/memory-tagging-extension.rst
> @@ -0,0 +1,260 @@
> +===============================================
> +Memory Tagging Extension (MTE) in AArch64 Linux
> +===============================================
> +
> +Authors: Vincenzo Frascino <vincenzo.frascino@arm.com>
> +         Catalin Marinas <catalin.marinas@arm.com>
> +
> +Date: 2020-02-25
> +
> +This document describes the provision of the Memory Tagging Extension
> +functionality in AArch64 Linux.
> +
> +Introduction
> +============
> +
> +ARMv8.5 based processors introduce the Memory Tagging Extension (MTE)
> +feature. MTE is built on top of the ARMv8.0 virtual address tagging TBI
> +(Top Byte Ignore) feature and allows software to access a 4-bit
> +allocation tag for each 16-byte granule in the physical address space.
> +Such memory range must be mapped with the Normal-Tagged memory
> +attribute. A logical tag is derived from bits 59-56 of the virtual
> +address used for the memory access. A CPU with MTE enabled will compare
> +the logical tag against the allocation tag and potentially raise an
> +exception on mismatch, subject to system registers configuration.
> +
> +Userspace Support
> +=================
> +
> +When ``CONFIG_ARM64_MTE`` is selected and Memory Tagging Extension is
> +supported by the hardware, the kernel advertises the feature to
> +userspace via ``HWCAP2_MTE``.
> +
> +PROT_MTE
> +--------
> +
> +To access the allocation tags, a user process must enable the Tagged
> +memory attribute on an address range using a new ``prot`` flag for
> +``mmap()`` and ``mprotect()``:
> +
> +``PROT_MTE`` - Pages allow access to the MTE allocation tags.
> +
> +The allocation tag is set to 0 when such pages are first mapped in the
> +user address space and preserved on copy-on-write. ``MAP_SHARED`` is
> +supported and the allocation tags can be shared between processes.
> +
> +**Note**: ``PROT_MTE`` is only supported on ``MAP_ANONYMOUS`` and
> +RAM-based file mappings (``tmpfs``, ``memfd``). Passing it to other
> +types of mapping will result in ``-EINVAL`` returned by these system
> +calls.
> +
> +**Note**: The ``PROT_MTE`` flag (and corresponding memory type) cannot
> +be cleared by ``mprotect()``.

What enforces this? I don't have my head fully around the code yet.

I'm wondering whether attempting to clear PROT_MTE should be reported as
an error. Is there any rationale for not doing so?

> +
> +Tag Check Faults
> +----------------
> +
> +When ``PROT_MTE`` is enabled on an address range and a mismatch between
> +the logical and allocation tags occurs on access, there are three
> +configurable behaviours:
> +
> +- *Ignore* - This is the default mode. The CPU (and kernel) ignores the
> +  tag check fault.
> +
> +- *Synchronous* - The kernel raises a ``SIGSEGV`` synchronously, with
> +  ``.si_code = SEGV_MTESERR`` and ``.si_addr = <fault-address>``. The
> +  memory access is not performed.

Also say that in this case, if SIGSEGV is ignored or blocked by the
offending thread then the containing process is terminated with a
coredump (at least, that's what ought to happen).

> +
> +- *Asynchronous* - The kernel raises a ``SIGSEGV``, in the current
> +  thread, asynchronously following one or multiple tag check faults,
> +  with ``.si_code = SEGV_MTEAERR`` and ``.si_addr = 0``.

For "current thread": that's a kernel concept.
For user-facing documentation, can we say "the offending thread" or
similar?

For clarity, it's worth saying that the faulting address is not
reported. Or, we could be optimistic that someday this information will
be available and say that si_addr is the faulting address if available,
with 0 meaning the address is not available.

Maybe (void *)-1 would be a better duff address, but I can't see it
mattering much. If there's already precedent for si_addr==0 elsewhere,
it makes sense to follow it.

> +
> +**Note**: There are no *match-all* logical tags available for user
> +applications.

This note seems misplaced.

> +
> +The user can select the above modes, per thread, using the
> +``prctl(PR_SET_TAGGED_ADDR_CTRL, flags, 0, 0, 0)`` system call where

PR_GET_TAGGED_ADDR_CTRL seems to be missing here.

> +``flags`` contain one of the following values in the ``PR_MTE_TCF_MASK``
> +bit-field:
> +
> +- ``PR_MTE_TCF_NONE`` - *Ignore* tag check faults
> +- ``PR_MTE_TCF_SYNC`` - *Synchronous* tag check fault mode
> +- ``PR_MTE_TCF_ASYNC`` - *Asynchronous* tag check fault mode

Done naively, this will destroy the PR_MTE_TAG_MASK field. Is there a
preferred way to change only parts of this control word? If the answer
is "cache the value in userspace if you care about performance, or
otherwise use PR_GET_TAGGED_ADDR_CTRL as part of a read-modify-write,"
so be it.

If we think this might be an issue for software, it might be worth
splitting out separate prctls for each field.

> +
> +Tag checking can also be disabled for a user thread by setting the
> +``PSTATE.TCO`` bit with ``MSR TCO, #1``.

Users should probably not touch this unless they know what they're
doing -- should this flag ever be left set across function boundaries
etc.?

What's it for? Temporarily masking MTE faults in critical sections?
Is this self-synchronising... what happens to pending asynchronous
faults? Are faults occurring while the flag is set pended or discarded?
(Deliberately not reading the spec here -- if the explanation is not
straightforward, then it may be sufficient to tell people to go read
it.)

> +
> +**Note**: Signal handlers are always invoked with ``PSTATE.TCO = 0``,
> +irrespective of the interrupted context.

Rationale? Do we have advice on what signal handlers should do?

Is PSTATE.TC0 restored by sigreturn?

> +
> +**Note**: Kernel accesses to user memory (e.g. ``read()`` system call)
> +are only checked if the current thread tag checking mode is
> +PR_MTE_TCF_SYNC.

Vague? Can we make a precise statement about when the kernel will and
won't check such accesses? And aren't there limitations (like use of
get_user_pages() etc.)?

> +
> +Excluding Tags in the ``IRG``, ``ADDG`` and ``SUBG`` instructions
> +-----------------------------------------------------------------
> +
> +The architecture allows excluding certain tags to be randomly generated
> +via the ``GCR_EL1.Exclude`` register bit-field. By default, Linux

Can we have a separate section on what execve() and fork()/clone() do
to the MTE controls and PSTATE.TCO? "By default" could mean a variety
of things, and I'm not sure we cover everything.

Is PROT_MTE ever set on the initial pages mapped by execve()?

> +excludes all tags other than 0. A user thread can enable specific tags
> +in the randomly generated set using the ``prctl(PR_SET_TAGGED_ADDR_CTRL,
> +flags, 0, 0, 0)`` system call where ``flags`` contains the tags bitmap
> +in the ``PR_MTE_TAG_MASK`` bit-field.
> +
> +**Note**: The hardware uses an exclude mask but the ``prctl()``
> +interface provides an include mask. An include mask of ``0`` (exclusion
> +mask ``0xffff``) results in the CPU always generating tag ``0``.

Is there no way to make this default to 1 rather than having a magic
meaning for 0?

> +
> +The ``ptrace()`` interface
> +--------------------------
> +
> +``PTRACE_PEEKMTETAGS`` and ``PTRACE_POKEMTETAGS`` allow a tracer to read
> +the tags from or set the tags to a tracee's address space.
> +The ``ptrace()`` syscall is invoked as ``ptrace(request, pid, addr, data)``
> +where:
> +
> +- ``request`` - one of ``PTRACE_PEEKMTETAGS`` or ``PTRACE_POKEMTETAGS``.
> +- ``pid`` - the tracee's PID.
> +- ``addr`` - address in the tracee's address space.

What if addr is not 16-byte aligned? Is this considered valid use?

> +- ``data`` - pointer to a ``struct iovec`` where ``iov_base`` points to
> +  a buffer of ``iov_len`` length in the tracer's address space.

What's the data format for the copied tags?

> +
> +The tags in the tracer's ``iov_base`` buffer are represented as one tag
> +per byte and correspond to a 16-byte MTE tag granule in the tracee's
> +address space.

We could say that the whole operation accesses the tags for 16 * iov_len
bytes of the tracee's address space. Maybe superfluous though.

> +
> +``ptrace()`` return value:
> +
> +- 0 - success, the tracer's ``iov_len`` was updated to the number of
> +  tags copied (it may be smaller than the requested ``iov_len`` if the
> +  requested address range in the tracee's or the tracer's space cannot
> +  be fully accessed).

I'd replace "success" with something like "some tags were copied:
``iov_len`` is updated to indicate the actual number of tags
transferred. This may be fewer than requested: [...]"

Can we get a short PEEKTAGS/POKETAGS for transient reasons (like minor
page faults)? i.e., should the caller attempt to retry, or is that a
stupid thing to do?

> +- ``-EPERM`` - the specified process cannot be traced.
> +- ``-EIO`` - the tracee's address range cannot be accessed (e.g. invalid
> +  address) and no tags copied. ``iov_len`` not updated.
> +- ``-EFAULT`` - fault on accessing the tracer's memory (``struct iovec``
> +  or ``iov_base`` buffer) and no tags copied. ``iov_len`` not updated.
> +
> +Example of correct usage
> +========================
> +
> +*MTE Example code*
> +
> +.. code-block:: c
> +
> +    /*
> +     * To be compiled with -march=armv8.5-a+memtag
> +     */
> +    #include <errno.h>
> +    #include <stdio.h>
> +    #include <stdlib.h>
> +    #include <unistd.h>
> +    #include <sys/auxv.h>
> +    #include <sys/mman.h>
> +    #include <sys/prctl.h>
> +
> +    /*
> +     * From arch/arm64/include/uapi/asm/hwcap.h
> +     */
> +    #define HWCAP2_MTE              (1 << 18)
> +
> +    /*
> +     * From arch/arm64/include/uapi/asm/mman.h
> +     */
> +    #define PROT_MTE                 0x20
> +
> +    /*
> +     * From include/uapi/linux/prctl.h
> +     */
> +    #define PR_SET_TAGGED_ADDR_CTRL 55
> +    #define PR_GET_TAGGED_ADDR_CTRL 56
> +    # define PR_TAGGED_ADDR_ENABLE  (1UL << 0)
> +    # define PR_MTE_TCF_SHIFT       1
> +    # define PR_MTE_TCF_NONE        (0UL << PR_MTE_TCF_SHIFT)
> +    # define PR_MTE_TCF_SYNC        (1UL << PR_MTE_TCF_SHIFT)
> +    # define PR_MTE_TCF_ASYNC       (2UL << PR_MTE_TCF_SHIFT)
> +    # define PR_MTE_TCF_MASK        (3UL << PR_MTE_TCF_SHIFT)
> +    # define PR_MTE_TAG_SHIFT       3
> +    # define PR_MTE_TAG_MASK        (0xffffUL << PR_MTE_TAG_SHIFT)
> +
> +    /*
> +     * Insert a random logical tag into the given pointer.
> +     */
> +    #define insert_random_tag(ptr) ({                       \
> +            __u64 __val;                                    \
> +            asm("irg %0, %1" : "=r" (__val) : "r" (ptr));   \
> +            __val;                                          \
> +    })
> +
> +    /*
> +     * Set the allocation tag on the destination address.
> +     */
> +    #define set_tag(tagged_addr) do {                                      \
> +            asm volatile("stg %0, [%0]" : : "r" (tagged_addr) : "memory"); \
> +    } while (0)
> +
> +    int main()
> +    {
> +            unsigned long *a;
> +            unsigned long page_sz = getpagesize();

Nit: obsolete in POSIX. Prefer sysconf(_SC_PAGESIZE).

> +            unsigned long hwcap2 = getauxval(AT_HWCAP2);
> +
> +            /* check if MTE is present */
> +            if (!(hwcap2 & HWCAP2_MTE))
> +                    return -1;

Nit: -1 isn't a valid exit code, so it's preferable to return 1 or
EXIT_FAILURE.

> +
> +            /*
> +             * Enable the tagged address ABI, synchronous MTE tag check faults and
> +             * allow all non-zero tags in the randomly generated set.
> +             */
> +            if (prctl(PR_SET_TAGGED_ADDR_CTRL,
> +                      PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC | (0xfffe << PR_MTE_TAG_SHIFT),
> +                      0, 0, 0)) {
> +                    perror("prctl() failed");
> +                    return -1;
> +            }
> +
> +            a = mmap(0, page_sz, PROT_READ | PROT_WRITE,
> +                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

Is this a valid assignment?

I can't remember whether C's "pointer values must be correctly aligned"
rule applies only to dereferences, or whether it applies to conversions
too. From memory I have a feeling that it does.

If so, the compiler could legitimately optimise the failure check away,
since MAP_FAILED is not correctly aligned for unsigned long.

> +            if (a == MAP_FAILED) {
> +                    perror("mmap() failed");
> +                    return -1;
> +            }
> +
> +            /*
> +             * Enable MTE on the above anonymous mmap. The flag could be passed
> +             * directly to mmap() and skip this step.
> +             */
> +            if (mprotect(a, page_sz, PROT_READ | PROT_WRITE | PROT_MTE)) {
> +                    perror("mprotect() failed");
> +                    return -1;
> +            }
> +
> +            /* access with the default tag (0) */
> +            a[0] = 1;
> +            a[1] = 2;
> +
> +            printf("a[0] = %lu a[1] = %lu\n", a[0], a[1]);
> +
> +            /* set the logical and allocation tags */
> +            a = (unsigned long *)insert_random_tag(a);
> +            set_tag(a);
> +
> +            printf("%p\n", a);
> +
> +            /* non-zero tag access */
> +            a[0] = 3;
> +            printf("a[0] = %lu a[1] = %lu\n", a[0], a[1]);
> +
> +            /*
> +             * If MTE is enabled correctly the next instruction will generate an
> +             * exception.
> +             */
> +            printf("Expecting SIGSEGV...\n");
> +            a[2] = 0xdead;
> +
> +            /* this should not be printed in the PR_MTE_TCF_SYNC mode */
> +            printf("...done\n");
> +
> +            return 0;
> +    }

Since this shouldn't happen, can we print an error and return nonzero?

[...]

Cheers
---Dave
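
The read-modify-write concern raised in this review (setting the TCF mode without destroying the PR_MTE_TAG_MASK field) can be sketched as a pure helper. This is only an illustration under stated assumptions: the constants are copied from the patch's example code, and `mte_ctrl_with_tcf()` is a hypothetical name, not something the patch provides.

```c
#include <assert.h>

/* Constants copied from the example code in the patch under review. */
#define PR_TAGGED_ADDR_ENABLE   (1UL << 0)
#define PR_MTE_TCF_SHIFT        1
#define PR_MTE_TCF_NONE         (0UL << PR_MTE_TCF_SHIFT)
#define PR_MTE_TCF_SYNC         (1UL << PR_MTE_TCF_SHIFT)
#define PR_MTE_TCF_ASYNC        (2UL << PR_MTE_TCF_SHIFT)
#define PR_MTE_TCF_MASK         (3UL << PR_MTE_TCF_SHIFT)
#define PR_MTE_TAG_SHIFT        3
#define PR_MTE_TAG_MASK         (0xffffUL << PR_MTE_TAG_SHIFT)

/*
 * Hypothetical helper (not part of the patch): return ctrl with only the
 * tag check fault mode replaced by tcf. The tag include mask and the
 * tagged address enable bit are left untouched.
 */
static unsigned long mte_ctrl_with_tcf(unsigned long ctrl, unsigned long tcf)
{
	/* tcf must be one of the PR_MTE_TCF_* values */
	assert((tcf & ~PR_MTE_TCF_MASK) == 0);
	return (ctrl & ~PR_MTE_TCF_MASK) | tcf;
}
```

On a system with this series applied, the value passed in would presumably come from `prctl(PR_GET_TAGGED_ADDR_CTRL, 0, 0, 0, 0)` and the result would be handed back through `prctl(PR_SET_TAGGED_ADDR_CTRL, ...)`; caching the control word in userspace avoids the extra system call.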
On Wed, Apr 29, 2020 at 05:47:05PM +0100, Dave P Martin wrote:
> On Tue, Apr 21, 2020 at 03:26:03PM +0100, Catalin Marinas wrote:
> > +Userspace Support
> > +=================
> > +
> > +When ``CONFIG_ARM64_MTE`` is selected and Memory Tagging Extension is
> > +supported by the hardware, the kernel advertises the feature to
> > +userspace via ``HWCAP2_MTE``.
> > +
> > +PROT_MTE
> > +--------
> > +
> > +To access the allocation tags, a user process must enable the Tagged
> > +memory attribute on an address range using a new ``prot`` flag for
> > +``mmap()`` and ``mprotect()``:
> > +
> > +``PROT_MTE`` - Pages allow access to the MTE allocation tags.
> > +
> > +The allocation tag is set to 0 when such pages are first mapped in the
> > +user address space and preserved on copy-on-write. ``MAP_SHARED`` is
> > +supported and the allocation tags can be shared between processes.
> > +
> > +**Note**: ``PROT_MTE`` is only supported on ``MAP_ANONYMOUS`` and
> > +RAM-based file mappings (``tmpfs``, ``memfd``). Passing it to other
> > +types of mapping will result in ``-EINVAL`` returned by these system
> > +calls.
> > +
> > +**Note**: The ``PROT_MTE`` flag (and corresponding memory type) cannot
> > +be cleared by ``mprotect()``.
>
> What enforces this? I don't have my head fully around the code yet.
>
> I'm wondering whether attempting to clear PROT_MTE should be reported as
> an error. Is there any rationale for not doing so?

A use-case is a JIT compiler where the memory is allocated by some
malloc() code with PROT_MTE set and passed down to a code generator
library which may not be MTE aware (and doesn't need to be, only tagged
ptr aware). Such library, once it generated the code, may do an
mprotect(PROT_READ|PROT_EXEC) without PROT_MTE. We didn't want to
inadvertently clear PROT_MTE, especially if the memory will be given
back to the original allocator (free) at some point.
Basically mprotect() may be done outside the heap allocator but it
should not interfere with allocator's decision to use MTE. For this
reason, I wouldn't report an error but silently ignore the lack of
PROT_MTE.

The way we handle this is by not including VM_MTE in VM_ARCH_CLEAR
(VM_MPX isn't either, though VM_SPARC_ADI is but when they added it, the
syscall ABI didn't even accept tagged pointers).

> > +Tag Check Faults
> > +----------------
> > +
> > +When ``PROT_MTE`` is enabled on an address range and a mismatch between
> > +the logical and allocation tags occurs on access, there are three
> > +configurable behaviours:
> > +
> > +- *Ignore* - This is the default mode. The CPU (and kernel) ignores the
> > +  tag check fault.
> > +
> > +- *Synchronous* - The kernel raises a ``SIGSEGV`` synchronously, with
> > +  ``.si_code = SEGV_MTESERR`` and ``.si_addr = <fault-address>``. The
> > +  memory access is not performed.
>
> Also say that in this case, if SIGSEGV is ignored or blocked by the
> offending thread then the containing process is terminated with a
> coredump (at least, that's what ought to happen).

Makes sense.

> > +
> > +- *Asynchronous* - The kernel raises a ``SIGSEGV``, in the current
> > +  thread, asynchronously following one or multiple tag check faults,
> > +  with ``.si_code = SEGV_MTEAERR`` and ``.si_addr = 0``.
>
> For "current thread": that's a kernel concept. For user-facing
> documentation, can we say "the offending thread" or similar?
>
> For clarity, it's worth saying that the faulting address is not
> reported. Or, we could be optimistic that someday this information will
> be available and say that si_addr is the faulting address if available,
> with 0 meaning the address is not available.
>
> Maybe (void *)-1 would be a better duff address, but I can't see it
> mattering much. If there's already precedent for si_addr==0 elsewhere,
> it makes sense to follow it.

At a quick grep, I can see a few instances on other architectures where
si_addr==0.
I'll add a comment here.

If the hardware gives us something in the future, it will likely be in a
separate register and we can present it as a new sigcontext structure.
In the meantime I'll add some text that the faulting address is
unknown.

> > +**Note**: There are no *match-all* logical tags available for user
> > +applications.
>
> This note seems misplaced.

This was in the context of tag checking. I'll move it further down when
talking about PSTATE.TCO.

> > +
> > +The user can select the above modes, per thread, using the
> > +``prctl(PR_SET_TAGGED_ADDR_CTRL, flags, 0, 0, 0)`` system call where
>
> PR_GET_TAGGED_ADDR_CTRL seems to be missing here.

Added.

> > +``flags`` contain one of the following values in the ``PR_MTE_TCF_MASK``
> > +bit-field:
> > +
> > +- ``PR_MTE_TCF_NONE`` - *Ignore* tag check faults
> > +- ``PR_MTE_TCF_SYNC`` - *Synchronous* tag check fault mode
> > +- ``PR_MTE_TCF_ASYNC`` - *Asynchronous* tag check fault mode
>
> Done naively, this will destroy the PR_MTE_TAG_MASK field. Is there a
> preferred way to change only parts of this control word? If the answer
> is "cache the value in userspace if you care about performance, or
> otherwise use PR_GET_TAGGED_ADDR_CTRL as part of a read-modify-write,"
> so be it.
>
> If we think this might be an issue for software, it might be worth
> splitting out separate prctls for each field.

We lack some feedback from user space people on how this prctl is going
to be used. I worked on the assumption that it is a one-off event during
libc setup, potentially driven by some environment variable (but that's
user's problem).

There were some suggestions that on an async SIGSEGV, the handler may
switch to synchronous mode. Since that's a rare event, a get/set
approach would be fine.

Anyway, with an additional argument to prctl (we have 3 spare), we could
do a set/clear mask approach. The current behaviour could be emulated
as:

  prctl(PR_SET_TAGGED_ADDR_CTRL, PR_MTE_bits, -1UL, 0, 0);

where -1 is the clear mask.
The mask can be 0 for the initial prctl() or we can say that if the
mask is non-zero, only the bits in the mask will be set.

If you want to only set the TCF bits:

  prctl(PR_SET_TAGGED_ADDR_CTRL, PR_MTE_TCF_SYNC, PR_MTE_TCF_MASK, 0, 0);

> > +Tag checking can also be disabled for a user thread by setting the
> > +``PSTATE.TCO`` bit with ``MSR TCO, #1``.
>
> Users should probably not touch this unless they know what they're
> doing -- should this flag ever be left set across function boundaries
> etc.?

We can't control function boundaries from the kernel anyway.

> What's it for? Temporarily masking MTE faults in critical sections?
> Is this self-synchronising... what happens to pending asynchronous
> faults? Are faults occurring while the flag is set pended or discarded?

Something like a garbage collector scanning the memory. Since we do not
allow tag 0 as a match-all, it needs a cheaper option than prctl().

> > +**Note**: Signal handlers are always invoked with ``PSTATE.TCO = 0``,
> > +irrespective of the interrupted context.
>
> Rationale? Do we have advice on what signal handlers should do?

Well, that's the default mode - tag check override = 0, it means that
tag checking takes place.

> Is PSTATE.TC0 restored by sigreturn?

s/TC0/TCO/

Yes, it is restored on sigreturn.

> > +**Note**: Kernel accesses to user memory (e.g. ``read()`` system call)
> > +are only checked if the current thread tag checking mode is
> > +PR_MTE_TCF_SYNC.
>
> Vague? Can we make a precise statement about when the kernel will and
> won't check such accesses? And aren't there limitations (like use of
> get_user_pages() etc.)?

We could make it slightly clearer by saying "kernel accesses to the user
address space".

> > +Excluding Tags in the ``IRG``, ``ADDG`` and ``SUBG`` instructions
> > +-----------------------------------------------------------------
> > +
> > +The architecture allows excluding certain tags to be randomly generated
> > +via the ``GCR_EL1.Exclude`` register bit-field.
> > +By default, Linux
>
> Can we have a separate section on what execve() and fork()/clone() do
> to the MTE controls and PSTATE.TCO? "By default" could mean a variety
> of things, and I'm not sure we cover everything.

Good point. I'll add a note on initial state for processes and threads.

> Is PROT_MTE ever set on the initial pages mapped by execve()?

No. There were discussions about mapping the initial stack with PROT_MTE
based on some ELF note but it can also be done in userspace with
mprotect(). I think we concluded that the .data/.bss sections will be
untagged.

> > +excludes all tags other than 0. A user thread can enable specific tags
> > +in the randomly generated set using the ``prctl(PR_SET_TAGGED_ADDR_CTRL,
> > +flags, 0, 0, 0)`` system call where ``flags`` contains the tags bitmap
> > +in the ``PR_MTE_TAG_MASK`` bit-field.
> > +
> > +**Note**: The hardware uses an exclude mask but the ``prctl()``
> > +interface provides an include mask. An include mask of ``0`` (exclusion
> > +mask ``0xffff``) results in the CPU always generating tag ``0``.
>
> Is there no way to make this default to 1 rather than having a magic
> meaning for 0?

We follow the hardware behaviour where 0xffff and 0xfffe give the same
result.

> > +The ``ptrace()`` interface
> > +--------------------------
> > +
> > +``PTRACE_PEEKMTETAGS`` and ``PTRACE_POKEMTETAGS`` allow a tracer to read
> > +the tags from or set the tags to a tracee's address space. The
> > +``ptrace()`` syscall is invoked as ``ptrace(request, pid, addr, data)``
> > +where:
> > +
> > +- ``request`` - one of ``PTRACE_PEEKMTETAGS`` or ``PTRACE_POKEMTETAGS``.
> > +- ``pid`` - the tracee's PID.
> > +- ``addr`` - address in the tracee's address space.
>
> What if addr is not 16-byte aligned? Is this considered valid use?

Yes, I don't think we should impose a restriction here. Each address in
a 16-byte range has the same (shared) tag.
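
The point about every address in a 16-byte range sharing one tag can be sketched with a few pure helpers. This is an illustration only, assuming the 16-byte granule size stated in the patch text; the helper names are hypothetical and do not appear in the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes of memory covered by one allocation tag, per the patch text. */
#define MTE_GRANULE_SIZE 16UL

/* Hypothetical helper: start of the 16-byte granule containing addr. */
static uintptr_t mte_granule_base(uintptr_t addr)
{
	return addr & ~(uintptr_t)(MTE_GRANULE_SIZE - 1);
}

/* Hypothetical helper: nonzero if a and b share one allocation tag. */
static int mte_same_granule(uintptr_t a, uintptr_t b)
{
	return mte_granule_base(a) == mte_granule_base(b);
}

/*
 * Hypothetical helper: bytes of tracee memory described by ntags tags
 * in the one-tag-per-byte buffer format (16 * iov_len in the review).
 */
static size_t mte_tags_to_bytes(size_t ntags)
{
	assert(ntags <= SIZE_MAX / MTE_GRANULE_SIZE); /* no overflow */
	return ntags * MTE_GRANULE_SIZE;
}
```

Under this model an unaligned ``addr`` simply names the granule it falls in, which is why no alignment restriction is needed.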
> > +- ``data`` - pointer to a ``struct iovec`` where ``iov_base`` points to
> > +  a buffer of ``iov_len`` length in the tracer's address space.
>
> What's the data format for the copied tags?

I could state that the tags are placed in the lower 4 bits of the byte
with the upper 4 bits set to 0.

> > +The tags in the tracer's ``iov_base`` buffer are represented as one tag
> > +per byte and correspond to a 16-byte MTE tag granule in the tracee's
> > +address space.
>
> We could say that the whole operation accesses the tags for 16 * iov_len
> bytes of the tracee's address space. Maybe superfluous though.
>
> > +
> > +``ptrace()`` return value:
> > +
> > +- 0 - success, the tracer's ``iov_len`` was updated to the number of
> > +  tags copied (it may be smaller than the requested ``iov_len`` if the
> > +  requested address range in the tracee's or the tracer's space cannot
> > +  be fully accessed).
>
> I'd replace "success" with something like "some tags were copied:
> ``iov_len`` is updated to indicate the actual number of tags
> transferred. This may be fewer than requested: [...]"
>
> Can we get a short PEEKTAGS/POKETAGS for transient reasons (like minor
> page faults)? i.e., should the caller attempt to retry, or is that a
> stupid thing to do?

I initially thought it should retry but managed to get the interface so
that no retries are needed. If fewer tags were transferred, it's for a
good reason (e.g. permission fault).

[...]

> > +            a = mmap(0, page_sz, PROT_READ | PROT_WRITE,
> > +                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>
> Is this a valid assignment?
>
> I can't remember whether C's "pointer values must be correctly aligned"
> rule applies only to dereferences, or whether it applies to conversions
> too. From memory I have a feeling that it does.
>
> If so, the compiler could legitimately optimise the failure check away,
> since MAP_FAILED is not correctly aligned for unsigned long.

I'm not going to dig into standards ;). I can change this to an unsigned
char *.
> > +            printf("Expecting SIGSEGV...\n");
> > +            a[2] = 0xdead;
> > +
> > +            /* this should not be printed in the PR_MTE_TCF_SYNC mode */
> > +            printf("...done\n");
> > +
> > +            return 0;
> > +    }
>
> Since this shouldn't happen, can we print an error and return nonzero?

Fair enough.

I also agree with the other points you raised but to which I haven't
explicitly commented.

Thanks for the review, really useful.
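
The include/exclude mask relationship discussed in this reply can be sketched as follows. It is a model of the behaviour as described in the thread (the prctl() interface takes an include mask, the hardware takes an exclude mask, and excluding every tag behaves like allowing only tag 0), not the kernel's implementation; the helper names are hypothetical.

```c
#include <assert.h>

/* One bit per 4-bit tag value, so 16 bits in total. */
#define MTE_TAG_BITS_MASK 0xffffUL

/*
 * Hypothetical helper: convert a prctl()-style include mask into a
 * GCR_EL1.Exclude-style exclude mask.
 */
static unsigned long mte_include_to_exclude(unsigned long include)
{
	assert((include & ~MTE_TAG_BITS_MASK) == 0);
	return ~include & MTE_TAG_BITS_MASK;
}

/*
 * Hypothetical helper: the set of tags that can actually be generated.
 * As described in the thread, an empty include mask (exclude 0xffff)
 * gives the same result as including only tag 0 (exclude 0xfffe).
 */
static unsigned long mte_effective_include(unsigned long include)
{
	assert((include & ~MTE_TAG_BITS_MASK) == 0);
	return include ? include : 1UL;
}
```

For instance, the patch's example passes an include mask of 0xfffe, which under this model corresponds to an exclude mask of 0x0001 (only tag 0 excluded).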
On Thu, Apr 30, 2020 at 05:23:17PM +0100, Catalin Marinas wrote:
> On Wed, Apr 29, 2020 at 05:47:05PM +0100, Dave P Martin wrote:
> > On Tue, Apr 21, 2020 at 03:26:03PM +0100, Catalin Marinas wrote:
> > > +Userspace Support
> > > +=================
> > > +
> > > +When ``CONFIG_ARM64_MTE`` is selected and Memory Tagging Extension is
> > > +supported by the hardware, the kernel advertises the feature to
> > > +userspace via ``HWCAP2_MTE``.
> > > +
> > > +PROT_MTE
> > > +--------
> > > +
> > > +To access the allocation tags, a user process must enable the Tagged
> > > +memory attribute on an address range using a new ``prot`` flag for
> > > +``mmap()`` and ``mprotect()``:
> > > +
> > > +``PROT_MTE`` - Pages allow access to the MTE allocation tags.
> > > +
> > > +The allocation tag is set to 0 when such pages are first mapped in the
> > > +user address space and preserved on copy-on-write. ``MAP_SHARED`` is
> > > +supported and the allocation tags can be shared between processes.
> > > +
> > > +**Note**: ``PROT_MTE`` is only supported on ``MAP_ANONYMOUS`` and
> > > +RAM-based file mappings (``tmpfs``, ``memfd``). Passing it to other
> > > +types of mapping will result in ``-EINVAL`` returned by these system
> > > +calls.
> > > +
> > > +**Note**: The ``PROT_MTE`` flag (and corresponding memory type) cannot
> > > +be cleared by ``mprotect()``.
> >
> > What enforces this? I don't have my head fully around the code yet.
> >
> > I'm wondering whether attempting to clear PROT_MTE should be reported as
> > an error. Is there any rationale for not doing so?
>
> A use-case is a JIT compiler where the memory is allocated by some
> malloc() code with PROT_MTE set and passed down to a code generator
> library which may not be MTE aware (and doesn't need to be, only tagged
> ptr aware). Such library, once it generated the code, may do an
> mprotect(PROT_READ|PROT_EXEC) without PROT_MTE.
> We didn't want to inadvertently clear PROT_MTE, especially if the memory
> will be given back to the original allocator (free) at some point.
>
> Basically mprotect() may be done outside the heap allocator but it
> should not interfere with allocator's decision to use MTE. For this
> reason, I wouldn't report an error but silently ignore the lack of
> PROT_MTE.
>
> The way we handle this is by not including VM_MTE in VM_ARCH_CLEAR
> (VM_MPX isn't either, though VM_SPARC_ADI is but when they added it, the
> syscall ABI didn't even accept tagged pointers).

OK, I think this makes sense.

For BTI, I think mprotect() will clear PROT_BTI unless it's included in
prot, but that's a bit different: PROT_BTI relates to the memory
contents (i.e., it's BTI-aware code), where PROT_MTE is a property of
the memory itself.

> > > +Tag Check Faults
> > > +----------------
> > > +
> > > +When ``PROT_MTE`` is enabled on an address range and a mismatch between
> > > +the logical and allocation tags occurs on access, there are three
> > > +configurable behaviours:
> > > +
> > > +- *Ignore* - This is the default mode. The CPU (and kernel) ignores the
> > > +  tag check fault.
> > > +
> > > +- *Synchronous* - The kernel raises a ``SIGSEGV`` synchronously, with
> > > +  ``.si_code = SEGV_MTESERR`` and ``.si_addr = <fault-address>``. The
> > > +  memory access is not performed.
> >
> > Also say that in this case, if SIGSEGV is ignored or blocked by the
> > offending thread then the containing process is terminated with a
> > coredump (at least, that's what ought to happen).
>
> Makes sense.
>
> > > +
> > > +- *Asynchronous* - The kernel raises a ``SIGSEGV``, in the current
> > > +  thread, asynchronously following one or multiple tag check faults,
> > > +  with ``.si_code = SEGV_MTEAERR`` and ``.si_addr = 0``.
> >
> > For "current thread": that's a kernel concept. For user-facing
> > documentation, can we say "the offending thread" or similar?
> > > > For clarity, it's worth saying that the faulting address is not > > reported. Or, we could be optimistic that someday this information will > > be available and say that si_addr is the faulting address if available, > > with 0 meaning the address is not available. > > > > Maybe (void *)-1 would be better duff address, but I can't see it > > mattering much. If there's already precedent for si_addr==0 elsewhere, > > it makes sense to follow it. > > At a quick grep, I can see a few instances on other architectures where > si_addr==0. I'll add a comment here. OK, cool Except: what if we're in PR_MTE_TCF_ASYNC mode. If the SIGSEGV handler triggers an asynchronous MTE fault itself, we could then get into a spin. Hmm. I take it we drain any pending MTE faults when crossing EL boundaries? In that case, an asynchronous MTE fault pending at sigreturn must have been caused by the signal handler. We could make that particular case of MTE_AERR a force_sig. > If the hardware gives us something in the future, it will likely be in a > separate register and we can present it as a new sigcontext structure. > In the meantime I'll add a some text that the faulting address is > unknown. I guess we can decide that later. I think that if we can put something sensible in si_addr we should do so, but that doesn't stop us also putting more detailed info somewhere else. > > > > +**Note**: There are no *match-all* logical tags available for user > > > +applications. > > > > This note seems misplaced. > > This was in the context of tag checking. I'll move it further down when > talking about PSTATE.TCO. OK > > > + > > > +The user can select the above modes, per thread, using the > > > +``prctl(PR_SET_TAGGED_ADDR_CTRL, flags, 0, 0, 0)`` system call where > > > > PR_GET_TAGGED_ADDR_CTRL seems to be missing here. > > Added. 
> > > > +``flags`` contain one of the following values in the ``PR_MTE_TCF_MASK`` > > > +bit-field: > > > + > > > +- ``PR_MTE_TCF_NONE`` - *Ignore* tag check faults > > > +- ``PR_MTE_TCF_SYNC`` - *Synchronous* tag check fault mode > > > +- ``PR_MTE_TCF_ASYNC`` - *Asynchronous* tag check fault mode > > > > Done naively, this will destroy the PR_MTE_TAG_MASK field. Is there a > > preferred way to change only parts of this control word? If the answer > > is "cache the value in userspace if you care about performance, or > > otherwise use PR_GET_TAGGED_ADDR_CTRL as part of a read-modify-write," > > so be it. > > > > If we think this might be an issue for software, it might be worth > > splitting out separate prctls for each field.) > > We lack some feedback from user space people on how this prctl is going > to be used. I worked on the assumption that it is a one-off event during > libc setup, potentially driven by some environment variable (but that's > user's problem). > > There were some suggestions that on an async SIGSEGV, the handler may > switch to synchronous mode. Since that's a rare event, a get/set > approach would be fine. > > Anyway, with an additional argument to prctl (we have 3 spare), we could > do a set/clear mask approach. The current behaviour could be emulated > as: > > prctl(PR_SET_TAGGED_ADDR_CTRL, PR_MTE_bits, -1UL, 0, 0); > > where -1 is the clear mask. The mask can be 0 for the initial prctl() or > we can say that if the mask is non-zero, only the bits in the mask will > be set. > > If you want to only set the TCF bits: > > prctl(PR_SET_TAGGED_ADDR_CTRL, PR_MTE_TCF_SYNC, PR_MTE_TCF_MASK, 0, 0); If this isn't critical path, I guess it's not a big deal either way. If we make that mask argument an mask of bits _not_ to change than we can add it as a backwards compatible extension later on without having to define it now. As you suggest, it may never matter. So, I don't object to this staying as-is. 
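The read-modify-write fallback mentioned above can be sketched as a pure helper. This is only an illustration: the `EX_`-prefixed names and `ex_ctrl_update()` are hypothetical, the field layout is assumed to match the example code in this series (TCF in bits 2:1, tag mask in bits 18:3), and the masked `prctl()` form is just a proposal in this thread, not an existing interface.

```c
#include <assert.h>

/* Assumed layout of the control word; check against the real headers. */
#define EX_MTE_TCF_SHIFT  1
#define EX_MTE_TCF_SYNC   (1UL << EX_MTE_TCF_SHIFT)
#define EX_MTE_TCF_ASYNC  (2UL << EX_MTE_TCF_SHIFT)
#define EX_MTE_TCF_MASK   (3UL << EX_MTE_TCF_SHIFT)
#define EX_MTE_TAG_SHIFT  3
#define EX_MTE_TAG_MASK   (0xffffUL << EX_MTE_TAG_SHIFT)

/*
 * Hypothetical helper: return 'old' with only the bits selected by 'mask'
 * replaced by 'val'. Every bit outside 'mask' is preserved.
 */
static unsigned long ex_ctrl_update(unsigned long old, unsigned long val,
                                    unsigned long mask)
{
    assert((val & ~mask) == 0);  /* val must stay inside the field */
    return (old & ~mask) | val;
}
```

A caller would feed it the (error-checked) result of `prctl(PR_GET_TAGGED_ADDR_CTRL, 0, 0, 0, 0)` and pass the return value to `PR_SET_TAGGED_ADDR_CTRL`, so that switching the TCF mode leaves the tag mask field untouched.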
> > > +Tag checking can also be disabled for a user thread by setting the > > > +``PSTATE.TCO`` bit with ``MSR TCO, #1``. > > > > Users should probably not touch this unless they know what they're > > doing -- should this flag ever be left set across function boundaries > > etc.? > > We can't control function boundaries from the kernel anyway. > > > What's it for? Temporarily masking MTE faults in critical sections? > > Is this self-synchronising... what happens to pending asynchronous > > faults? Are faults occurring while the flag is set pended or discarded? > > Something like a garbage collector scanning the memory. Since we do not > allow tag 0 as a match-all, it needs a cheaper option than prctl(). > > > > +**Note**: Signal handlers are always invoked with ``PSTATE.TCO = 0``, > > > +irrespective of the interrupted context. > > > > Rationale? Do we have advice on what signal handlers should do? > > Well, that's the default mode - tag check override = 0, it means that > tag checking takes place. Sort of implies that a SIGSEGV handler must be careful not to trigger any more faults. But I guess that's nothing new. > > > Is PSTATE.TC0 restored by sigreturn? > > s/TC0/TCO/ > > Yes, it is restored on sigreturn. OK. I think it's worth mentioning (does no harm, anyway). > > > > +**Note**: Kernel accesses to user memory (e.g. ``read()`` system call) > > > +are only checked if the current thread tag checking mode is > > > +PR_MTE_TCF_SYNC. > > > > Vague? Can we make a precise statement about when the kernel will and > > won't check such accesses? And aren't there limitations (like use of > > get_user_pages() etc.)? > > We could make it slightly clearer by say "kernel accesses to the user > address space". That's not the ambiguity. My question is 1) Does the kernel guarantee not to check tags on kernel accesses to user memory without PR_MTE_TCF_SYNC? 2) Does the kernel guarantee to check tags on kernel accesses to user memory with PR_MTE_TCF_SYNC? 
In practice, this note sounds more like a kernel implementation detail
than advice to userspace.

Would it make sense to say something like:

 * PR_MTE_TCF_NONE: the kernel does not check tags for kernel accesses
   to user memory done by syscalls in the thread.

 * PR_MTE_TCF_ASYNC: the kernel may check some tags for kernel accesses
   to user memory done by syscalls. (Should we guarantee that such
   faults are reported synchronously on syscall exit? In practice I
   think they are. Should we use SEGV_MTESERR in this case? Perhaps
   it's not worth making this a special case.)

 * PR_MTE_TCF_SYNC: the kernel makes best efforts to check tags for
   kernel accesses to user memory done by the syscalls, but does not
   guarantee to check everything (or does it? I thought we can't really
   do that for some odd cases...)

> > > +Excluding Tags in the ``IRG``, ``ADDG`` and ``SUBG`` instructions
> > > +-----------------------------------------------------------------
> > > +
> > > +The architecture allows excluding certain tags to be randomly generated
> > > +via the ``GCR_EL1.Exclude`` register bit-field. By default, Linux
> >
> > Can we have a separate section on what execve() and fork()/clone() do
> > to the MTE controls and PSTATE.TCO? "By default" could mean a variety
> > of things, and I'm not sure we cover everything.
>
> Good point. I'll add a note on initial state for processes and threads.
>
> > Is PROT_MTE ever set on the initial pages mapped by execve()?
>
> No. There were discussions about mapping the initial stack with PROT_MTE
> based on some ELF note but it can also be done in userspace with
> mprotect(). I think we concluded that the .data/.bss sections will be
> untagged.

Yes, I recall. Sounds fine: probably worth mentioning here that
PROT_MTE is never set on the exec mappings for now.

> > > +excludes all tags other than 0.
A user thread can enable specific tags > > > +in the randomly generated set using the ``prctl(PR_SET_TAGGED_ADDR_CTRL, > > > +flags, 0, 0, 0)`` system call where ``flags`` contains the tags bitmap > > > +in the ``PR_MTE_TAG_MASK`` bit-field. > > > + > > > +**Note**: The hardware uses an exclude mask but the ``prctl()`` > > > +interface provides an include mask. An include mask of ``0`` (exclusion > > > +mask ``0xffff``) results in the CPU always generating tag ``0``. > > > > Is there no way to make this default to 1 rather than having a magic > > meaning for 0? > > We follow the hardware behaviour where 0xffff and 0xfffe give the same > result. Exposing this through a purely software interface seems a bit odd: because the exclude mask is privileged-access-only, the architecture could amend it to assign a different meaning to 0xffff, providing this was an opt-in change. Then we'd have to make a mess here. Can't we just forbid the nonsense value 0 here, or are there other reasons why that's problematic? I presume the architecture defines a meaning for 0 to avoid making it UNPREDICTABLE etc., not because this is deemed useful. > > > +The ``ptrace()`` interface > > > +-------------------------- > > > + > > > +``PTRACE_PEEKMTETAGS`` and ``PTRACE_POKEMTETAGS`` allow a tracer to read > > > +the tags from or set the tags to a tracee's address space. The > > > +``ptrace()`` syscall is invoked as ``ptrace(request, pid, addr, data)`` > > > +where: > > > + > > > +- ``request`` - one of ``PTRACE_PEEKMTETAGS`` or ``PTRACE_PEEKMTETAGS``. > > > +- ``pid`` - the tracee's PID. > > > +- ``addr`` - address in the tracee's address space. > > > > What if addr is not 16-byte aligned? Is this considered valid use? > > Yes, I don't think we should impose a restriction here. Each address in > a 16-byte range has the same (shared) tag. OK. 
We might want to clarify what this means when addr is misaligned: we do not colour the 16 bytes starting at addr, but the reader might assume that's what happens. > > > +- ``data`` - pointer to a ``struct iovec`` where ``iov_base`` points to > > > + a buffer of ``iov_len`` length in the tracer's address space. > > > > What's the data format for the copied tags? > > I could state that the tag are placed in the lower 4-bit of the byte > with the upper 4-bit set to 0. What if it's not? I didn't find this in the architecture spec, but I didn't look very hard so far... > > > +The tags in the tracer's ``iov_base`` buffer are represented as one tag > > > +per byte and correspond to a 16-byte MTE tag granule in the tracee's > > > +address space. > > > > We could say that the whole operation accesses the tags for 16 * iov_len > > bytes of the tracee's address space. Maybe superfluous though. > > > > > + > > > +``ptrace()`` return value: > > > + > > > +- 0 - success, the tracer's ``iov_len`` was updated to the number of > > > + tags copied (it may be smaller than the requested ``iov_len`` if the > > > + requested address range in the tracee's or the tracer's space cannot > > > + be fully accessed). > > > > I'd replace "success" with something like "some tags were copied: > > ``iov_len`` is updated to indicate the actual number of tags > > transferred. This may be fewer than requested: [...]" > > > > Can we get a short PEEKTAGS/POKETAGS for transient reasons (like minor > > page faults)? i.e., should the caller attempt to retry, or is that a > > a stupid thing to do? > > I initially thought it should retry but managed to get the interface so > that no retries are needed. If fewer tags were transferred, it's for a > good reason (e.g. permission fault). OK, we should mention that here then. Software that retries things that can't make progress can get stuck in a loop (or at least waste cycles). 
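On the alignment point above, the granule arithmetic can be sketched as follows. This is a minimal illustration, not kernel code: the `ex_` names are hypothetical, the 16-byte granule size comes from the patch text, and the tag byte format (tag in the low 4 bits, upper 4 bits zero) is only what was proposed in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_MTE_GRANULE_SIZE 16u  /* bytes covered by one allocation tag */

/* Start of the 16-byte granule that contains 'addr'. */
static uintptr_t ex_granule_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(EX_MTE_GRANULE_SIZE - 1);
}

/* Number of granules (hence tags) touched by [addr, addr + len). */
static size_t ex_granules_spanned(uintptr_t addr, size_t len)
{
    uintptr_t first, last;

    if (len == 0)
        return 0;
    assert(addr + (len - 1) >= addr);  /* range must not wrap around */
    first = ex_granule_base(addr);
    last = ex_granule_base(addr + (len - 1));
    return (size_t)((last - first) / EX_MTE_GRANULE_SIZE) + 1;
}

/* Tag byte format as proposed in the thread: bits 7:4 must be zero. */
static int ex_tag_byte_is_valid(unsigned char b)
{
    return (b & 0xf0) == 0;
}
```

A misaligned 16-byte range therefore touches two granules, which is the case a reader might not expect: the tag applies to the granule containing addr, not to the 16 bytes starting at addr.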
> > > +    a = mmap(0, page_sz, PROT_READ | PROT_WRITE,
> > > +             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> >
> > Is this a valid assignment?
> >
> > I can't remember whether C's "pointer values must be correctly aligned"
> > rule applies only to dereferences, or whether it applies to conversions
> > too. From memory I have a feeling that it does.
> >
> > If so, the compiler could legitimately optimise the failure check away,
> > since MAP_FAILED is not correctly aligned for unsigned long.
>
> I'm not going to dig into standards ;). I can change this to an unsigned
> char *.

Sure, I guess that solves the problem.

Something like

    void *p;
    unsigned long *a;

    p = mmap( ... );
    if (p == MAP_FAILED) {
        /* barf */
    }
    a = p;

might provide a clue that care is needed, but it's not essential.

> > > +    printf("Expecting SIGSEGV...\n");
> > > +    a[2] = 0xdead;
> > > +
> > > +    /* this should not be printed in the PR_MTE_TCF_SYNC mode */
> > > +    printf("...done\n");
> > > +
> > > +    return 0;
> > > + }
> >
> > Since this shouldn't happen, can we print an error and return nonzero?
>
> Fair enough. I also agree with the other points you raised but to which
> I haven't explicitly commented.
>
> Thanks for the review, really useful.

Np

Cheers
---Dave
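A compilable version of the pattern suggested above might look like this. It is a sketch under assumptions: `ex_map_page()` is a hypothetical helper, and the caller decides whether to add `PROT_MTE` to `prot` (it is left out here so the code also builds and runs where MTE is unavailable).

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* for MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map one anonymous page with protection 'prot'.
 * Returns the page as unsigned long *, or NULL on failure.
 */
static unsigned long *ex_map_page(int prot)
{
    long page_sz = sysconf(_SC_PAGESIZE);
    void *p;

    if (page_sz <= 0)
        return NULL;

    p = mmap(NULL, (size_t)page_sz, prot,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)    /* compare while it is still a void * */
        return NULL;

    /* A successful mmap() result is page aligned, so this holds. */
    assert((uintptr_t)p % sizeof(unsigned long) == 0);
    return p;
}
```

Keeping the `MAP_FAILED` comparison on the `void *` means the misaligned sentinel value is never converted to `unsigned long *`, which sidesteps the question raised above.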
The 04/21/2020 15:26, Catalin Marinas wrote: > diff --git a/Documentation/arm64/memory-tagging-extension.rst b/Documentation/arm64/memory-tagging-extension.rst > new file mode 100644 > index 000000000000..f82dfbd70061 > --- /dev/null > +++ b/Documentation/arm64/memory-tagging-extension.rst > @@ -0,0 +1,260 @@ > +=============================================== > +Memory Tagging Extension (MTE) in AArch64 Linux > +=============================================== > + > +Authors: Vincenzo Frascino <vincenzo.frascino@arm.com> > + Catalin Marinas <catalin.marinas@arm.com> > + > +Date: 2020-02-25 > + > +This document describes the provision of the Memory Tagging Extension > +functionality in AArch64 Linux. > + > +Introduction > +============ > + > +ARMv8.5 based processors introduce the Memory Tagging Extension (MTE) > +feature. MTE is built on top of the ARMv8.0 virtual address tagging TBI > +(Top Byte Ignore) feature and allows software to access a 4-bit > +allocation tag for each 16-byte granule in the physical address space. > +Such memory range must be mapped with the Normal-Tagged memory > +attribute. A logical tag is derived from bits 59-56 of the virtual > +address used for the memory access. A CPU with MTE enabled will compare > +the logical tag against the allocation tag and potentially raise an > +exception on mismatch, subject to system registers configuration. > + > +Userspace Support > +================= > + > +When ``CONFIG_ARM64_MTE`` is selected and Memory Tagging Extension is > +supported by the hardware, the kernel advertises the feature to > +userspace via ``HWCAP2_MTE``. > + > +PROT_MTE > +-------- > + > +To access the allocation tags, a user process must enable the Tagged > +memory attribute on an address range using a new ``prot`` flag for > +``mmap()`` and ``mprotect()``: > + > +``PROT_MTE`` - Pages allow access to the MTE allocation tags. 
> + > +The allocation tag is set to 0 when such pages are first mapped in the > +user address space and preserved on copy-on-write. ``MAP_SHARED`` is > +supported and the allocation tags can be shared between processes. > + > +**Note**: ``PROT_MTE`` is only supported on ``MAP_ANONYMOUS`` and > +RAM-based file mappings (``tmpfs``, ``memfd``). Passing it to other > +types of mapping will result in ``-EINVAL`` returned by these system > +calls. > + > +**Note**: The ``PROT_MTE`` flag (and corresponding memory type) cannot > +be cleared by ``mprotect()``. i think there are some non-obvious madvise operations that may be worth documenting too for mte specific semantics. e.g. MADV_DONTNEED or MADV_FREE can presumably drop tags which means that existing pointers can no longer write to the memory which is a change of behaviour compared to the non-mte case. (affects most malloc implementations that will have to deal with this when implementing heap coloring) there might be other similar problems like MADV_WIPEONFORK that wont work as currently expected when mte is enabled. if such behaviour changes cause serious problems to existing software there may be a need to have a way to opt out from these changes (e.g. MADV_ flag variant that only affects the memory content but not the tags) or to make that the default behaviour. (but i can't tell how widely these are used in ways that can be expected to work with PROT_MTE) > +Tag Check Faults > +---------------- > + > +When ``PROT_MTE`` is enabled on an address range and a mismatch between > +the logical and allocation tags occurs on access, there are three > +configurable behaviours: > + > +- *Ignore* - This is the default mode. The CPU (and kernel) ignores the > + tag check fault. > + > +- *Synchronous* - The kernel raises a ``SIGSEGV`` synchronously, with > + ``.si_code = SEGV_MTESERR`` and ``.si_addr = <fault-address>``. The > + memory access is not performed. 
> + > +- *Asynchronous* - The kernel raises a ``SIGSEGV``, in the current > + thread, asynchronously following one or multiple tag check faults, > + with ``.si_code = SEGV_MTEAERR`` and ``.si_addr = 0``. > + > +**Note**: There are no *match-all* logical tags available for user > +applications. > + > +The user can select the above modes, per thread, using the > +``prctl(PR_SET_TAGGED_ADDR_CTRL, flags, 0, 0, 0)`` system call where > +``flags`` contain one of the following values in the ``PR_MTE_TCF_MASK`` > +bit-field: > + > +- ``PR_MTE_TCF_NONE`` - *Ignore* tag check faults > +- ``PR_MTE_TCF_SYNC`` - *Synchronous* tag check fault mode > +- ``PR_MTE_TCF_ASYNC`` - *Asynchronous* tag check fault mode > + > +Tag checking can also be disabled for a user thread by setting the > +``PSTATE.TCO`` bit with ``MSR TCO, #1``. > + > +**Note**: Signal handlers are always invoked with ``PSTATE.TCO = 0``, > +irrespective of the interrupted context. > + > +**Note**: Kernel accesses to user memory (e.g. ``read()`` system call) > +are only checked if the current thread tag checking mode is > +PR_MTE_TCF_SYNC.
On Tue, May 05, 2020 at 11:32:33AM +0100, Szabolcs Nagy wrote: > The 04/21/2020 15:26, Catalin Marinas wrote: > > +PROT_MTE > > +-------- > > + > > +To access the allocation tags, a user process must enable the Tagged > > +memory attribute on an address range using a new ``prot`` flag for > > +``mmap()`` and ``mprotect()``: > > + > > +``PROT_MTE`` - Pages allow access to the MTE allocation tags. > > + > > +The allocation tag is set to 0 when such pages are first mapped in the > > +user address space and preserved on copy-on-write. ``MAP_SHARED`` is > > +supported and the allocation tags can be shared between processes. > > + > > +**Note**: ``PROT_MTE`` is only supported on ``MAP_ANONYMOUS`` and > > +RAM-based file mappings (``tmpfs``, ``memfd``). Passing it to other > > +types of mapping will result in ``-EINVAL`` returned by these system > > +calls. > > + > > +**Note**: The ``PROT_MTE`` flag (and corresponding memory type) cannot > > +be cleared by ``mprotect()``. > > i think there are some non-obvious madvise operations that may > be worth documenting too for mte specific semantics. > > e.g. MADV_DONTNEED or MADV_FREE can presumably drop tags which > means that existing pointers can no longer write to the memory > which is a change of behaviour compared to the non-mte case. > (affects most malloc implementations that will have to deal > with this when implementing heap coloring) there might be other > similar problems like MADV_WIPEONFORK that wont work as > currently expected when mte is enabled. > > if such behaviour changes cause serious problems to existing > software there may be a need to have a way to opt out from > these changes (e.g. MADV_ flag variant that only affects the > memory content but not the tags) or to make that the default > behaviour. (but i can't tell how widely these are used in > ways that can be expected to work with PROT_MTE) Thanks. I'll document this behaviour as it may not be obvious. 
For the record (as we discussed this internally), I think the kernel behaviour is entirely expected. On mmap(PROT_MTE), the kernel would return pages with tags set to 0. On madvise(MADV_DONTNEED), the kernel may free the pages but map them back on access using the same conditions they were previously given to the user, i.e. tags set to 0. There isn't any expectation for the kernel to preserve the tags of MADV_DONTNEED/FREE pages (preserving them would defeat the point of dontneed/free).
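The data side of this behaviour can be observed without MTE hardware. The sketch below (hypothetical helper name, Linux-specific, private anonymous mapping assumed) only shows that the page contents come back as zero after `MADV_DONTNEED`. It does not exercise tags. That tags are reset to 0 in the same way under `PROT_MTE` is the behaviour described above, not something this code verifies.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* for madvise() and MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical check: returns 1 if a private anonymous page reads back
 * as zero after MADV_DONTNEED, 0 if it does not, -1 on setup failure.
 */
static int ex_dontneed_resets_page(void)
{
    long page_sz = sysconf(_SC_PAGESIZE);
    void *m;
    unsigned char *p;
    int zeroed;

    assert(page_sz > 0);
    m = mmap(NULL, (size_t)page_sz, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED)
        return -1;
    p = m;

    p[0] = 0xaa;                        /* dirty the page */
    if (madvise(p, (size_t)page_sz, MADV_DONTNEED) != 0) {
        munmap(p, (size_t)page_sz);
        return -1;
    }
    zeroed = (p[0] == 0);               /* next access faults in a fresh page */
    munmap(p, (size_t)page_sz);
    return zeroed;
}
```

An allocator that colours its heap would therefore need to re-tag any range it has handed to `MADV_DONTNEED` before reusing it with non-zero logical tags.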
On Mon, May 04, 2020 at 05:46:17PM +0100, Dave P Martin wrote: > On Thu, Apr 30, 2020 at 05:23:17PM +0100, Catalin Marinas wrote: > > On Wed, Apr 29, 2020 at 05:47:05PM +0100, Dave P Martin wrote: > > > On Tue, Apr 21, 2020 at 03:26:03PM +0100, Catalin Marinas wrote: > > > > +- *Asynchronous* - The kernel raises a ``SIGSEGV``, in the current > > > > + thread, asynchronously following one or multiple tag check faults, > > > > + with ``.si_code = SEGV_MTEAERR`` and ``.si_addr = 0``. > > > > > > For "current thread": that's a kernel concept. For user-facing > > > documentation, can we say "the offending thread" or similar? > > > > > > For clarity, it's worth saying that the faulting address is not > > > reported. Or, we could be optimistic that someday this information will > > > be available and say that si_addr is the faulting address if available, > > > with 0 meaning the address is not available. > > > > > > Maybe (void *)-1 would be better duff address, but I can't see it > > > mattering much. If there's already precedent for si_addr==0 elsewhere, > > > it makes sense to follow it. > > > > At a quick grep, I can see a few instances on other architectures where > > si_addr==0. I'll add a comment here. > > OK, cool > > Except: what if we're in PR_MTE_TCF_ASYNC mode. If the SIGSEGV handler > triggers an asynchronous MTE fault itself, we could then get into a > spin. Hmm. How do we handle standard segfaults here? Presumably a signal handler can trigger a SIGSEGV itself. > I take it we drain any pending MTE faults when crossing EL boundaries? We clear the hardware bit on entry to EL1 from EL0 and set a TIF flag. > In that case, an asynchronous MTE fault pending at sigreturn must have > been caused by the signal handler. We could make that particular case > of MTE_AERR a force_sig. We clear the TIF flag when delivering the signal. I don't think there is a way for the kernel to detect when it is running in a signal handler. sigreturn() is not mandatory either. 
> > > > +**Note**: Kernel accesses to user memory (e.g. ``read()`` system call) > > > > +are only checked if the current thread tag checking mode is > > > > +PR_MTE_TCF_SYNC. > > > > > > Vague? Can we make a precise statement about when the kernel will and > > > won't check such accesses? And aren't there limitations (like use of > > > get_user_pages() etc.)? > > > > We could make it slightly clearer by say "kernel accesses to the user > > address space". > > That's not the ambiguity. > > My question is > > 1) Does the kernel guarantee not to check tags on kernel accesses to > user memory without PR_MTE_TCF_SYNC? For ASYNC and NONE, yes, we can guarantee this. > 2) Does the kernel guarantee to check tags on kernel accesses to user > memory with PR_MTE_TCF_SYNC? I'd say yes but it depends on how much knowledge one has about the syscall implementation. If it's access to user address directly, it would be checked. If it goes via get_user_pages(), it won't. Since the user doesn't need to have knowledge of the kernel internals, you are right that we don't guarantee this. > In practice, this note sounds to be more like a kernel implementation > detail rather than advice to userspace. > > Would it make sense to say something like: > > * PR_MTE_TCF_NONE: the kernel does not check tags for kernel accesses > to use memory done by syscalls in the thread. > > * PR_MTE_TCF_ASYNC: the kernel may check some tags for kernel accesses > to user memory done by syscalls. (Should we guarantee that such > faults are reported synchronously on syscall exit? In practice I > think they are. Should we use SEGV_MTESERR in this case? Perhaps > it's not worth making this a special case.) Both NONE and ASYNC are now the same for kernel uaccess - not checked. For background information, I decided against ASYNC uaccess checking since (1) there are some cases where the kernel overreads (strncpy_from_user) and (2) we don't normally generate SIGSEGV on uaccess but rather return -EFAULT. 
The latter is not possible to contain since we only learn about the fault asynchronously, usually after the transfer. > * PR_MTE_TCF_SYNC: the kernel makes best efforts to check tags for > kernel accesses to user memory done by the syscalls, but does not > guarantee to check everything (or does it? I thought we can't really > do that for some odd cases...) It doesn't. I'll add some notes along the lines of your text above. > > > > +excludes all tags other than 0. A user thread can enable specific tags > > > > +in the randomly generated set using the ``prctl(PR_SET_TAGGED_ADDR_CTRL, > > > > +flags, 0, 0, 0)`` system call where ``flags`` contains the tags bitmap > > > > +in the ``PR_MTE_TAG_MASK`` bit-field. > > > > + > > > > +**Note**: The hardware uses an exclude mask but the ``prctl()`` > > > > +interface provides an include mask. An include mask of ``0`` (exclusion > > > > +mask ``0xffff``) results in the CPU always generating tag ``0``. > > > > > > Is there no way to make this default to 1 rather than having a magic > > > meaning for 0? > > > > We follow the hardware behaviour where 0xffff and 0xfffe give the same > > result. > > Exposing this through a purely software interface seems a bit odd: > because the exclude mask is privileged-access-only, the architecture > could amend it to assign a different meaning to 0xffff, providing this > was an opt-in change. Then we'd have to make a mess here. You have a point. An include mask of 0 translates to an exclude mask of 0xffff as per the current patches. If the hardware gains support for one more bit (32 colours), old software running on new hardware may run into unexpected results with an exclude mask of 0xffff. > Can't we just forbid the nonsense value 0 here, or are there other > reasons why that's problematic? It was just easier to start with a default. I wonder whether we should actually switch back to the exclude mask, as per the hardware definition. This way 0 would mean all tags allowed. 
We can still disallow 0xffff as an exclude mask.
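The relationship between the two mask conventions discussed here can be sketched as below. The names are hypothetical and the 16-tag width is the current architecture's. Rejecting an all-ones exclude mask is only the policy floated in this thread, not settled behaviour.

```c
#include <assert.h>
#include <stdint.h>

#define EX_MTE_ALL_TAGS 0xffffu  /* one bit per tag, 16 tags today */

/* Include mask (bit n set: tag n may be generated) to exclude mask. */
static uint16_t ex_include_to_exclude(uint16_t incl)
{
    return (uint16_t)(~incl & EX_MTE_ALL_TAGS);
}

/* Policy floated in the thread: refuse a mask that excludes every tag. */
static int ex_exclude_mask_ok(uint16_t excl)
{
    return excl != EX_MTE_ALL_TAGS;
}
```

With an exclude mask as the user-facing value, 0 naturally means "all tags allowed", and the single meaningless configuration (nothing allowed) is easy to reject explicitly.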
On Mon, May 11, 2020 at 05:40:19PM +0100, Catalin Marinas wrote: > On Mon, May 04, 2020 at 05:46:17PM +0100, Dave P Martin wrote: > > On Thu, Apr 30, 2020 at 05:23:17PM +0100, Catalin Marinas wrote: > > > On Wed, Apr 29, 2020 at 05:47:05PM +0100, Dave P Martin wrote: > > > > On Tue, Apr 21, 2020 at 03:26:03PM +0100, Catalin Marinas wrote: > > > > > +- *Asynchronous* - The kernel raises a ``SIGSEGV``, in the current > > > > > + thread, asynchronously following one or multiple tag check faults, > > > > > + with ``.si_code = SEGV_MTEAERR`` and ``.si_addr = 0``. > > > > > > > > For "current thread": that's a kernel concept. For user-facing > > > > documentation, can we say "the offending thread" or similar? > > > > > > > > For clarity, it's worth saying that the faulting address is not > > > > reported. Or, we could be optimistic that someday this information will > > > > be available and say that si_addr is the faulting address if available, > > > > with 0 meaning the address is not available. > > > > > > > > Maybe (void *)-1 would be better duff address, but I can't see it > > > > mattering much. If there's already precedent for si_addr==0 elsewhere, > > > > it makes sense to follow it. > > > > > > At a quick grep, I can see a few instances on other architectures where > > > si_addr==0. I'll add a comment here. > > > > OK, cool > > > > Except: what if we're in PR_MTE_TCF_ASYNC mode. If the SIGSEGV handler > > triggers an asynchronous MTE fault itself, we could then get into a > > spin. Hmm. > > How do we handle standard segfaults here? Presumably a signal handler > can trigger a SIGSEGV itself. This is similar to the problem is a data abort inside the data abort handler. It can of course happen, but if you don't want this to be fatal then you code the handler carefully so this can't happen. > > I take it we drain any pending MTE faults when crossing EL boundaries? > > We clear the hardware bit on entry to EL1 from EL0 and set a TIF flag. 
> > > In that case, an asynchronous MTE fault pending at sigreturn must have > > been caused by the signal handler. We could make that particular case > > of MTE_AERR a force_sig. > > We clear the TIF flag when delivering the signal. I don't think there is > a way for the kernel to detect when it is running in a signal handler. > sigreturn() is not mandatory either. I guess we can put up with this signal not being fatal then. If you have a SEGV handler at all, you're supposed to code it carefully. This brings us back to force_sig for SERR and a normal signal for AERR. That's probably OK. > > > > > > +**Note**: Kernel accesses to user memory (e.g. ``read()`` system call) > > > > > +are only checked if the current thread tag checking mode is > > > > > +PR_MTE_TCF_SYNC. > > > > > > > > Vague? Can we make a precise statement about when the kernel will and > > > > won't check such accesses? And aren't there limitations (like use of > > > > get_user_pages() etc.)? > > > > > > We could make it slightly clearer by say "kernel accesses to the user > > > address space". > > > > That's not the ambiguity. > > > > My question is > > > > 1) Does the kernel guarantee not to check tags on kernel accesses to > > user memory without PR_MTE_TCF_SYNC? > > For ASYNC and NONE, yes, we can guarantee this. > > > 2) Does the kernel guarantee to check tags on kernel accesses to user > > memory with PR_MTE_TCF_SYNC? > > I'd say yes but it depends on how much knowledge one has about the > syscall implementation. If it's access to user address directly, it > would be checked. If it goes via get_user_pages(), it won't. Since the > user doesn't need to have knowledge of the kernel internals, you are > right that we don't guarantee this. So, from userspace it's not guaranteed. This is what I'd describe as "making best efforts", but not a guarantee. > > In practice, this note sounds to be more like a kernel implementation > > detail rather than advice to userspace. 
> > > > Would it make sense to say something like: > > > > * PR_MTE_TCF_NONE: the kernel does not check tags for kernel accesses > > to use memory done by syscalls in the thread. > > > > * PR_MTE_TCF_ASYNC: the kernel may check some tags for kernel accesses > > to user memory done by syscalls. (Should we guarantee that such > > faults are reported synchronously on syscall exit? In practice I > > think they are. Should we use SEGV_MTESERR in this case? Perhaps > > it's not worth making this a special case.) > > Both NONE and ASYNC are now the same for kernel uaccess - not checked. > > For background information, I decided against ASYNC uaccess checking > since (1) there are some cases where the kernel overreads > (strncpy_from_user) and (2) we don't normally generate SIGSEGV on > uaccess but rather return -EFAULT. The latter is not possible to contain > since we only learn about the fault asynchronously, usually after the > transfer. I may be missing something here. Do we still rely on the hardware to detect tag mismatches in kernel accesses to user memory? I was assuming we do some kind of explicit checking, but now I think that's nonsense (except for get_user_pages() etc.) Since MTE is a new opt-in feature, I think we might have the option to report failures with SIGSEGV instead of -EFAULT. This seems exactly to implement the concept of an asynchronous versus synchronous error. The kernel may not normally do this, but software usually doesn't use raw syscalls. In reality "syscalls" can trigger a SIGSEGV in the libc wrapper anyway. From the caller's point of view the whole thing is a black box. Probably needs discussion with the bionic / glibc folks though (though likely this has been discussed already...) My concern is that the spirit of asynchrous checking in the architecture is that accesses _are_ checked, and we seem to be breaking that principle here. 
Although MTE's guarantees are statistical, based on small random numbers not matching, this imperfection is quite different from systematically not checking at all, ever, on certain major code paths. > > > * PR_MTE_TCF_SYNC: the kernel makes best efforts to check tags for > > kernel accesses to user memory done by the syscalls, but does not > > guarantee to check everything (or does it? I thought we can't really > > do that for some odd cases...) > > It doesn't. I'll add some notes along the lines of your text above. OK > > > > > +excludes all tags other than 0. A user thread can enable specific tags > > > > > +in the randomly generated set using the ``prctl(PR_SET_TAGGED_ADDR_CTRL, > > > > > +flags, 0, 0, 0)`` system call where ``flags`` contains the tags bitmap > > > > > +in the ``PR_MTE_TAG_MASK`` bit-field. > > > > > + > > > > > +**Note**: The hardware uses an exclude mask but the ``prctl()`` > > > > > +interface provides an include mask. An include mask of ``0`` (exclusion > > > > > +mask ``0xffff``) results in the CPU always generating tag ``0``. > > > > > > > > Is there no way to make this default to 1 rather than having a magic > > > > meaning for 0? > > > > > > We follow the hardware behaviour where 0xffff and 0xfffe give the same > > > result. > > > > Exposing this through a purely software interface seems a bit odd: > > because the exclude mask is privileged-access-only, the architecture > > could amend it to assign a different meaning to 0xffff, providing this > > was an opt-in change. Then we'd have to make a mess here. > > You have a point. An include mask of 0 translates to an exclude mask of > 0xffff as per the current patches. If the hardware gains support for one > more bit (32 colours), old software running on new hardware may run into > unexpected results with an exclude mask of 0xffff. > > > Can't we just forbid the nonsense value 0 here, or are there other > > reasons why that's problematic? > > It was just easier to start with a default. 
I wonder whether we should > actually switch back to the exclude mask, as per the hardware > definition. This way 0 would mean all tags allowed. We can still > disallow 0xffff as an exclude mask. If the number of bits might grow, I guess we can make the exclude mask full-width. For example, the hardware can trivially exclude tags 16 and up, because they don't exist anyway. Similarly, the hardware can trivially include tags 16 and up: inclusion only means that the hardware is allowed to generate them, not that it guarantees to. The only configuration that doesn't make sense is "no tags allowed", so I'd argue for explicitly blocking that, even if the architecture aliases that encoding to something else. If we prefer 0 as a default value so that init inherits the correct value from the kernel without any special acrobatics, then we make it an exclude mask, with the semantics that the hardware is allowed to generate any of these tags, but does not have to be capable of generating all of them. Make sense? This is bikeshedding from my end... Cheers ---Dave
On Wed, May 13, 2020 at 04:48:46PM +0100, Dave P Martin wrote: > On Mon, May 11, 2020 at 05:40:19PM +0100, Catalin Marinas wrote: > > On Mon, May 04, 2020 at 05:46:17PM +0100, Dave P Martin wrote: > > > On Thu, Apr 30, 2020 at 05:23:17PM +0100, Catalin Marinas wrote: > > > > On Wed, Apr 29, 2020 at 05:47:05PM +0100, Dave P Martin wrote: > > > > > On Tue, Apr 21, 2020 at 03:26:03PM +0100, Catalin Marinas wrote: > > > > > > +- *Asynchronous* - The kernel raises a ``SIGSEGV``, in the current > > > > > > + thread, asynchronously following one or multiple tag check faults, > > > > > > + with ``.si_code = SEGV_MTEAERR`` and ``.si_addr = 0``. > > > > > > > > > > For "current thread": that's a kernel concept. For user-facing > > > > > documentation, can we say "the offending thread" or similar? > > > > > > > > > > For clarity, it's worth saying that the faulting address is not > > > > > reported. Or, we could be optimistic that someday this information will > > > > > be available and say that si_addr is the faulting address if available, > > > > > with 0 meaning the address is not available. > > > > > > > > > > Maybe (void *)-1 would be better duff address, but I can't see it > > > > > mattering much. If there's already precedent for si_addr==0 elsewhere, > > > > > it makes sense to follow it. > > > > > > > > At a quick grep, I can see a few instances on other architectures where > > > > si_addr==0. I'll add a comment here. > > > > > > OK, cool > > > > > > Except: what if we're in PR_MTE_TCF_ASYNC mode. If the SIGSEGV handler > > > triggers an asynchronous MTE fault itself, we could then get into a > > > spin. Hmm. [...] > > > In that case, an asynchronous MTE fault pending at sigreturn must have > > > been caused by the signal handler. We could make that particular case > > > of MTE_AERR a force_sig. > > > > We clear the TIF flag when delivering the signal. I don't think there is > > a way for the kernel to detect when it is running in a signal handler. 
> > sigreturn() is not mandatory either. > > I guess we can put up with this signal not being fatal then. > > If you have a SEGV handler at all, you're supposed to code it carefully. > > This brings us back to force_sig for SERR and a normal signal for AERR. > That's probably OK. I think we are in agreement now but please check the patches when I post the v4. > > > > > > +**Note**: Kernel accesses to user memory (e.g. ``read()`` system call) > > > > > > +are only checked if the current thread tag checking mode is > > > > > > +PR_MTE_TCF_SYNC. > > > > > > > > > > Vague? Can we make a precise statement about when the kernel will and > > > > > won't check such accesses? And aren't there limitations (like use of > > > > > get_user_pages() etc.)? > > > > > > > > We could make it slightly clearer by say "kernel accesses to the user > > > > address space". > > > > > > That's not the ambiguity. > > > > > > My question is > > > > > > 1) Does the kernel guarantee not to check tags on kernel accesses to > > > user memory without PR_MTE_TCF_SYNC? [...] > > > 2) Does the kernel guarantee to check tags on kernel accesses to user > > > memory with PR_MTE_TCF_SYNC? [...] > > > In practice, this note sounds to be more like a kernel implementation > > > detail rather than advice to userspace. > > > > > > Would it make sense to say something like: > > > > > > * PR_MTE_TCF_NONE: the kernel does not check tags for kernel accesses > > > to use memory done by syscalls in the thread. > > > > > > * PR_MTE_TCF_ASYNC: the kernel may check some tags for kernel accesses > > > to user memory done by syscalls. (Should we guarantee that such > > > faults are reported synchronously on syscall exit? In practice I > > > think they are. Should we use SEGV_MTESERR in this case? Perhaps > > > it's not worth making this a special case.) > > > > Both NONE and ASYNC are now the same for kernel uaccess - not checked. 
> > > > For background information, I decided against ASYNC uaccess checking > > since (1) there are some cases where the kernel overreads > > (strncpy_from_user) and (2) we don't normally generate SIGSEGV on > > uaccess but rather return -EFAULT. The latter is not possible to contain > > since we only learn about the fault asynchronously, usually after the > > transfer. > > I may be missing something here. Do we still rely on the hardware to > detect tag mismatches in kernel accesses to user memory? I was assuming > we do some kind of explicit checking, but now I think that's nonsense > (except for get_user_pages() etc.) For synchronous tag checking, we expect the uaccess (via the user address, e.g. copy_from_user()) to be checked by the hardware. If the access happens via a kernel mapping (get_user_pages()), the access is unchecked. There is no point in an explicit tag access+check from the kernel since the get_user_pages() accesses are not expected to generate faults anyway (once the pages have been returned). We also most likely lost the actual user address at the point of access, so not easy to infer the original tag. > Since MTE is a new opt-in feature, I think we might have the option to > report failures with SIGSEGV instead of -EFAULT. This seems exactly to > implement the concept of an asynchronous versus synchronous error. With synchronous checking, we return -EFAULT, smaller number of bytes etc. since no/less data was copied. With async, the uaccess would perform all the accesses, only that the user may get a SIGSEGV delivered on return from the syscall. > The kernel may not normally do this, but software usually doesn't use > raw syscalls. In reality "syscalls" can trigger a SIGSEGV in the libc > wrapper anyway. From the caller's point of view the whole thing is a > black box. > > Probably needs discussion with the bionic / glibc folks though (though > likely this has been discussed already...) 
The initial plan was to generate SIGSEGV on asynchronous faults for uaccess (on syscall return). This changed when we noticed (in version 3 I think) that the kernel over-reads buffers in some cases (strncpy_from_user(), copy_mount_options()) and triggers false positives. We could fix the above two cases, though in different ways: strncpy_from_user() can align its source (user) address and would no longer be expected to trigger a fault if the string is correctly tagged. copy_mount_options(), OTOH, always reads 4K (not zero-terminated), so it will trip over some tag mismatch. The workaround is to contain the async tag check fault (with DSB before and after the access) and ignore it. However, are these the only two cases where the kernel over-reads user buffers? Without MTE, such faults on uaccess (page faults) were handled by the kernel transparently. We may now start delivering SIGSEGV every time some piece of uaccess kernel code changes and over-reads. > My concern is that the spirit of asynchrous checking in the > architecture is that accesses _are_ checked, and we seem to be > breaking that principle here. I agree with you on the principle but my concern is about the practicality of chasing any future code changes and plugging potentially fatal SIGSEGVs sent to the user. Maybe we need a way to log this so that user (admin) can do something about it like force synchronous. Or we could also toggle synchronous uaccesses irrespective of the user mode or expose this option as a prctl(). Also, do we want some big knob (sysctl) to force some of these modes for all user processes: e.g. force-upgrade async to sync? > > > > > > +excludes all tags other than 0. A user thread can enable specific tags > > > > > > +in the randomly generated set using the ``prctl(PR_SET_TAGGED_ADDR_CTRL, > > > > > > +flags, 0, 0, 0)`` system call where ``flags`` contains the tags bitmap > > > > > > +in the ``PR_MTE_TAG_MASK`` bit-field. 
> > > > > > + > > > > > > +**Note**: The hardware uses an exclude mask but the ``prctl()`` > > > > > > +interface provides an include mask. An include mask of ``0`` (exclusion > > > > > > +mask ``0xffff``) results in the CPU always generating tag ``0``. > > > > > > > > > > Is there no way to make this default to 1 rather than having a magic > > > > > meaning for 0? > > > > > > > > We follow the hardware behaviour where 0xffff and 0xfffe give the same > > > > result. > > > > > > Exposing this through a purely software interface seems a bit odd: > > > because the exclude mask is privileged-access-only, the architecture > > > could amend it to assign a different meaning to 0xffff, providing this > > > was an opt-in change. Then we'd have to make a mess here. > > > > You have a point. An include mask of 0 translates to an exclude mask of > > 0xffff as per the current patches. If the hardware gains support for one > > more bit (32 colours), old software running on new hardware may run into > > unexpected results with an exclude mask of 0xffff. > > > > > Can't we just forbid the nonsense value 0 here, or are there other > > > reasons why that's problematic? > > > > It was just easier to start with a default. I wonder whether we should > > actually switch back to the exclude mask, as per the hardware > > definition. This way 0 would mean all tags allowed. We can still > > disallow 0xffff as an exclude mask. [...] > The only configuration that doesn't make sense is "no tags allowed", so > I'd argue for explicity blocking that, even if the architeture aliases > that encoding to something else. > > If we prefer 0 as a default value so that init inherits the correct > value from the kernel without any special acrobatics, then we make it an > exclude mask, with the semantics that the hardware is allowed to > generate any of these tags, but does not have to be capable of > generating all of them. That's more of a question to the libc people and their preference. 
We have two options with suboptions:

1. prctl() gets an exclude mask with 0xffff illegal even though the
   hardware accepts it:
   a) default exclude mask 0, allowing all tags to be generated by IRG
   b) default exclude mask of 0xfffe so that only tag 0 is generated

2. prctl() gets an include mask with 0 illegal:
   a) default include mask is 0xffff, allowing all tags to be generated
   b) default include mask of 0x0001 so that only tag 0 is generated

We currently have (2) with mask 0 but it could be changed to (2.b). If we are to follow the hardware description (which makes more sense to me but I don't write the C library), (1.a) is the most appropriate.
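To make the include/exclude relationship in the options above concrete, here is a minimal sketch in C. It assumes the 16-bit tag bitmap implied by the 0xffff masks discussed in this thread. The macro MTE_TAG_BITS_MASK and the helper names are hypothetical, for illustration only; they are not part of the kernel or of the prctl() ABI.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumption: 16 tags (4 bits per tag), one mask bit per tag, as implied
 * by the 0xffff masks in this thread. Hypothetical name.
 */
#define MTE_TAG_BITS_MASK 0xffffu

/* An include mask and an exclude mask are complements within the tag bits. */
static inline uint16_t include_to_exclude(uint16_t incl)
{
	return (uint16_t)(~incl & MTE_TAG_BITS_MASK);
}

static inline uint16_t exclude_to_include(uint16_t excl)
{
	return (uint16_t)(~excl & MTE_TAG_BITS_MASK);
}

/* Option 1: exclude-mask interface, 0xffff ("no tags allowed") rejected. */
static inline bool exclude_mask_valid(uint16_t excl)
{
	return excl != MTE_TAG_BITS_MASK;
}

/* Option 2: include-mask interface, 0 ("no tags allowed") rejected. */
static inline bool include_mask_valid(uint16_t incl)
{
	return incl != 0;
}
```

With these helpers, the default of (1.a) is exclude mask 0, i.e. include mask 0xffff, and the default of (1.b) is exclude mask 0xfffe, i.e. include mask 0x0001. Both interfaces reject exactly one encoding, the one meaning "no tags allowed".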
On Thu, May 14, 2020 at 12:37:22PM +0100, Catalin Marinas wrote: > On Wed, May 13, 2020 at 04:48:46PM +0100, Dave P Martin wrote: > > > > > On Wed, Apr 29, 2020 at 05:47:05PM +0100, Dave P Martin wrote: > > > > > > On Tue, Apr 21, 2020 at 03:26:03PM +0100, Catalin Marinas wrote: > > > > > > > +excludes all tags other than 0. A user thread can enable specific tags > > > > > > > +in the randomly generated set using the ``prctl(PR_SET_TAGGED_ADDR_CTRL, > > > > > > > +flags, 0, 0, 0)`` system call where ``flags`` contains the tags bitmap > > > > > > > +in the ``PR_MTE_TAG_MASK`` bit-field. > > > > > > > + > > > > > > > +**Note**: The hardware uses an exclude mask but the ``prctl()`` > > > > > > > +interface provides an include mask. An include mask of ``0`` (exclusion > > > > > > > +mask ``0xffff``) results in the CPU always generating tag ``0``. > > > > > > > > > > > > Is there no way to make this default to 1 rather than having a magic > > > > > > meaning for 0? > [...] > > The only configuration that doesn't make sense is "no tags allowed", so > > I'd argue for explicity blocking that, even if the architeture aliases > > that encoding to something else. > > > > If we prefer 0 as a default value so that init inherits the correct > > value from the kernel without any special acrobatics, then we make it an > > exclude mask, with the semantics that the hardware is allowed to > > generate any of these tags, but does not have to be capable of > > generating all of them. > > That's more of a question to the libc people and their preference. > We have two options with suboptions: > > 1. prctl() gets an exclude mask with 0xffff illegal even though the > hardware accepts it: > a) default exclude mask 0, allowing all tags to be generated by IRG > b) default exclude mask of 0xfffe so that only tag 0 is generated > > 2. 
prctl() gets an include mask with 0 illegal: > a) default include mask is 0xffff, allowing all tags to be generated > b) default include mask 0f 0x0001 so that only tag 0 is generated > > We currently have (2) with mask 0 but could be changed to (2.b). If we > are to follow the hardware description (which makes more sense to me but > I don't write the C library), (1.a) is the most appropriate. Thinking some more about this, as we are to expose the GCR_EL1.Excl via a ptrace interface as a regset, it makes more sense to move back to an exclude mask here with default 0. That would be option 1.a above.
The 05/15/2020 11:38, Catalin Marinas wrote: > On Thu, May 14, 2020 at 12:37:22PM +0100, Catalin Marinas wrote: > > On Wed, May 13, 2020 at 04:48:46PM +0100, Dave P Martin wrote: > > > > > > On Wed, Apr 29, 2020 at 05:47:05PM +0100, Dave P Martin wrote: > > > > > > > On Tue, Apr 21, 2020 at 03:26:03PM +0100, Catalin Marinas wrote: > > > > > > > > +excludes all tags other than 0. A user thread can enable specific tags > > > > > > > > +in the randomly generated set using the ``prctl(PR_SET_TAGGED_ADDR_CTRL, > > > > > > > > +flags, 0, 0, 0)`` system call where ``flags`` contains the tags bitmap > > > > > > > > +in the ``PR_MTE_TAG_MASK`` bit-field. > > > > > > > > + > > > > > > > > +**Note**: The hardware uses an exclude mask but the ``prctl()`` > > > > > > > > +interface provides an include mask. An include mask of ``0`` (exclusion > > > > > > > > +mask ``0xffff``) results in the CPU always generating tag ``0``. > > > > > > > > > > > > > > Is there no way to make this default to 1 rather than having a magic > > > > > > > meaning for 0? > > [...] > > > The only configuration that doesn't make sense is "no tags allowed", so > > > I'd argue for explicity blocking that, even if the architeture aliases > > > that encoding to something else. > > > > > > If we prefer 0 as a default value so that init inherits the correct > > > value from the kernel without any special acrobatics, then we make it an > > > exclude mask, with the semantics that the hardware is allowed to > > > generate any of these tags, but does not have to be capable of > > > generating all of them. > > > > That's more of a question to the libc people and their preference. > > We have two options with suboptions: > > > > 1. prctl() gets an exclude mask with 0xffff illegal even though the > > hardware accepts it: > > a) default exclude mask 0, allowing all tags to be generated by IRG > > b) default exclude mask of 0xfffe so that only tag 0 is generated > > > > 2. 
prctl() gets an include mask with 0 illegal: > > a) default include mask is 0xffff, allowing all tags to be generated > > b) default include mask 0f 0x0001 so that only tag 0 is generated > > > > We currently have (2) with mask 0 but could be changed to (2.b). If we > > are to follow the hardware description (which makes more sense to me but > > I don't write the C library), (1.a) is the most appropriate. > > Thinking some more about this, as we are to expose the GCR_EL1.Excl via > a ptrace interface as a regset, it makes more sense to move back to an > exclude mask here with default 0. That would be option 1.a above. i think the libc has to do a prctl call to set mte up and at that point it will use whatever arguments necessary, so 1.a should work (just like the other options). likely libc will disable 0 for irg and possibly one or two other fixed colors (which will have specific use). the difference i see between 1 vs 2 is forward compatibility if the architecture changes (e.g. adding more tag bits) but then likely new prctl flag will be needed for handling that so it's probably not an issue.
On Fri, May 15, 2020 at 12:14:00PM +0100, Szabolcs Nagy wrote: > The 05/15/2020 11:38, Catalin Marinas wrote: > > On Thu, May 14, 2020 at 12:37:22PM +0100, Catalin Marinas wrote: > > > We have two options with suboptions: > > > > > > 1. prctl() gets an exclude mask with 0xffff illegal even though the > > > hardware accepts it: > > > a) default exclude mask 0, allowing all tags to be generated by IRG > > > b) default exclude mask of 0xfffe so that only tag 0 is generated > > > > > > 2. prctl() gets an include mask with 0 illegal: > > > a) default include mask is 0xffff, allowing all tags to be generated > > > b) default include mask 0f 0x0001 so that only tag 0 is generated > > > > > > We currently have (2) with mask 0 but could be changed to (2.b). If we > > > are to follow the hardware description (which makes more sense to me but > > > I don't write the C library), (1.a) is the most appropriate. > > > > Thinking some more about this, as we are to expose the GCR_EL1.Excl via > > a ptrace interface as a regset, it makes more sense to move back to an > > exclude mask here with default 0. That would be option 1.a above. > > i think the libc has to do a prctl call to set > mte up and at that point it will use whatever > arguments necessary, so 1.a should work (just > like the other options). > > likely libc will disable 0 for irg and possibly > one or two other fixed colors (which will have > specific use). > > the difference i see between 1 vs 2 is forward > compatibility if the architecture changes (e.g. > adding more tag bits) but then likely new prctl > flag will be needed for handling that so it's > probably not an issue. Thanks Szabolcs. While we are at this, no-one so far asked for the GCR_EL1.RRND to be exposed to user (and this implies RGSR_EL1.SEED). Since RRND=1 guarantees a distribution "no worse" than that of RRND=0, I thought there isn't much point in exposing this configuration to the user. 
The only advantage of RRND=0 I see is that the kernel can change the seed randomly but, with only 4 bits per tag, it really doesn't matter much. Anyway, mentioning it here in case anyone is surprised later about the lack of RRND configurability.
The 05/15/2020 12:27, Catalin Marinas wrote: > Thanks Szabolcs. While we are at this, no-one so far asked for the > GCR_EL1.RRND to be exposed to user (and this implies RGSR_EL1.SEED). > Since RRND=1 guarantees a distribution "no worse" than that of RRND=0, I > thought there isn't much point in exposing this configuration to the > user. The only advantage of RRND=0 I see is that the kernel can change it seems RRND=1 is the impl specific algorithm. > the seed randomly but, with only 4 bits per tag, it really doesn't > matter much. > > Anyway, mentioning it here in case anyone is surprised later about the > lack of RRND configurability. i'm not familiar with how irg works. is the seed per process state that's set up at process startup in some way? or shared (and thus effectively irg is non-deterministic in userspace)?
On Fri, May 15, 2020 at 01:04:33PM +0100, Szabolcs Nagy wrote: > The 05/15/2020 12:27, Catalin Marinas wrote: > > Thanks Szabolcs. While we are at this, no-one so far asked for the > > GCR_EL1.RRND to be exposed to user (and this implies RGSR_EL1.SEED). > > Since RRND=1 guarantees a distribution "no worse" than that of RRND=0, I > > thought there isn't much point in exposing this configuration to the > > user. The only advantage of RRND=0 I see is that the kernel can change > > it seems RRND=1 is the impl specific algorithm. Yes, that's the implementation specific algorithm which shouldn't be worse than the standard one. > > the seed randomly but, with only 4 bits per tag, it really doesn't > > matter much. > > > > Anyway, mentioning it here in case anyone is surprised later about the > > lack of RRND configurability. > > i'm not familiar with how irg works. It generates a random tag based on some algorithm. > is the seed per process state that's set up at process startup in some > way? or shared (and thus effectively irg is non-deterministic in > userspace)? The seed is only relevant if the standard algorithm is used (RRND=0).
The 05/15/2020 13:13, Catalin Marinas wrote: > On Fri, May 15, 2020 at 01:04:33PM +0100, Szabolcs Nagy wrote: > > The 05/15/2020 12:27, Catalin Marinas wrote: > > > Thanks Szabolcs. While we are at this, no-one so far asked for the > > > GCR_EL1.RRND to be exposed to user (and this implies RGSR_EL1.SEED). > > > Since RRND=1 guarantees a distribution "no worse" than that of RRND=0, I > > > thought there isn't much point in exposing this configuration to the > > > user. The only advantage of RRND=0 I see is that the kernel can change > > > > it seems RRND=1 is the impl specific algorithm. > > Yes, that's the implementation specific algorithm which shouldn't be > worse than the standard one. > > > > the seed randomly but, with only 4 bits per tag, it really doesn't > > > matter much. > > > > > > Anyway, mentioning it here in case anyone is surprised later about the > > > lack of RRND configurability. > > > > i'm not familiar with how irg works. > > It generates a random tag based on some algorithm. > > > is the seed per process state that's set up at process startup in some > > way? or shared (and thus effectively irg is non-deterministic in > > userspace)? > > The seed is only relevant if the standard algorithm is used (RRND=0). i wanted to understand if we can get deterministic irg behaviour in user space (which may be useful for debugging to get reproducible tag failures). i guess if no control is exposed that means non- deterministic irg. i think this is fine.
On Fri, May 15, 2020 at 01:53:32PM +0100, Szabolcs Nagy wrote: > The 05/15/2020 13:13, Catalin Marinas wrote: > > On Fri, May 15, 2020 at 01:04:33PM +0100, Szabolcs Nagy wrote: > > > The 05/15/2020 12:27, Catalin Marinas wrote: > > > > Thanks Szabolcs. While we are at this, no-one so far asked for the > > > > GCR_EL1.RRND to be exposed to user (and this implies RGSR_EL1.SEED). > > > > Since RRND=1 guarantees a distribution "no worse" than that of RRND=0, I > > > > thought there isn't much point in exposing this configuration to the > > > > user. The only advantage of RRND=0 I see is that the kernel can change > > > > > > it seems RRND=1 is the impl specific algorithm. > > > > Yes, that's the implementation specific algorithm which shouldn't be > > worse than the standard one. > > > > > > the seed randomly but, with only 4 bits per tag, it really doesn't > > > > matter much. > > > > > > > > Anyway, mentioning it here in case anyone is surprised later about the > > > > lack of RRND configurability. > > > > > > i'm not familiar with how irg works. > > > > It generates a random tag based on some algorithm. > > > > > is the seed per process state that's set up at process startup in some > > > way? or shared (and thus effectively irg is non-deterministic in > > > userspace)? > > > > The seed is only relevant if the standard algorithm is used (RRND=0). > > i wanted to understand if we can get deterministic > irg behaviour in user space (which may be useful > for debugging to get reproducible tag failures). > > i guess if no control is exposed that means non- > deterministic irg. i think this is fine. Hmmm, I guess this might eventually be wanted. But it's probably OK not to have it to begin with. Things like CRIU restores won't be reproducible unless the seeds can be saved/restored. Doesn't seem essential from day 1 though. Cheers ---Dave
On Thu, May 14, 2020 at 12:37:22PM +0100, Catalin Marinas wrote: > On Wed, May 13, 2020 at 04:48:46PM +0100, Dave P Martin wrote: > > On Mon, May 11, 2020 at 05:40:19PM +0100, Catalin Marinas wrote: > > > On Mon, May 04, 2020 at 05:46:17PM +0100, Dave P Martin wrote: > > > > On Thu, Apr 30, 2020 at 05:23:17PM +0100, Catalin Marinas wrote: > > > > > On Wed, Apr 29, 2020 at 05:47:05PM +0100, Dave P Martin wrote: > > > > > > On Tue, Apr 21, 2020 at 03:26:03PM +0100, Catalin Marinas wrote: > > > > > > > +excludes all tags other than 0. A user thread can enable specific tags > > > > > > > +in the randomly generated set using the ``prctl(PR_SET_TAGGED_ADDR_CTRL, > > > > > > > +flags, 0, 0, 0)`` system call where ``flags`` contains the tags bitmap > > > > > > > +in the ``PR_MTE_TAG_MASK`` bit-field. > > > > > > > + > > > > > > > +**Note**: The hardware uses an exclude mask but the ``prctl()`` > > > > > > > +interface provides an include mask. An include mask of ``0`` (exclusion > > > > > > > +mask ``0xffff``) results in the CPU always generating tag ``0``. > > > > > > > > > > > > Is there no way to make this default to 1 rather than having a magic > > > > > > meaning for 0? > > > > > > > > > > We follow the hardware behaviour where 0xffff and 0xfffe give the same > > > > > result. > > > > > > > > Exposing this through a purely software interface seems a bit odd: > > > > because the exclude mask is privileged-access-only, the architecture > > > > could amend it to assign a different meaning to 0xffff, providing this > > > > was an opt-in change. Then we'd have to make a mess here. > > > > > > You have a point. An include mask of 0 translates to an exclude mask of > > > 0xffff as per the current patches. If the hardware gains support for one > > > more bit (32 colours), old software running on new hardware may run into > > > unexpected results with an exclude mask of 0xffff. 
> > > > > > > Can't we just forbid the nonsense value 0 here, or are there other > > > > reasons why that's problematic? > > > > > > It was just easier to start with a default. I wonder whether we should > > > actually switch back to the exclude mask, as per the hardware > > > definition. This way 0 would mean all tags allowed. We can still > > > disallow 0xffff as an exclude mask. > [...] > > The only configuration that doesn't make sense is "no tags allowed", so > > I'd argue for explicity blocking that, even if the architeture aliases > > that encoding to something else. > > > > If we prefer 0 as a default value so that init inherits the correct > > value from the kernel without any special acrobatics, then we make it an > > exclude mask, with the semantics that the hardware is allowed to > > generate any of these tags, but does not have to be capable of > > generating all of them. > > That's more of a question to the libc people and their preference. > We have two options with suboptions: > > 1. prctl() gets an exclude mask with 0xffff illegal even though the > hardware accepts it: > a) default exclude mask 0, allowing all tags to be generated by IRG > b) default exclude mask of 0xfffe so that only tag 0 is generated > > 2. prctl() gets an include mask with 0 illegal: > a) default include mask is 0xffff, allowing all tags to be generated > b) default include mask 0f 0x0001 so that only tag 0 is generated > > We currently have (2) with mask 0 but could be changed to (2.b). If we > are to follow the hardware description (which makes more sense to me but > I don't write the C library), (1.a) is the most appropriate. As Peter pointed out on Friday (call), 2.b doesn't work as it breaks the existing prctl() for turning on the tagged address ABI. So we have to accept 0 as the tag mask field. Dave, if you feel strongly about avoiding the exclude mask confusion with 0xffff equivalent to 0xfffe, I'll go for 1.a. 
I have not changed this in the v4 series of the patches (no ABI change in there apart from some minor ptrace tweaks).
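For reference, the reason 2.b breaks existing callers can be sketched as follows. The constants are copied from the example code in this patch; the two helpers are hypothetical and only illustrate how the tag bitmap sits in the prctl() flags under the include-mask convention currently in the series.

```c
/* Values as given in the example code of this patch. */
#define PR_TAGGED_ADDR_ENABLE	(1UL << 0)
#define PR_MTE_TCF_SHIFT	1
#define PR_MTE_TCF_SYNC		(1UL << PR_MTE_TCF_SHIFT)
#define PR_MTE_TAG_SHIFT	3
#define PR_MTE_TAG_MASK		(0xffffUL << PR_MTE_TAG_SHIFT)

/* Hypothetical helper: extract the 16-bit tag bitmap from prctl() flags. */
static inline unsigned long mte_tag_field(unsigned long flags)
{
	return (flags & PR_MTE_TAG_MASK) >> PR_MTE_TAG_SHIFT;
}

/*
 * Hypothetical helper: flags for synchronous tag checking with every tag
 * except 0 allowed for IRG (include-mask convention).
 */
static inline unsigned long mte_flags_sync_nonzero_tags(void)
{
	return PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC |
	       (0xfffeUL << PR_MTE_TAG_SHIFT);
}
```

A program written before MTE that calls prctl(PR_SET_TAGGED_ADDR_CTRL, PR_TAGGED_ADDR_ENABLE, 0, 0, 0) passes a tag field of 0, so treating 0 as illegal would make that existing call fail.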
diff --git a/Documentation/arm64/cpu-feature-registers.rst b/Documentation/arm64/cpu-feature-registers.rst index 41937a8091aa..b5679fa85ad9 100644 --- a/Documentation/arm64/cpu-feature-registers.rst +++ b/Documentation/arm64/cpu-feature-registers.rst @@ -174,6 +174,8 @@ infrastructure: +------------------------------+---------+---------+ | Name | bits | visible | +------------------------------+---------+---------+ + | MTE | [11-8] | y | + +------------------------------+---------+---------+ | SSBS | [7-4] | y | +------------------------------+---------+---------+ diff --git a/Documentation/arm64/elf_hwcaps.rst b/Documentation/arm64/elf_hwcaps.rst index 7dfb97dfe416..ca7f90e99e3a 100644 --- a/Documentation/arm64/elf_hwcaps.rst +++ b/Documentation/arm64/elf_hwcaps.rst @@ -236,6 +236,11 @@ HWCAP2_RNG Functionality implied by ID_AA64ISAR0_EL1.RNDR == 0b0001. +HWCAP2_MTE + + Functionality implied by ID_AA64PFR1_EL1.MTE == 0b0010, as described + by Documentation/arm64/memory-tagging-extension.rst. + 4. 
Unused AT_HWCAP bits ----------------------- diff --git a/Documentation/arm64/index.rst b/Documentation/arm64/index.rst index 09cbb4ed2237..4cd0e696f064 100644 --- a/Documentation/arm64/index.rst +++ b/Documentation/arm64/index.rst @@ -14,6 +14,7 @@ ARM64 Architecture hugetlbpage legacy_instructions memory + memory-tagging-extension pointer-authentication silicon-errata sve diff --git a/Documentation/arm64/memory-tagging-extension.rst b/Documentation/arm64/memory-tagging-extension.rst new file mode 100644 index 000000000000..f82dfbd70061 --- /dev/null +++ b/Documentation/arm64/memory-tagging-extension.rst @@ -0,0 +1,260 @@ +=============================================== +Memory Tagging Extension (MTE) in AArch64 Linux +=============================================== + +Authors: Vincenzo Frascino <vincenzo.frascino@arm.com> + Catalin Marinas <catalin.marinas@arm.com> + +Date: 2020-02-25 + +This document describes the provision of the Memory Tagging Extension +functionality in AArch64 Linux. + +Introduction +============ + +ARMv8.5 based processors introduce the Memory Tagging Extension (MTE) +feature. MTE is built on top of the ARMv8.0 virtual address tagging TBI +(Top Byte Ignore) feature and allows software to access a 4-bit +allocation tag for each 16-byte granule in the physical address space. +Such memory range must be mapped with the Normal-Tagged memory +attribute. A logical tag is derived from bits 59-56 of the virtual +address used for the memory access. A CPU with MTE enabled will compare +the logical tag against the allocation tag and potentially raise an +exception on mismatch, subject to system registers configuration. + +Userspace Support +================= + +When ``CONFIG_ARM64_MTE`` is selected and Memory Tagging Extension is +supported by the hardware, the kernel advertises the feature to +userspace via ``HWCAP2_MTE``. 
+ +PROT_MTE +-------- + +To access the allocation tags, a user process must enable the Tagged +memory attribute on an address range using a new ``prot`` flag for +``mmap()`` and ``mprotect()``: + +``PROT_MTE`` - Pages allow access to the MTE allocation tags. + +The allocation tag is set to 0 when such pages are first mapped in the +user address space and preserved on copy-on-write. ``MAP_SHARED`` is +supported and the allocation tags can be shared between processes. + +**Note**: ``PROT_MTE`` is only supported on ``MAP_ANONYMOUS`` and +RAM-based file mappings (``tmpfs``, ``memfd``). Passing it to other +types of mapping will result in ``-EINVAL`` returned by these system +calls. + +**Note**: The ``PROT_MTE`` flag (and corresponding memory type) cannot +be cleared by ``mprotect()``. + +Tag Check Faults +---------------- + +When ``PROT_MTE`` is enabled on an address range and a mismatch between +the logical and allocation tags occurs on access, there are three +configurable behaviours: + +- *Ignore* - This is the default mode. The CPU (and kernel) ignores the + tag check fault. + +- *Synchronous* - The kernel raises a ``SIGSEGV`` synchronously, with + ``.si_code = SEGV_MTESERR`` and ``.si_addr = <fault-address>``. The + memory access is not performed. + +- *Asynchronous* - The kernel raises a ``SIGSEGV``, in the current + thread, asynchronously following one or multiple tag check faults, + with ``.si_code = SEGV_MTEAERR`` and ``.si_addr = 0``. + +**Note**: There are no *match-all* logical tags available for user +applications. 
+
+The user can select the above modes, per thread, using the
+``prctl(PR_SET_TAGGED_ADDR_CTRL, flags, 0, 0, 0)`` system call where
+``flags`` contains one of the following values in the ``PR_MTE_TCF_MASK``
+bit-field:
+
+- ``PR_MTE_TCF_NONE``  - *Ignore* tag check faults
+- ``PR_MTE_TCF_SYNC``  - *Synchronous* tag check fault mode
+- ``PR_MTE_TCF_ASYNC`` - *Asynchronous* tag check fault mode
+
+Tag checking can also be disabled for a user thread by setting the
+``PSTATE.TCO`` bit with ``MSR TCO, #1``.
+
+**Note**: Signal handlers are always invoked with ``PSTATE.TCO = 0``,
+irrespective of the interrupted context.
+
+**Note**: Kernel accesses to user memory (e.g. ``read()`` system call)
+are only checked if the current thread's tag checking mode is
+``PR_MTE_TCF_SYNC``.
+
+Excluding Tags in the ``IRG``, ``ADDG`` and ``SUBG`` instructions
+-----------------------------------------------------------------
+
+The architecture allows certain tags to be excluded from random
+generation via the ``GCR_EL1.Exclude`` register bit-field. By default,
+Linux excludes all tags other than 0. A user thread can enable specific
+tags in the randomly generated set using the
+``prctl(PR_SET_TAGGED_ADDR_CTRL, flags, 0, 0, 0)`` system call where
+``flags`` contains the tags bitmap in the ``PR_MTE_TAG_MASK`` bit-field.
+
+**Note**: The hardware uses an exclude mask but the ``prctl()``
+interface provides an include mask. An include mask of ``0`` (exclusion
+mask ``0xffff``) results in the CPU always generating tag ``0``.
+
+The ``ptrace()`` interface
+--------------------------
+
+``PTRACE_PEEKMTETAGS`` and ``PTRACE_POKEMTETAGS`` allow a tracer to read
+the tags from or set the tags to a tracee's address space. The
+``ptrace()`` syscall is invoked as ``ptrace(request, pid, addr, data)``
+where:
+
+- ``request`` - one of ``PTRACE_PEEKMTETAGS`` or ``PTRACE_POKEMTETAGS``.
+- ``pid`` - the tracee's PID.
+- ``addr`` - address in the tracee's address space.
+- ``data`` - pointer to a ``struct iovec`` where ``iov_base`` points to
+  a buffer of ``iov_len`` length in the tracer's address space.
+
+The tags in the tracer's ``iov_base`` buffer are represented as one tag
+per byte and correspond to a 16-byte MTE tag granule in the tracee's
+address space.
+
+``ptrace()`` return value:
+
+- 0 - success, the tracer's ``iov_len`` was updated to the number of
+  tags copied (it may be smaller than the requested ``iov_len`` if the
+  requested address range in the tracee's or the tracer's space cannot
+  be fully accessed).
+- ``-EPERM`` - the specified process cannot be traced.
+- ``-EIO`` - the tracee's address range cannot be accessed (e.g. invalid
+  address) and no tags copied. ``iov_len`` not updated.
+- ``-EFAULT`` - fault on accessing the tracer's memory (``struct iovec``
+  or ``iov_base`` buffer) and no tags copied. ``iov_len`` not updated.
+
+Example of correct usage
+========================
+
+*MTE Example code*
+
+.. code-block:: c
+
+    /*
+     * To be compiled with -march=armv8.5-a+memtag
+     */
+    #include <errno.h>
+    #include <stdio.h>
+    #include <stdlib.h>
+    #include <unistd.h>
+    #include <sys/auxv.h>
+    #include <sys/mman.h>
+    #include <sys/prctl.h>
+
+    /*
+     * From arch/arm64/include/uapi/asm/hwcap.h
+     */
+    #define HWCAP2_MTE              (1 << 18)
+
+    /*
+     * From arch/arm64/include/uapi/asm/mman.h
+     */
+    #define PROT_MTE                 0x20
+
+    /*
+     * From include/uapi/linux/prctl.h
+     */
+    #define PR_SET_TAGGED_ADDR_CTRL 55
+    #define PR_GET_TAGGED_ADDR_CTRL 56
+    # define PR_TAGGED_ADDR_ENABLE  (1UL << 0)
+    # define PR_MTE_TCF_SHIFT       1
+    # define PR_MTE_TCF_NONE        (0UL << PR_MTE_TCF_SHIFT)
+    # define PR_MTE_TCF_SYNC        (1UL << PR_MTE_TCF_SHIFT)
+    # define PR_MTE_TCF_ASYNC       (2UL << PR_MTE_TCF_SHIFT)
+    # define PR_MTE_TCF_MASK        (3UL << PR_MTE_TCF_SHIFT)
+    # define PR_MTE_TAG_SHIFT       3
+    # define PR_MTE_TAG_MASK        (0xffffUL << PR_MTE_TAG_SHIFT)
+
+    /*
+     * Insert a random logical tag into the given pointer.
+     */
+    #define insert_random_tag(ptr) ({                       \
+            unsigned long __val;                            \
+            asm("irg %0, %1" : "=r" (__val) : "r" (ptr));   \
+            __val;                                          \
+    })
+
+    /*
+     * Set the allocation tag on the destination address.
+     */
+    #define set_tag(tagged_addr) do {                                      \
+            asm volatile("stg %0, [%0]" : : "r" (tagged_addr) : "memory"); \
+    } while (0)
+
+    int main(void)
+    {
+            unsigned long *a;
+            unsigned long page_sz = getpagesize();
+            unsigned long hwcap2 = getauxval(AT_HWCAP2);
+
+            /* check if MTE is present */
+            if (!(hwcap2 & HWCAP2_MTE))
+                    return -1;
+
+            /*
+             * Enable the tagged address ABI, synchronous MTE tag check faults and
+             * allow all non-zero tags in the randomly generated set.
+             */
+            if (prctl(PR_SET_TAGGED_ADDR_CTRL,
+                      PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC | (0xfffe << PR_MTE_TAG_SHIFT),
+                      0, 0, 0)) {
+                    perror("prctl() failed");
+                    return -1;
+            }
+
+            a = mmap(0, page_sz, PROT_READ | PROT_WRITE,
+                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+            if (a == MAP_FAILED) {
+                    perror("mmap() failed");
+                    return -1;
+            }
+
+            /*
+             * Enable MTE on the above anonymous mmap. The flag could instead be
+             * passed directly to mmap(), making this step unnecessary.
+             */
+            if (mprotect(a, page_sz, PROT_READ | PROT_WRITE | PROT_MTE)) {
+                    perror("mprotect() failed");
+                    return -1;
+            }
+
+            /* access with the default tag (0) */
+            a[0] = 1;
+            a[1] = 2;
+
+            printf("a[0] = %lu a[1] = %lu\n", a[0], a[1]);
+
+            /* set the logical and allocation tags */
+            a = (unsigned long *)insert_random_tag(a);
+            set_tag(a);
+
+            printf("%p\n", a);
+
+            /* non-zero tag access */
+            a[0] = 3;
+            printf("a[0] = %lu a[1] = %lu\n", a[0], a[1]);
+
+            /*
+             * a[2] lies in the next 16-byte granule, whose allocation tag is still
+             * 0, so with MTE enabled correctly this store generates an exception.
+             */
+            printf("Expecting SIGSEGV...\n");
+            a[2] = 0xdead;
+
+            /* this should not be printed in the PR_MTE_TCF_SYNC mode */
+            printf("...done\n");
+
+            return 0;
+    }