[v9,0/8] arm64: untag user pointers passed to the kernel

Message ID	cover.1544445454.git.andreyknvl@google.com (mailing list archive)
Headers	show Return-Path: <linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org> From: Andrey Konovalov <andreyknvl@google.com> To: Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will.deacon@arm.com>, Mark Rutland <mark.rutland@arm.com>, Robin Murphy <robin.murphy@arm.com>, Kees Cook <keescook@chromium.org>, Kate Stewart <kstewart@linuxfoundation.org>, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, Andrew Morton <akpm@linux-foundation.org>, Ingo Molnar <mingo@kernel.org>, "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>, Shuah Khan <shuah@kernel.org>, linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH v9 0/8] arm64: untag user pointers passed to the kernel Date: Mon, 10 Dec 2018 13:50:57 +0100 Message-Id: <cover.1544445454.git.andreyknvl@google.com> MIME-Version: 1.0 Precedence: list Cc: Chintan Pandya <cpandya@codeaurora.org>, Jacob Bramley <Jacob.Bramley@arm.com>, Ruben Ayrapetyan <Ruben.Ayrapetyan@arm.com>, Andrey Konovalov <andreyknvl@google.com>, Lee Smith <Lee.Smith@arm.com>, Kostya Serebryany <kcc@google.com>, Dmitry Vyukov <dvyukov@google.com>, Ramana Radhakrishnan <Ramana.Radhakrishnan@arm.com>, Luc Van Oostenryck <luc.vanoostenryck@gmail.com>, Evgeniy Stepanov <eugenis@google.com> Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org
Series	arm64: untag user pointers passed to the kernel \| expand [v9,0/8] arm64: untag user pointers passed to the kernel [v9,1/8] arm64: add type casts to untagged_addr macro [v9,2/8] uaccess: add untagged_addr definition for other arches [v9,3/8] arm64: untag user addresses in access_ok and __uaccess_mask_ptr [v9,4/8] mm, arm64: untag user addresses in mm/gup.c [v9,5/8] lib, arm64: untag addrs passed to strncpy_from_user and strnlen_user [v9,6/8] fs, arm64: untag user address in copy_mount_options [v9,7/8] arm64: update Documentation/arm64/tagged-pointers.txt [v9,8/8] selftests, arm64: add a selftest for passing tagged pointers to kernel

Andrey Konovalov Dec. 10, 2018, 12:50 p.m. UTC

arm64 has a feature called Top Byte Ignore, which allows to embed pointer
tags into the top byte of each pointer. Userspace programs (such as
HWASan, a memory debugging tool [1]) might use this feature and pass
tagged user pointers to the kernel through syscalls or other interfaces.

Right now the kernel is already able to handle user faults with tagged
pointers, due to these patches:

1. 81cddd65 ("arm64: traps: fix userspace cache maintenance emulation on a
             tagged pointer")
2. 7dcd9dd8 ("arm64: hw_breakpoint: fix watchpoint matching for tagged
              pointers")
3. 276e9327 ("arm64: entry: improve data abort handling of tagged
              pointers")

When passing tagged pointers to syscalls, there's a special case of such a
pointer being passed to one of the memory syscalls (mmap, mprotect, etc.).
These syscalls don't do memory accesses but rather deal with memory
ranges, hence an untagged pointer is better suited.

This patchset extends tagged pointer support to non-memory syscalls. This
is done by reusing the untagged_addr macro to untag user pointers when the
kernel performs pointer checking to find out whether the pointer comes
from userspace (most notably in access_ok). The untagging is done only
when the pointer is being checked, the tag is preserved as the pointer
makes its way through the kernel.

One of the alternative approaches to untagging that was considered is to
completely strip the pointer tag as the pointer enters the kernel with
some kind of a syscall wrapper, but that won't work with the countless
number of different ioctl calls. With this approach we would need a custom
wrapper for each ioctl variation, which doesn't seem practical.

The following testing approaches has been taken to find potential issues
with user pointer untagging:

1. Static testing (with sparse [2] and separately with a custom static
   analyzer based on Clang) to track casts of __user pointers to integer
   types to find places where untagging needs to be done.

2. Dynamic testing: adding BUG_ON(has_tag(addr)) to find_vma() and running
   a modified syzkaller version that passes tagged pointers to the kernel.

Based on the results of the testing the requried patches have been added
to the patchset.

This patchset has been merged into the Pixel 2 kernel tree and is now
being used to enable testing of Pixel 2 phones with HWASan.

This patchset is a prerequisite for ARM's memory tagging hardware feature
support [3].

Thanks!

[1] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html

[2] https://github.com/lucvoo/sparse-dev/commit/5f960cb10f56ec2017c128ef9d16060e0145f292

[3] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a

Changes in v9:
- Rebased onto 4.20-rc6.
- Used u64 instead of __u64 in type casts in the untagged_addr macro for
  arm64.
- Added braces around (addr) in the untagged_addr macro for other arches.

Changes in v8:
- Rebased onto 65102238 (4.20-rc1).
- Added a note to the cover letter on why syscall wrappers/shims that untag
  user pointers won't work.
- Added a note to the cover letter that this patchset has been merged into
  the Pixel 2 kernel tree.
- Documentation fixes, in particular added a list of syscalls that don't
  support tagged user pointers.

Changes in v7:
- Rebased onto 17b57b18 (4.19-rc6).
- Dropped the "arm64: untag user address in __do_user_fault" patch, since
  the existing patches already handle user faults properly.
- Dropped the "usb, arm64: untag user addresses in devio" patch, since the
  passed pointer must come from a vma and therefore be untagged.
- Dropped the "arm64: annotate user pointers casts detected by sparse"
  patch (see the discussion to the replies of the v6 of this patchset).
- Added more context to the cover letter.
- Updated Documentation/arm64/tagged-pointers.txt.

Changes in v6:
  1 From 502466b9652c57a23af3bd72124144319212f30b Mon Sep 17 00:00:00 2001
- Added annotations for user pointer casts found by sparse.
  1 From 502466b9652c57a23af3bd72124144319212f30b Mon Sep 17 00:00:00 2001
- Rebased onto 050cdc6c (4.19-rc1+).
  1 From 502466b9652c57a23af3bd72124144319212f30b Mon Sep 17 00:00:00 2001

Changes in v5:
- Added 3 new patches that add untagging to places found with static
  analysis.
- Rebased onto 44c929e1 (4.18-rc8).

Changes in v4:
- Added a selftest for checking that passing tagged pointers to the
  kernel succeeds.
- Rebased onto 81e97f013 (4.18-rc1+).

Changes in v3:
- Rebased onto e5c51f30 (4.17-rc6+).
- Added linux-arch@ to the list of recipients.

Changes in v2:
- Rebased onto 2d618bdf (4.17-rc3+).
- Removed excessive untagging in gup.c.
- Removed untagging pointers returned from __uaccess_mask_ptr.

Changes in v1:
- Rebased onto 4.17-rc1.

Changes in RFC v2:
- Added "#ifndef untagged_addr..." fallback in linux/uaccess.h instead of
  defining it for each arch individually.
- Updated Documentation/arm64/tagged-pointers.txt.
- Dropped "mm, arm64: untag user addresses in memory syscalls".
- Rebased onto 3eb2ce82 (4.16-rc7).

Reviewed-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>

Andrey Konovalov (8):
  arm64: add type casts to untagged_addr macro
  uaccess: add untagged_addr definition for other arches
  arm64: untag user addresses in access_ok and __uaccess_mask_ptr
  mm, arm64: untag user addresses in mm/gup.c
  lib, arm64: untag addrs passed to strncpy_from_user and strnlen_user
  fs, arm64: untag user address in copy_mount_options
  arm64: update Documentation/arm64/tagged-pointers.txt
  selftests, arm64: add a selftest for passing tagged pointers to kernel

 Documentation/arm64/tagged-pointers.txt       | 25 +++++++++++--------
 arch/arm64/include/asm/uaccess.h              | 14 +++++++----
 fs/namespace.c                                |  2 +-
 include/linux/uaccess.h                       |  4 +++
 lib/strncpy_from_user.c                       |  2 ++
 lib/strnlen_user.c                            |  2 ++
 mm/gup.c                                      |  4 +++
 tools/testing/selftests/arm64/.gitignore      |  1 +
 tools/testing/selftests/arm64/Makefile        | 11 ++++++++
 .../testing/selftests/arm64/run_tags_test.sh  | 12 +++++++++
 tools/testing/selftests/arm64/tags_test.c     | 19 ++++++++++++++
 11 files changed, 80 insertions(+), 16 deletions(-)
 create mode 100644 tools/testing/selftests/arm64/.gitignore
 create mode 100644 tools/testing/selftests/arm64/Makefile
 create mode 100755 tools/testing/selftests/arm64/run_tags_test.sh
 create mode 100644 tools/testing/selftests/arm64/tags_test.c

Vincenzo Frascino Dec. 10, 2018, 2:30 p.m. UTC | #1

On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence
the userspace (EL0) is allowed to set a non-zero value in the top
byte but the resulting pointers are not allowed at the user-kernel
syscall ABI boundary.

This patchset proposes a relaxation of the ABI and a mechanism to
advertise it to the userspace via an AT_FLAGS.

The rationale behind the choice of AT_FLAGS is that the Unix System V
ABI defines AT_FLAGS as "flags", leaving some degree of freedom in
interpretation.
There are two previous attempts of using AT_FLAGS in the Linux Kernel
for different reasons: the first was more generic and was used to expose
the support for the GNU STACK NX feature [1] and the second was done for
the MIPS architecture and was used to expose the support of "MIPS ABI
Extension for IEEE Std 754 Non-Compliant Interlinking" [2].
Both the changes are currently _not_ merged in mainline.
The only architecture that reserves some of the bits in AT_FLAGS is
currently MIPS, which introduced the concept of platform specific ABI
(psABI) reserving the top-byte [3].

When ARM64_AT_FLAGS_SYSCALL_TBI is set the kernel is advertising
to the userspace that a relaxed ABI is supported hence this type
of pointers are now allowed to be passed to the syscalls when they are
in memory ranges obtained by anonymous mmap() or brk().

The userspace _must_ verify that the flag is set before passing tagged
pointers to the syscalls allowed by this relaxation.

More in general, exposing the ARM64_AT_FLAGS_SYSCALL_TBI flag and mandating
to the software to check that the feature is present, before using the
associated functionality, it provides a degree of control on the decision
of disabling such a feature in future without consequently breaking the
userspace.

The change required a modification of the elf common code, because in Linux
the AT_FLAGS are currently set to zero by default by the kernel.

The newly added flag has been verified on arm64 using the code below.
#include <stdio.h>
#include <stdbool.h>
#include <sys/auxv.h>

#define ARM64_AT_FLAGS_SYSCALL_TBI     (1 << 0)

bool arm64_syscall_tbi_is_present(void)
{
        unsigned long at_flags = getauxval(AT_FLAGS);
        if (at_flags & ARM64_AT_FLAGS_SYSCALL_TBI)
                return true;

        return false;
}

void main()
{
        if (arm64_syscall_tbi_is_present())
                printf("ARM64_AT_FLAGS_SYSCALL_TBI is present\n");
}

This patchset should be merged together with [4].

[1] https://patchwork.ozlabs.org/patch/579578/
[2] https://lore.kernel.org/patchwork/cover/618280/
[3] ftp://www.linux-mips.org/pub/linux/mips/doc/ABI/psABI_mips3.0.pdf
[4] https://patchwork.kernel.org/cover/10674351/

ABI References:
---------------
Sco SysV ABI: http://www.sco.com/developers/gabi/2003-12-17/contents.html
PowerPC AUXV: http://openpowerfoundation.org/wp-content/uploads/resources/leabi/content/dbdoclet.50655242_98651.html
AMD64 ABI: https://www.cs.tufts.edu/comp/40-2012f/readings/amd64-abi.pdf
x86 ABI: https://www.uclibc.org/docs/psABI-i386.pdf
MIPS ABI: ftp://www.linux-mips.org/pub/linux/mips/doc/ABI/psABI_mips3.0.pdf
ARM ABI: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0044f/IHI0044F_aaelf.pdf
SPARC ABI: http://math-atlas.sourceforge.net/devel/assembly/abi_sysV_sparc.pdf

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Chintan Pandya <cpandya@codeaurora.org>
Cc: Jacob Bramley <Jacob.Bramley@arm.com>
Cc: Ruben Ayrapetyan <Ruben.Ayrapetyan@arm.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Lee Smith <Lee.Smith@arm.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>,
Cc: Ramana Radhakrishnan <Ramana.Radhakrishnan@arm.com>
Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Cc: Evgeniy Stepanov <eugenis@google.com>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>

Vincenzo Frascino (3):
  elf: Make AT_FLAGS arch configurable
  arm64: Define Documentation/arm64/elf_at_flags.txt
  arm64: elf: Advertise relaxed ABI

 Documentation/arm64/elf_at_flags.txt  | 111 ++++++++++++++++++++++++++
 arch/arm64/include/asm/atflags.h      |   7 ++
 arch/arm64/include/asm/elf.h          |   5 ++
 arch/arm64/include/uapi/asm/atflags.h |   8 ++
 fs/binfmt_elf.c                       |   6 +-
 fs/binfmt_elf_fdpic.c                 |   6 +-
 fs/compat_binfmt_elf.c                |   5 ++
 7 files changed, 146 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/arm64/elf_at_flags.txt
 create mode 100644 arch/arm64/include/asm/atflags.h
 create mode 100644 arch/arm64/include/uapi/asm/atflags.h

Andrey Konovalov Dec. 12, 2018, 2:23 p.m. UTC | #2

On Mon, Dec 10, 2018 at 3:31 PM Vincenzo Frascino
<vincenzo.frascino@arm.com> wrote:
>
> On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence
> the userspace (EL0) is allowed to set a non-zero value in the top
> byte but the resulting pointers are not allowed at the user-kernel
> syscall ABI boundary.
>
> This patchset proposes a relaxation of the ABI and a mechanism to
> advertise it to the userspace via an AT_FLAGS.
>
> The rationale behind the choice of AT_FLAGS is that the Unix System V
> ABI defines AT_FLAGS as "flags", leaving some degree of freedom in
> interpretation.
> There are two previous attempts of using AT_FLAGS in the Linux Kernel
> for different reasons: the first was more generic and was used to expose
> the support for the GNU STACK NX feature [1] and the second was done for
> the MIPS architecture and was used to expose the support of "MIPS ABI
> Extension for IEEE Std 754 Non-Compliant Interlinking" [2].
> Both the changes are currently _not_ merged in mainline.
> The only architecture that reserves some of the bits in AT_FLAGS is
> currently MIPS, which introduced the concept of platform specific ABI
> (psABI) reserving the top-byte [3].
>
> When ARM64_AT_FLAGS_SYSCALL_TBI is set the kernel is advertising
> to the userspace that a relaxed ABI is supported hence this type
> of pointers are now allowed to be passed to the syscalls when they are
> in memory ranges obtained by anonymous mmap() or brk().
>
> The userspace _must_ verify that the flag is set before passing tagged
> pointers to the syscalls allowed by this relaxation.
>
> More in general, exposing the ARM64_AT_FLAGS_SYSCALL_TBI flag and mandating
> to the software to check that the feature is present, before using the
> associated functionality, it provides a degree of control on the decision
> of disabling such a feature in future without consequently breaking the
> userspace.
>
> The change required a modification of the elf common code, because in Linux
> the AT_FLAGS are currently set to zero by default by the kernel.
>
> The newly added flag has been verified on arm64 using the code below.
> #include <stdio.h>
> #include <stdbool.h>
> #include <sys/auxv.h>
>
> #define ARM64_AT_FLAGS_SYSCALL_TBI     (1 << 0)
>
> bool arm64_syscall_tbi_is_present(void)
> {
>         unsigned long at_flags = getauxval(AT_FLAGS);
>         if (at_flags & ARM64_AT_FLAGS_SYSCALL_TBI)
>                 return true;
>
>         return false;
> }
>
> void main()
> {
>         if (arm64_syscall_tbi_is_present())
>                 printf("ARM64_AT_FLAGS_SYSCALL_TBI is present\n");
> }
>
> This patchset should be merged together with [4].
>
> [1] https://patchwork.ozlabs.org/patch/579578/
> [2] https://lore.kernel.org/patchwork/cover/618280/
> [3] ftp://www.linux-mips.org/pub/linux/mips/doc/ABI/psABI_mips3.0.pdf
> [4] https://patchwork.kernel.org/cover/10674351/
>
> ABI References:
> ---------------
> Sco SysV ABI: http://www.sco.com/developers/gabi/2003-12-17/contents.html
> PowerPC AUXV: http://openpowerfoundation.org/wp-content/uploads/resources/leabi/content/dbdoclet.50655242_98651.html
> AMD64 ABI: https://www.cs.tufts.edu/comp/40-2012f/readings/amd64-abi.pdf
> x86 ABI: https://www.uclibc.org/docs/psABI-i386.pdf
> MIPS ABI: ftp://www.linux-mips.org/pub/linux/mips/doc/ABI/psABI_mips3.0.pdf
> ARM ABI: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0044f/IHI0044F_aaelf.pdf
> SPARC ABI: http://math-atlas.sourceforge.net/devel/assembly/abi_sysV_sparc.pdf
>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Kate Stewart <kstewart@linuxfoundation.org>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: Chintan Pandya <cpandya@codeaurora.org>
> Cc: Jacob Bramley <Jacob.Bramley@arm.com>
> Cc: Ruben Ayrapetyan <Ruben.Ayrapetyan@arm.com>
> Cc: Andrey Konovalov <andreyknvl@google.com>
> Cc: Lee Smith <Lee.Smith@arm.com>
> Cc: Kostya Serebryany <kcc@google.com>
> Cc: Dmitry Vyukov <dvyukov@google.com>,
> Cc: Ramana Radhakrishnan <Ramana.Radhakrishnan@arm.com>
> Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
> Cc: Evgeniy Stepanov <eugenis@google.com>
> CC: Alexander Viro <viro@zeniv.linux.org.uk>
> Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
>
> Vincenzo Frascino (3):
>   elf: Make AT_FLAGS arch configurable
>   arm64: Define Documentation/arm64/elf_at_flags.txt
>   arm64: elf: Advertise relaxed ABI
>
>  Documentation/arm64/elf_at_flags.txt  | 111 ++++++++++++++++++++++++++
>  arch/arm64/include/asm/atflags.h      |   7 ++
>  arch/arm64/include/asm/elf.h          |   5 ++
>  arch/arm64/include/uapi/asm/atflags.h |   8 ++
>  fs/binfmt_elf.c                       |   6 +-
>  fs/binfmt_elf_fdpic.c                 |   6 +-
>  fs/compat_binfmt_elf.c                |   5 ++
>  7 files changed, 146 insertions(+), 2 deletions(-)
>  create mode 100644 Documentation/arm64/elf_at_flags.txt
>  create mode 100644 arch/arm64/include/asm/atflags.h
>  create mode 100644 arch/arm64/include/uapi/asm/atflags.h
>
> --
> 2.19.2
>

Acked-by: Andrey Konovalov <andreyknvl@google.com>

Catalin Marinas Dec. 12, 2018, 3:02 p.m. UTC | #3

Hi Andrey,

On Wed, Dec 12, 2018 at 03:23:25PM +0100, Andrey Konovalov wrote:
> On Mon, Dec 10, 2018 at 3:31 PM Vincenzo Frascino
> <vincenzo.frascino@arm.com> wrote:
> > On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence
> > the userspace (EL0) is allowed to set a non-zero value in the top
> > byte but the resulting pointers are not allowed at the user-kernel
> > syscall ABI boundary.
> >
> > This patchset proposes a relaxation of the ABI and a mechanism to
> > advertise it to the userspace via an AT_FLAGS.
> >
> > The rationale behind the choice of AT_FLAGS is that the Unix System V
> > ABI defines AT_FLAGS as "flags", leaving some degree of freedom in
> > interpretation.
> > There are two previous attempts of using AT_FLAGS in the Linux Kernel
> > for different reasons: the first was more generic and was used to expose
> > the support for the GNU STACK NX feature [1] and the second was done for
> > the MIPS architecture and was used to expose the support of "MIPS ABI
> > Extension for IEEE Std 754 Non-Compliant Interlinking" [2].
> > Both the changes are currently _not_ merged in mainline.
> > The only architecture that reserves some of the bits in AT_FLAGS is
> > currently MIPS, which introduced the concept of platform specific ABI
> > (psABI) reserving the top-byte [3].
> >
> > When ARM64_AT_FLAGS_SYSCALL_TBI is set the kernel is advertising
> > to the userspace that a relaxed ABI is supported hence this type
> > of pointers are now allowed to be passed to the syscalls when they are
> > in memory ranges obtained by anonymous mmap() or brk().
> >
> > The userspace _must_ verify that the flag is set before passing tagged
> > pointers to the syscalls allowed by this relaxation.
> >
> > More in general, exposing the ARM64_AT_FLAGS_SYSCALL_TBI flag and mandating
> > to the software to check that the feature is present, before using the
> > associated functionality, it provides a degree of control on the decision
> > of disabling such a feature in future without consequently breaking the
> > userspace.
[...]
> Acked-by: Andrey Konovalov <andreyknvl@google.com>

Thanks for the ack. However, if we go ahead with this ABI proposal it
means that your patches need to be reworked to allow a non-zero top byte
in all syscalls, including mmap() and friends, ioctl(). There are ABI
concerns in either case but I'd rather have this discussion in the open.
It doesn't necessarily mean that I endorse this proposal, I would like
feedback and not just from kernel developers but user space ones.

The summary of our internal discussions (mostly between kernel
developers) is that we can't properly describe a user ABI that covers
future syscalls or syscall extensions while not all syscalls accept
tagged pointers. So we tweaked the requirements slightly to only allow
tagged pointers back into the kernel *if* the originating address is
from an anonymous mmap() or below sbrk(0). This should cover some of the
ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a
pointer to a buffer obtained via mmap() on the device operations.

(sorry for not being clear on what Vincenzo's proposal implies)

Dave Martin Dec. 12, 2018, 5:01 p.m. UTC | #4

On Mon, Dec 10, 2018 at 01:50:57PM +0100, Andrey Konovalov wrote:
> arm64 has a feature called Top Byte Ignore, which allows to embed pointer
> tags into the top byte of each pointer. Userspace programs (such as
> HWASan, a memory debugging tool [1]) might use this feature and pass
> tagged user pointers to the kernel through syscalls or other interfaces.
> 
> Right now the kernel is already able to handle user faults with tagged
> pointers, due to these patches:
> 
> 1. 81cddd65 ("arm64: traps: fix userspace cache maintenance emulation on a
>              tagged pointer")
> 2. 7dcd9dd8 ("arm64: hw_breakpoint: fix watchpoint matching for tagged
>               pointers")
> 3. 276e9327 ("arm64: entry: improve data abort handling of tagged
>               pointers")
> 
> When passing tagged pointers to syscalls, there's a special case of such a
> pointer being passed to one of the memory syscalls (mmap, mprotect, etc.).
> These syscalls don't do memory accesses but rather deal with memory
> ranges, hence an untagged pointer is better suited.
> 
> This patchset extends tagged pointer support to non-memory syscalls. This
> is done by reusing the untagged_addr macro to untag user pointers when the
> kernel performs pointer checking to find out whether the pointer comes
> from userspace (most notably in access_ok). The untagging is done only
> when the pointer is being checked, the tag is preserved as the pointer
> makes its way through the kernel.
> 
> One of the alternative approaches to untagging that was considered is to
> completely strip the pointer tag as the pointer enters the kernel with
> some kind of a syscall wrapper, but that won't work with the countless
> number of different ioctl calls. With this approach we would need a custom
> wrapper for each ioctl variation, which doesn't seem practical.
> 
> The following testing approaches has been taken to find potential issues
> with user pointer untagging:
> 
> 1. Static testing (with sparse [2] and separately with a custom static
>    analyzer based on Clang) to track casts of __user pointers to integer
>    types to find places where untagging needs to be done.
> 
> 2. Dynamic testing: adding BUG_ON(has_tag(addr)) to find_vma() and running
>    a modified syzkaller version that passes tagged pointers to the kernel.
> 
> Based on the results of the testing the requried patches have been added
> to the patchset.
> 
> This patchset has been merged into the Pixel 2 kernel tree and is now
> being used to enable testing of Pixel 2 phones with HWASan.

Do you have an idea of how much of the user/kernel interface is covered
by this workload?

> This patchset is a prerequisite for ARM's memory tagging hardware feature
> support [3].

It looks like there's been a lot of progress made here towards smoking
out most of the sites in the kernel where pointers need to be untagged.

However, I do think that we need a clear policy for how existing kernel
interfaces are to be interpreted in the presence of tagged pointers.
Unless we have that nailed down, we are likely to be able to make only
vague guarantees to userspace about what works, and the ongoing risk
of ABI regressions and inconsistencies seems high.

I don't really see how we can advertise a full system interface if we
know some subset of it doesn't work for foreseeable userspace
environments.  I feel that presenting the current changes as an ABI
relaxation may be a recipe for future problems, since the forwards
compatibility guarantees we're able to make today are few and rather
vague.

Can we define an opt-in for tagged-pointer userspace, that rejects all
syscalls that we haven't checked and whitelisted (or that are
uncheckable like ioctl)?  This reflects the reality that we don't have
a regular userspace environment in which standards-compliant software
that uses address tags in a reasonable way will just work.

It might be feasible to promote this to be enabled by default later on,
if it becomes sufficiently complete.

In the meantime, I think we really need to nail down the kernel's
policies on

 * in the default configuration (without opt-in), is the presence of
non-address bits in pointers exchanged with the kernel simply
considered broken?  (Even with this series, the de factor answer
generally seems to be "yes", although many specific things will now
work fine)

 * if not, how do we tighten syscall / interface specifications to
describe what happens with pointers containing non-address bits, while
keeping the existing behaviour for untagged pointers?

We would want a general recipe that gives clear guidance on what
userspace should expect an arbitrarily chosen syscall to do with its
pointers, without having to enumerate each and every case.

To be sustainable, we would also need to solve that in a way that
doesn't need to be reintented per-arch.

There may already be some background on these topics -- can you throw me
a link if so?

Cheers
---Dave

Andrey Konovalov Dec. 18, 2018, 3:03 p.m. UTC | #5

On Wed, Dec 12, 2018 at 4:02 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> Hi Andrey,
>
> On Wed, Dec 12, 2018 at 03:23:25PM +0100, Andrey Konovalov wrote:
> > On Mon, Dec 10, 2018 at 3:31 PM Vincenzo Frascino
> > <vincenzo.frascino@arm.com> wrote:
> > > On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence
> > > the userspace (EL0) is allowed to set a non-zero value in the top
> > > byte but the resulting pointers are not allowed at the user-kernel
> > > syscall ABI boundary.
> > >
> > > This patchset proposes a relaxation of the ABI and a mechanism to
> > > advertise it to the userspace via an AT_FLAGS.
> > >
> > > The rationale behind the choice of AT_FLAGS is that the Unix System V
> > > ABI defines AT_FLAGS as "flags", leaving some degree of freedom in
> > > interpretation.
> > > There are two previous attempts of using AT_FLAGS in the Linux Kernel
> > > for different reasons: the first was more generic and was used to expose
> > > the support for the GNU STACK NX feature [1] and the second was done for
> > > the MIPS architecture and was used to expose the support of "MIPS ABI
> > > Extension for IEEE Std 754 Non-Compliant Interlinking" [2].
> > > Both the changes are currently _not_ merged in mainline.
> > > The only architecture that reserves some of the bits in AT_FLAGS is
> > > currently MIPS, which introduced the concept of platform specific ABI
> > > (psABI) reserving the top-byte [3].
> > >
> > > When ARM64_AT_FLAGS_SYSCALL_TBI is set the kernel is advertising
> > > to the userspace that a relaxed ABI is supported hence this type
> > > of pointers are now allowed to be passed to the syscalls when they are
> > > in memory ranges obtained by anonymous mmap() or brk().
> > >
> > > The userspace _must_ verify that the flag is set before passing tagged
> > > pointers to the syscalls allowed by this relaxation.
> > >
> > > More in general, exposing the ARM64_AT_FLAGS_SYSCALL_TBI flag and mandating
> > > to the software to check that the feature is present, before using the
> > > associated functionality, it provides a degree of control on the decision
> > > of disabling such a feature in future without consequently breaking the
> > > userspace.
> [...]
> > Acked-by: Andrey Konovalov <andreyknvl@google.com>
>
> Thanks for the ack. However, if we go ahead with this ABI proposal it
> means that your patches need to be reworked to allow a non-zero top byte
> in all syscalls, including mmap() and friends, ioctl(). There are ABI
> concerns in either case but I'd rather have this discussion in the open.
> It doesn't necessarily mean that I endorse this proposal, I would like
> feedback and not just from kernel developers but user space ones.
>
> The summary of our internal discussions (mostly between kernel
> developers) is that we can't properly describe a user ABI that covers
> future syscalls or syscall extensions while not all syscalls accept
> tagged pointers. So we tweaked the requirements slightly to only allow
> tagged pointers back into the kernel *if* the originating address is
> from an anonymous mmap() or below sbrk(0). This should cover some of the
> ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a
> pointer to a buffer obtained via mmap() on the device operations.
>
> (sorry for not being clear on what Vincenzo's proposal implies)

OK, I see. So I need to make the following changes to my patchset AFAIU.

1. Make sure that we only allow tagged user addresses that originate
from an anonymous mmap() or below sbrk(0). How exactly should this
check be performed?

2. Allow tagged addressed passed to memory syscalls (as long as (1) is
satisfied). Do I understand correctly that this means that I need to
locate all find_vma() callers outside of mm/ and fix them up as well?

Andrey Konovalov Dec. 18, 2018, 5:17 p.m. UTC | #6

On Wed, Dec 12, 2018 at 6:01 PM Dave Martin <Dave.Martin@arm.com> wrote:
>
> On Mon, Dec 10, 2018 at 01:50:57PM +0100, Andrey Konovalov wrote:
> > arm64 has a feature called Top Byte Ignore, which allows to embed pointer
> > tags into the top byte of each pointer. Userspace programs (such as
> > HWASan, a memory debugging tool [1]) might use this feature and pass
> > tagged user pointers to the kernel through syscalls or other interfaces.
> >
> > Right now the kernel is already able to handle user faults with tagged
> > pointers, due to these patches:
> >
> > 1. 81cddd65 ("arm64: traps: fix userspace cache maintenance emulation on a
> >              tagged pointer")
> > 2. 7dcd9dd8 ("arm64: hw_breakpoint: fix watchpoint matching for tagged
> >               pointers")
> > 3. 276e9327 ("arm64: entry: improve data abort handling of tagged
> >               pointers")
> >
> > When passing tagged pointers to syscalls, there's a special case of such a
> > pointer being passed to one of the memory syscalls (mmap, mprotect, etc.).
> > These syscalls don't do memory accesses but rather deal with memory
> > ranges, hence an untagged pointer is better suited.
> >
> > This patchset extends tagged pointer support to non-memory syscalls. This
> > is done by reusing the untagged_addr macro to untag user pointers when the
> > kernel performs pointer checking to find out whether the pointer comes
> > from userspace (most notably in access_ok). The untagging is done only
> > when the pointer is being checked, the tag is preserved as the pointer
> > makes its way through the kernel.
> >
> > One of the alternative approaches to untagging that was considered is to
> > completely strip the pointer tag as the pointer enters the kernel with
> > some kind of a syscall wrapper, but that won't work with the countless
> > number of different ioctl calls. With this approach we would need a custom
> > wrapper for each ioctl variation, which doesn't seem practical.
> >
> > The following testing approaches has been taken to find potential issues
> > with user pointer untagging:
> >
> > 1. Static testing (with sparse [2] and separately with a custom static
> >    analyzer based on Clang) to track casts of __user pointers to integer
> >    types to find places where untagging needs to be done.
> >
> > 2. Dynamic testing: adding BUG_ON(has_tag(addr)) to find_vma() and running
> >    a modified syzkaller version that passes tagged pointers to the kernel.
> >
> > Based on the results of the testing the requried patches have been added
> > to the patchset.
> >
> > This patchset has been merged into the Pixel 2 kernel tree and is now
> > being used to enable testing of Pixel 2 phones with HWASan.

Hi, Dave,

>
> Do you have an idea of how much of the user/kernel interface is covered
> by this workload?

Not really. I don't even know what kind of measurements can be used to
obtain this estimate. But Pixel 2 kernel with these patches + Android
runtime instrumented with HWASan works.

>
> > This patchset is a prerequisite for ARM's memory tagging hardware feature
> > support [3].
>
> It looks like there's been a lot of progress made here towards smoking
> out most of the sites in the kernel where pointers need to be untagged.
>
> However, I do think that we need a clear policy for how existing kernel
> interfaces are to be interpreted in the presence of tagged pointers.
> Unless we have that nailed down, we are likely to be able to make only
> vague guarantees to userspace about what works, and the ongoing risk
> of ABI regressions and inconsistencies seems high.
>
> I don't really see how we can advertise a full system interface if we
> know some subset of it doesn't work for foreseeable userspace
> environments.  I feel that presenting the current changes as an ABI
> relaxation may be a recipe for future problems, since the forwards
> compatibility guarantees we're able to make today are few and rather
> vague.
>
> Can we define an opt-in for tagged-pointer userspace, that rejects all
> syscalls that we haven't checked and whitelisted (or that are
> uncheckable like ioctl)?  This reflects the reality that we don't have
> a regular userspace environment in which standards-compliant software
> that uses address tags in a reasonable way will just work.
>
> It might be feasible to promote this to be enabled by default later on,
> if it becomes sufficiently complete.
>
>
> In the meantime, I think we really need to nail down the kernel's
> policies on
>
>  * in the default configuration (without opt-in), is the presence of
> non-address bits in pointers exchanged with the kernel simply
> considered broken?  (Even with this series, the de factor answer
> generally seems to be "yes", although many specific things will now
> work fine)
>
>  * if not, how do we tighten syscall / interface specifications to
> describe what happens with pointers containing non-address bits, while
> keeping the existing behaviour for untagged pointers?
>
> We would want a general recipe that gives clear guidance on what
> userspace should expect an arbitrarily chosen syscall to do with its
> pointers, without having to enumerate each and every case.
>
> To be sustainable, we would also need to solve that in a way that
> doesn't need to be reintented per-arch.

As I understand your main concern is userspace/kernel ABI changes
these patches introduce. This concern was already pointed out by
Catalin, and working out the details is still in progress.

>
> There may already be some background on these topics -- can you throw me
> a link if so?

I don't have a single link, I would suggest to look at the comments
for all the previous versions of this patchset. I see you saw the
pathset by Vincenzo, it also has some information about this.

>
> Cheers
> ---Dave

Thanks!

Catalin Marinas Dec. 18, 2018, 5:59 p.m. UTC | #7

On Tue, Dec 18, 2018 at 04:03:38PM +0100, Andrey Konovalov wrote:
> On Wed, Dec 12, 2018 at 4:02 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > The summary of our internal discussions (mostly between kernel
> > developers) is that we can't properly describe a user ABI that covers
> > future syscalls or syscall extensions while not all syscalls accept
> > tagged pointers. So we tweaked the requirements slightly to only allow
> > tagged pointers back into the kernel *if* the originating address is
> > from an anonymous mmap() or below sbrk(0). This should cover some of the
> > ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a
> > pointer to a buffer obtained via mmap() on the device operations.
> >
> > (sorry for not being clear on what Vincenzo's proposal implies)
> 
> OK, I see. So I need to make the following changes to my patchset AFAIU.
> 
> 1. Make sure that we only allow tagged user addresses that originate
> from an anonymous mmap() or below sbrk(0). How exactly should this
> check be performed?

I don't think we should perform such checks. That's rather stating that
the kernel only guarantees that the tagged pointers work if they
originated from these memory ranges.

> 2. Allow tagged addressed passed to memory syscalls (as long as (1) is
> satisfied). Do I understand correctly that this means that I need to
> locate all find_vma() callers outside of mm/ and fix them up as well?

Yes (unless anyone as a better idea or objections to this approach).

BTW, I'll be off until the new year, so won't be able to follow up.

Dave Martin Dec. 19, 2018, 12:52 p.m. UTC | #8

On Tue, Dec 18, 2018 at 05:59:38PM +0000, Catalin Marinas wrote:
> On Tue, Dec 18, 2018 at 04:03:38PM +0100, Andrey Konovalov wrote:
> > On Wed, Dec 12, 2018 at 4:02 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > The summary of our internal discussions (mostly between kernel
> > > developers) is that we can't properly describe a user ABI that covers
> > > future syscalls or syscall extensions while not all syscalls accept
> > > tagged pointers. So we tweaked the requirements slightly to only allow
> > > tagged pointers back into the kernel *if* the originating address is
> > > from an anonymous mmap() or below sbrk(0). This should cover some of the
> > > ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a
> > > pointer to a buffer obtained via mmap() on the device operations.
> > >
> > > (sorry for not being clear on what Vincenzo's proposal implies)
> > 
> > OK, I see. So I need to make the following changes to my patchset AFAIU.
> > 
> > 1. Make sure that we only allow tagged user addresses that originate
> > from an anonymous mmap() or below sbrk(0). How exactly should this
> > check be performed?
> 
> I don't think we should perform such checks. That's rather stating that
> the kernel only guarantees that the tagged pointers work if they
> originated from these memory ranges.

I concur.

Really, the kernel should do the expected thing with all "non-weird"
memory.

In lieu of a proper definition of "non-weird", I think we should have
some lists of things that are explicitly included, and also excluded:

OK:
	kernel-allocated process stack
	brk area
	MAP_ANONYMOUS | MAP_PRIVATE
	MAP_PRIVATE mappings of /dev/zero

Not OK:
	MAP_SHARED
	mmaps of non-memory-like devices
	mmaps of anything that is not a regular file
	the VDSO
	...

In general, userspace can tag memory that it "owns", and we do not assume
a transfer of ownership except in the "OK" list above.  Otherwise, it's
the kernel's memory, or the owner is simply not well defined.

I would also like to see advice for userspace developers, particularly
things like (strawman, please challenge!):

 * Userspace should set tags at the point of allocation only.

 * If you don't know how an object was allocated, you cannot modify the
   tag, period.

 * A single C object should be accessed using a single, fixed pointer tag
   throughout its entire lifetime.

 * Tags can be changed only when there are no outstanding pointers to
   the affected object or region that may be used to access the object
   or region (i.e., if the object were allocated from the C heap and
   is it safe to realloc() it, then it is safe to change the tag; for
   other types of allocation, analogous arguments can be applied).

 * When the kernel dereferences a pointer on userspace's behalf, it
   shall behave equivalently to userspace dereferencing the same pointer,
   including use of the same tag (where passed by userspace).

 * Where the pointer tag affects pointer dereference behaviour (i.e.,
   with hardware memory colouring) the kernel makes no guarantee to
   honour pointer tags correctly for every location a buffer based on a
   pointer passed by userspace to the kernel.

   (This means for example that for a read(fd, buf, size), we can check
   the tag for a single arbitrary location in *(char (*)[size])buf
   before passing the buffer to get_user_pages().  Hopefully this could
   be done in get_user_pages() itself rather than hunting call sites.
   For userspace, it means that you're on your own if you ask the
   kernel to operate on a buffer than spans multiple, independently-
   allocated objects, or a deliberately striped single object.)

 * The kernel shall not extend the lifetime of user pointers in ways
   that are not clear from the specification of the syscall or
   interface to which the pointer is passed (and in any case shall not
   extend pointer lifetimes without good reason).

   So no clever transparent caching between syscalls, unless it _really_
   is transparent in the presence of tags.

 * For purposes other than dereference, the kernel shall accept any
   legitimately tagged pointer (according to the above rules) as
   identifying the associated memory location.

   So, mprotect(some_page_aligned_object, ...); is valid irrespective
   of where page_aligned_object() came from.  There is no implicit
   derefence by the kernel here, hence no tag check.

   The kernel does not guarantee to work correctly if the wrong tag
   is used, but there is not always a well-defined "right" tag, so
   we can't really guarantee to check it.  So a pointer derived by
   any reasonable means by userspace has to be treated as equally
   valid.

We would need to get some cross-arch buy-in for this, otherwise core
maintainers might just refuse to maintain the necessary guarantees.

> > 2. Allow tagged addressed passed to memory syscalls (as long as (1) is
> > satisfied). Do I understand correctly that this means that I need to
> > locate all find_vma() callers outside of mm/ and fix them up as well?
> 
> Yes (unless anyone as a better idea or objections to this approach).

Also, watch out for code that pokes about inside struct vma directly.

I'm wondering, could we define an explicit type, say,

	struct user_vaddr {
		unsigned long addr;
	};

to replace the unsigned longs in struct vma the mm API?  This would
turn ad-hoc (unsigned long) casts into build breaks.  We could have
an explicit conversion functions, say,

	struct user_vaddr __user_vaddr_unsafe(void __user *);
	void __user *__user_ptr_unsafe(struct user_vaddr);

that we robotically insert in all the relevant places to mark
unaudited code.

This allows us to keep the kernel buildable, while flagging things
that will need review.  We would also need to warn the mm folks to
reject any new code using these unsafe conversions.

Of course, it would be a non-trivial effort...

> 
> BTW, I'll be off until the new year, so won't be able to follow up.

Cheers
---Dave

Catalin Marinas Feb. 11, 2019, 11:35 a.m. UTC | #9

Hi Dave,

On Wed, Dec 12, 2018 at 05:01:12PM +0000, Dave P Martin wrote:
> On Mon, Dec 10, 2018 at 01:50:57PM +0100, Andrey Konovalov wrote:
> > arm64 has a feature called Top Byte Ignore, which allows to embed pointer
> > tags into the top byte of each pointer. Userspace programs (such as
> > HWASan, a memory debugging tool [1]) might use this feature and pass
> > tagged user pointers to the kernel through syscalls or other interfaces.
[...]
> It looks like there's been a lot of progress made here towards smoking
> out most of the sites in the kernel where pointers need to be untagged.

In summary, based on last summer's analysis, there are two main (and
rather broad) scenarios of __user pointers use in the kernel: (a)
uaccess macros, together with access_ok() checks and (b) identifying
of user address ranges (find_vma() and related, some ioctls). The
patches here handle the former by allowing sign-extension in access_ok()
and subsequent uaccess routines work fine with tagged pointers.
Identifying the latter is a bit more problematic and the approach we
took was tracking down pointer to long conversion which seems to cover
the majority of cases. However, this approach doesn't scale as, for
example, we'd rather change get_user_pages() to sign-extend the address
rather than all the callers. In lots of other cases we don't even need
untagging as we don't expect user space to tag such pointers (i.e.
mmap() of device memory).

We might be able to improve the static analysis by introducing a
virt_addr_t but that's significant effort and we still won't cover all
cases (e.g. it doesn't necessarily catch tcp_zerocopy_receive() which
wouldn't use a pointer, just a u64 for address).

> However, I do think that we need a clear policy for how existing kernel
> interfaces are to be interpreted in the presence of tagged pointers.
> Unless we have that nailed down, we are likely to be able to make only
> vague guarantees to userspace about what works, and the ongoing risk
> of ABI regressions and inconsistencies seems high.

I agree.

> Can we define an opt-in for tagged-pointer userspace, that rejects all
> syscalls that we haven't checked and whitelisted (or that are
> uncheckable like ioctl)? 

Defining an opt-in is not a problem, however, rejecting all syscalls
that we haven't whitelisted is not feasible. We can have an opt-in per
process (that's what we were going to do with MTE) but the only thing
we can reasonably do is change the behaviour of access_ok(). That's too
big a knob and a new syscall that we haven't got around to whitelist may
just work. This eventually leads to de-facto ABI and our whitelist would
simply be ignored.

I'm not really keen on a big syscall shim in the arm64 kernel which
checks syscall arguments, including in-struct values. If we are to do
this, I'd rather keep it in user space as part of the C library.

> In the meantime, I think we really need to nail down the kernel's
> policies on
> 
>  * in the default configuration (without opt-in), is the presence of
> non-address bits in pointers exchanged with the kernel simply
> considered broken?  (Even with this series, the de factor answer
> generally seems to be "yes", although many specific things will now
> work fine)

Without these patches, passing non-address bits in pointers is
considered broken. I couldn't find a case where it would still work with
non-zero tag but maybe I haven't looked hard enough.

>  * if not, how do we tighten syscall / interface specifications to
> describe what happens with pointers containing non-address bits, while
> keeping the existing behaviour for untagged pointers?
> 
> We would want a general recipe that gives clear guidance on what
> userspace should expect an arbitrarily chosen syscall to do with its
> pointers, without having to enumerate each and every case.

That's what we are aiming with the pointer origins, to move away from a
syscall whitelist to a generic definition. That said, the two approaches
are orthogonal, we can use the pointer origins as the base rule for
which syscalls can be whitelisted.

If we step back a bit to look at the use-case for TBI (and MTE), the
normal application programmer shouldn't really care about this ABI
(well, most of the time). The app gets a tagged pointer from the C
library as a result of a malloc()/realloc() (possibly alloca()) call and
it expects to be able to pass it back into the kernel (usually via the C
library) without any awareness of the non-address bits. Now, we can't
define a user/kernel ABI based on the provenance of the pointer in user
space (i.e. we only support tags for heap and stack), so we are trying
to generalise this based where the pointer originated from in the kernel
(e.g. anonymous mmap()).

> There may already be some background on these topics -- can you throw me
> a link if so?

That's an interesting sub-thread to read:

https://lore.kernel.org/lkml/5d54526e5ff2e5ad63d0dfdd9ab17cf359afa4f2.1535629099.git.andreyknvl@google.com/

Dave Martin Feb. 11, 2019, 5:02 p.m. UTC | #10

On Mon, Feb 11, 2019 at 11:35:12AM +0000, Catalin Marinas wrote:
> Hi Dave,
> 
> On Wed, Dec 12, 2018 at 05:01:12PM +0000, Dave P Martin wrote:
> > On Mon, Dec 10, 2018 at 01:50:57PM +0100, Andrey Konovalov wrote:
> > > arm64 has a feature called Top Byte Ignore, which allows to embed pointer
> > > tags into the top byte of each pointer. Userspace programs (such as
> > > HWASan, a memory debugging tool [1]) might use this feature and pass
> > > tagged user pointers to the kernel through syscalls or other interfaces.
> [...]
> > It looks like there's been a lot of progress made here towards smoking
> > out most of the sites in the kernel where pointers need to be untagged.
> 
> In summary, based on last summer's analysis, there are two main (and
> rather broad) scenarios of __user pointers use in the kernel: (a)
> uaccess macros, together with access_ok() checks and (b) identifying
> of user address ranges (find_vma() and related, some ioctls). The
> patches here handle the former by allowing sign-extension in access_ok()
> and subsequent uaccess routines work fine with tagged pointers.
> Identifying the latter is a bit more problematic and the approach we
> took was tracking down pointer to long conversion which seems to cover
> the majority of cases. However, this approach doesn't scale as, for
> example, we'd rather change get_user_pages() to sign-extend the address
> rather than all the callers. In lots of other cases we don't even need
> untagging as we don't expect user space to tag such pointers (i.e.
> mmap() of device memory).
> 
> We might be able to improve the static analysis by introducing a
> virt_addr_t but that's significant effort and we still won't cover all
> cases (e.g. it doesn't necessarily catch tcp_zerocopy_receive() which
> wouldn't use a pointer, just a u64 for address).
> 
> > However, I do think that we need a clear policy for how existing kernel
> > interfaces are to be interpreted in the presence of tagged pointers.
> > Unless we have that nailed down, we are likely to be able to make only
> > vague guarantees to userspace about what works, and the ongoing risk
> > of ABI regressions and inconsistencies seems high.
> 
> I agree.
> 
> > Can we define an opt-in for tagged-pointer userspace, that rejects all
> > syscalls that we haven't checked and whitelisted (or that are
> > uncheckable like ioctl)? 
> 
> Defining an opt-in is not a problem, however, rejecting all syscalls
> that we haven't whitelisted is not feasible. We can have an opt-in per
> process (that's what we were going to do with MTE) but the only thing
> we can reasonably do is change the behaviour of access_ok(). That's too
> big a knob and a new syscall that we haven't got around to whitelist may
> just work. This eventually leads to de-facto ABI and our whitelist would
> simply be ignored.
> 
> I'm not really keen on a big syscall shim in the arm64 kernel which
> checks syscall arguments, including in-struct values. If we are to do
> this, I'd rather keep it in user space as part of the C library.
> 
> > In the meantime, I think we really need to nail down the kernel's
> > policies on
> > 
> >  * in the default configuration (without opt-in), is the presence of
> > non-address bits in pointers exchanged with the kernel simply
> > considered broken?  (Even with this series, the de factor answer
> > generally seems to be "yes", although many specific things will now
> > work fine)
> 
> Without these patches, passing non-address bits in pointers is
> considered broken. I couldn't find a case where it would still work with
> non-zero tag but maybe I haven't looked hard enough.
> 
> >  * if not, how do we tighten syscall / interface specifications to
> > describe what happens with pointers containing non-address bits, while
> > keeping the existing behaviour for untagged pointers?
> > 
> > We would want a general recipe that gives clear guidance on what
> > userspace should expect an arbitrarily chosen syscall to do with its
> > pointers, without having to enumerate each and every case.
> 
> That's what we are aiming with the pointer origins, to move away from a
> syscall whitelist to a generic definition. That said, the two approaches
> are orthogonal, we can use the pointer origins as the base rule for
> which syscalls can be whitelisted.
> 
> If we step back a bit to look at the use-case for TBI (and MTE), the
> normal application programmer shouldn't really care about this ABI
> (well, most of the time). The app gets a tagged pointer from the C
> library as a result of a malloc()/realloc() (possibly alloca()) call and
> it expects to be able to pass it back into the kernel (usually via the C
> library) without any awareness of the non-address bits. Now, we can't
> define a user/kernel ABI based on the provenance of the pointer in user
> space (i.e. we only support tags for heap and stack), so we are trying
> to generalise this based where the pointer originated from in the kernel
> (e.g. anonymous mmap()).

This sounds generally reasonable.

It is not adequate for describing changing the tag on already-tagged
memory (which a memory allocator will definitely do), but we may be able
to come up with some weasel words to cover that.

It is also not adequete for describing tagging (and retagging) regions
of the stack -- but as you say, we can rule that use-case out for now
in the interest of simplicity, since we know we wouldn't be able to
deploy it widely for now anyway due to the incompability with non-MTE-
capable hardware.


Ideally we would clarify user/kernel interface semantics in terms of
object and pointer lifetimes and accessibility, but that's a larger
project that should be pursued separately (if at all).

I could also quibble about whether "anonymous mmap" is the right thing
here -- we should still give specific examples of things that do / don't
qualify, to make it clear what we mean.

Cheers
---Dave

Kevin Brodsky Feb. 11, 2019, 5:28 p.m. UTC | #11

On 19/12/2018 12:52, Dave Martin wrote:
> On Tue, Dec 18, 2018 at 05:59:38PM +0000, Catalin Marinas wrote:
>> On Tue, Dec 18, 2018 at 04:03:38PM +0100, Andrey Konovalov wrote:
>>> On Wed, Dec 12, 2018 at 4:02 PM Catalin Marinas<catalin.marinas@arm.com>  wrote:
>>>> The summary of our internal discussions (mostly between kernel
>>>> developers) is that we can't properly describe a user ABI that covers
>>>> future syscalls or syscall extensions while not all syscalls accept
>>>> tagged pointers. So we tweaked the requirements slightly to only allow
>>>> tagged pointers back into the kernel *if* the originating address is
>>>> from an anonymous mmap() or below sbrk(0). This should cover some of the
>>>> ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a
>>>> pointer to a buffer obtained via mmap() on the device operations.
>>>>
>>>> (sorry for not being clear on what Vincenzo's proposal implies)
>>> OK, I see. So I need to make the following changes to my patchset AFAIU.
>>>
>>> 1. Make sure that we only allow tagged user addresses that originate
>>> from an anonymous mmap() or below sbrk(0). How exactly should this
>>> check be performed?
>> I don't think we should perform such checks. That's rather stating that
>> the kernel only guarantees that the tagged pointers work if they
>> originated from these memory ranges.
> I concur.
>
> Really, the kernel should do the expected thing with all "non-weird"
> memory.
>
> In lieu of a proper definition of "non-weird", I think we should have
> some lists of things that are explicitly included, and also excluded:
>
> OK:
> 	kernel-allocated process stack
> 	brk area
> 	MAP_ANONYMOUS | MAP_PRIVATE
> 	MAP_PRIVATE mappings of /dev/zero
>
> Not OK:
> 	MAP_SHARED
> 	mmaps of non-memory-like devices
> 	mmaps of anything that is not a regular file
> 	the VDSO
> 	...
>
> In general, userspace can tag memory that it "owns", and we do not assume
> a transfer of ownership except in the "OK" list above.  Otherwise, it's
> the kernel's memory, or the owner is simply not well defined.

Agreed on the general idea: a process should be able to pass tagged pointers at the 
syscall interface, as long as they point to memory privately owned by the process. I 
think it would be possible to simplify the definition of "non-weird" memory by using 
only this "OK" list:
- mmap() done by the process itself, where either:
   * flags = MAP_PRIVATE | MAP_ANONYMOUS
   * flags = MAP_PRIVATE and fd refers to a regular file or a well-defined list of 
device files (like /dev/zero)
- brk() done by the process itself
- Any memory mapped by the kernel in the new process's address space during execve(), 
with the same restrictions as above ([vdso]/[vvar] are therefore excluded)

> I would also like to see advice for userspace developers, particularly
> things like (strawman, please challenge!):

To some extent, one could call me a userspace developer, so I'll try to help :)

>   * Userspace should set tags at the point of allocation only.

Yes, tags are only supposed to be set at the point of either allocation or 
deallocation/reallocation. However, allocators can in principle be nested, so an 
allocator could  take a region allocated by malloc() as input and subdivide it 
(changing tags in the process). That said, this suballocator must not free() that 
region until all the suballocations themselves have been freed (thereby restoring the 
tags initially set by malloc()).

>   * If you don't know how an object was allocated, you cannot modify the
>     tag, period.

Agreed, allocators that tag memory can only operate on memory with a well-defined 
provenance (for instance anonymous mmap() or malloc()).

>   * A single C object should be accessed using a single, fixed pointer tag
>     throughout its entire lifetime.

Agreed. Allocators themselves may need to be excluded though, depending on how they 
represent their managed memory.

>   * Tags can be changed only when there are no outstanding pointers to
>     the affected object or region that may be used to access the object
>     or region (i.e., if the object were allocated from the C heap and
>     is it safe to realloc() it, then it is safe to change the tag; for
>     other types of allocation, analogous arguments can be applied).

Tags can only be changed at the point of deallocation/reallocation. Pointers to the 
object become invalid and cannot be used after it has been deallocated; memory 
tagging allows to catch such invalid usage.

>   * When the kernel dereferences a pointer on userspace's behalf, it
>     shall behave equivalently to userspace dereferencing the same pointer,
>     including use of the same tag (where passed by userspace).
>
>   * Where the pointer tag affects pointer dereference behaviour (i.e.,
>     with hardware memory colouring) the kernel makes no guarantee to
>     honour pointer tags correctly for every location a buffer based on a
>     pointer passed by userspace to the kernel.
>
>     (This means for example that for a read(fd, buf, size), we can check
>     the tag for a single arbitrary location in *(char (*)[size])buf
>     before passing the buffer to get_user_pages().  Hopefully this could
>     be done in get_user_pages() itself rather than hunting call sites.
>     For userspace, it means that you're on your own if you ask the
>     kernel to operate on a buffer than spans multiple, independently-
>     allocated objects, or a deliberately striped single object.)

I think both points are reasonable. It is very valuable for the kernel to access 
userspace memory using the user-provided tag, because it enables kernel accesses to 
be checked in the same way as user accesses, allowing to detect bugs that are 
potentially hard to find. For instance, if a pointer to an object is passed to the 
kernel after it has been deallocated, this is invalid and should be detected. 
However, you are absolutely right that the kernel cannot *guarantee* that such a 
check is carried out for the entire memory range (or in fact at all); it should be a 
best-effort approach.

>   * The kernel shall not extend the lifetime of user pointers in ways
>     that are not clear from the specification of the syscall or
>     interface to which the pointer is passed (and in any case shall not
>     extend pointer lifetimes without good reason).
>
>     So no clever transparent caching between syscalls, unless it _really_
>     is transparent in the presence of tags.

Do you have any particular case in mind? If such caching is really valuable, it is 
always possible to access the object while ignoring the tag. For sure, the 
user-provided tag can only be used during the syscall handling itself, not 
asynchronously later on, unless otherwise specified.

>   * For purposes other than dereference, the kernel shall accept any
>     legitimately tagged pointer (according to the above rules) as
>     identifying the associated memory location.
>
>     So, mprotect(some_page_aligned_object, ...); is valid irrespective
>     of where page_aligned_object() came from.  There is no implicit
>     derefence by the kernel here, hence no tag check.
>
>     The kernel does not guarantee to work correctly if the wrong tag
>     is used, but there is not always a well-defined "right" tag, so
>     we can't really guarantee to check it.  So a pointer derived by
>     any reasonable means by userspace has to be treated as equally
>     valid.

This is a disputed point :) In my opinion, this is the the most reasonable approach.

Cheers,
Kevin

> We would need to get some cross-arch buy-in for this, otherwise core
> maintainers might just refuse to maintain the necessary guarantees.
>
>
>>> 2. Allow tagged addressed passed to memory syscalls (as long as (1) is
>>> satisfied). Do I understand correctly that this means that I need to
>>> locate all find_vma() callers outside of mm/ and fix them up as well?
>> Yes (unless anyone as a better idea or objections to this approach).
> Also, watch out for code that pokes about inside struct vma directly.
>
> I'm wondering, could we define an explicit type, say,
>
> 	struct user_vaddr {
> 		unsigned long addr;
> 	};
>
> to replace the unsigned longs in struct vma the mm API?  This would
> turn ad-hoc (unsigned long) casts into build breaks.  We could have
> an explicit conversion functions, say,
>
> 	struct user_vaddr __user_vaddr_unsafe(void __user *);
> 	void __user *__user_ptr_unsafe(struct user_vaddr);
>
> that we robotically insert in all the relevant places to mark
> unaudited code.
>
> This allows us to keep the kernel buildable, while flagging things
> that will need review.  We would also need to warn the mm folks to
> reject any new code using these unsafe conversions.
>
> Of course, it would be a non-trivial effort...
>
>> BTW, I'll be off until the new year, so won't be able to follow up.
> Cheers
> ---Dave
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

Evgenii Stepanov Feb. 11, 2019, 8:32 p.m. UTC | #12

On Mon, Feb 11, 2019 at 9:28 AM Kevin Brodsky <kevin.brodsky@arm.com> wrote:
>
> On 19/12/2018 12:52, Dave Martin wrote:
> > On Tue, Dec 18, 2018 at 05:59:38PM +0000, Catalin Marinas wrote:
> >> On Tue, Dec 18, 2018 at 04:03:38PM +0100, Andrey Konovalov wrote:
> >>> On Wed, Dec 12, 2018 at 4:02 PM Catalin Marinas<catalin.marinas@arm.com>  wrote:
> >>>> The summary of our internal discussions (mostly between kernel
> >>>> developers) is that we can't properly describe a user ABI that covers
> >>>> future syscalls or syscall extensions while not all syscalls accept
> >>>> tagged pointers. So we tweaked the requirements slightly to only allow
> >>>> tagged pointers back into the kernel *if* the originating address is
> >>>> from an anonymous mmap() or below sbrk(0). This should cover some of the
> >>>> ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a
> >>>> pointer to a buffer obtained via mmap() on the device operations.
> >>>>
> >>>> (sorry for not being clear on what Vincenzo's proposal implies)
> >>> OK, I see. So I need to make the following changes to my patchset AFAIU.
> >>>
> >>> 1. Make sure that we only allow tagged user addresses that originate
> >>> from an anonymous mmap() or below sbrk(0). How exactly should this
> >>> check be performed?
> >> I don't think we should perform such checks. That's rather stating that
> >> the kernel only guarantees that the tagged pointers work if they
> >> originated from these memory ranges.
> > I concur.
> >
> > Really, the kernel should do the expected thing with all "non-weird"
> > memory.
> >
> > In lieu of a proper definition of "non-weird", I think we should have
> > some lists of things that are explicitly included, and also excluded:
> >
> > OK:
> >       kernel-allocated process stack
> >       brk area
> >       MAP_ANONYMOUS | MAP_PRIVATE
> >       MAP_PRIVATE mappings of /dev/zero
> >
> > Not OK:
> >       MAP_SHARED
> >       mmaps of non-memory-like devices
> >       mmaps of anything that is not a regular file
> >       the VDSO
> >       ...
> >
> > In general, userspace can tag memory that it "owns", and we do not assume
> > a transfer of ownership except in the "OK" list above.  Otherwise, it's
> > the kernel's memory, or the owner is simply not well defined.
>
> Agreed on the general idea: a process should be able to pass tagged pointers at the
> syscall interface, as long as they point to memory privately owned by the process. I
> think it would be possible to simplify the definition of "non-weird" memory by using
> only this "OK" list:
> - mmap() done by the process itself, where either:
>    * flags = MAP_PRIVATE | MAP_ANONYMOUS
>    * flags = MAP_PRIVATE and fd refers to a regular file or a well-defined list of
> device files (like /dev/zero)
> - brk() done by the process itself
> - Any memory mapped by the kernel in the new process's address space during execve(),
> with the same restrictions as above ([vdso]/[vvar] are therefore excluded)
>
> > I would also like to see advice for userspace developers, particularly
> > things like (strawman, please challenge!):
>
> To some extent, one could call me a userspace developer, so I'll try to help :)
>
> >   * Userspace should set tags at the point of allocation only.
>
> Yes, tags are only supposed to be set at the point of either allocation or
> deallocation/reallocation. However, allocators can in principle be nested, so an
> allocator could  take a region allocated by malloc() as input and subdivide it
> (changing tags in the process). That said, this suballocator must not free() that
> region until all the suballocations themselves have been freed (thereby restoring the
> tags initially set by malloc()).
>
> >   * If you don't know how an object was allocated, you cannot modify the
> >     tag, period.
>
> Agreed, allocators that tag memory can only operate on memory with a well-defined
> provenance (for instance anonymous mmap() or malloc()).
>
> >   * A single C object should be accessed using a single, fixed pointer tag
> >     throughout its entire lifetime.
>
> Agreed. Allocators themselves may need to be excluded though, depending on how they
> represent their managed memory.
>
> >   * Tags can be changed only when there are no outstanding pointers to
> >     the affected object or region that may be used to access the object
> >     or region (i.e., if the object were allocated from the C heap and
> >     is it safe to realloc() it, then it is safe to change the tag; for
> >     other types of allocation, analogous arguments can be applied).
>
> Tags can only be changed at the point of deallocation/reallocation. Pointers to the
> object become invalid and cannot be used after it has been deallocated; memory
> tagging allows to catch such invalid usage.
>
> >   * When the kernel dereferences a pointer on userspace's behalf, it
> >     shall behave equivalently to userspace dereferencing the same pointer,
> >     including use of the same tag (where passed by userspace).
> >
> >   * Where the pointer tag affects pointer dereference behaviour (i.e.,
> >     with hardware memory colouring) the kernel makes no guarantee to
> >     honour pointer tags correctly for every location a buffer based on a
> >     pointer passed by userspace to the kernel.
> >
> >     (This means for example that for a read(fd, buf, size), we can check
> >     the tag for a single arbitrary location in *(char (*)[size])buf
> >     before passing the buffer to get_user_pages().  Hopefully this could
> >     be done in get_user_pages() itself rather than hunting call sites.
> >     For userspace, it means that you're on your own if you ask the
> >     kernel to operate on a buffer than spans multiple, independently-
> >     allocated objects, or a deliberately striped single object.)
>
> I think both points are reasonable. It is very valuable for the kernel to access
> userspace memory using the user-provided tag, because it enables kernel accesses to
> be checked in the same way as user accesses, allowing to detect bugs that are
> potentially hard to find. For instance, if a pointer to an object is passed to the
> kernel after it has been deallocated, this is invalid and should be detected.
> However, you are absolutely right that the kernel cannot *guarantee* that such a
> check is carried out for the entire memory range (or in fact at all); it should be a
> best-effort approach.

It would also be valuable to narrow down the set of "relaxed" (i.e.
not tag-checking) syscalls as reasonably possible. We would want to
provide tag-checking userspace wrappers for any important calls that
are not checked in the kernel. Is it correct to assume that anything
that goes through copy_from_user  / copy_to_user is checked?

>
> >   * The kernel shall not extend the lifetime of user pointers in ways
> >     that are not clear from the specification of the syscall or
> >     interface to which the pointer is passed (and in any case shall not
> >     extend pointer lifetimes without good reason).
> >
> >     So no clever transparent caching between syscalls, unless it _really_
> >     is transparent in the presence of tags.
>
> Do you have any particular case in mind? If such caching is really valuable, it is
> always possible to access the object while ignoring the tag. For sure, the
> user-provided tag can only be used during the syscall handling itself, not
> asynchronously later on, unless otherwise specified.

For aio* operations it would be nice if the tag was checked at the
time of the actual userspace read/write, either instead of or in
addition to at the time of the system call.

>
> >   * For purposes other than dereference, the kernel shall accept any
> >     legitimately tagged pointer (according to the above rules) as
> >     identifying the associated memory location.
> >
> >     So, mprotect(some_page_aligned_object, ...); is valid irrespective
> >     of where page_aligned_object() came from.  There is no implicit
> >     derefence by the kernel here, hence no tag check.
> >
> >     The kernel does not guarantee to work correctly if the wrong tag
> >     is used, but there is not always a well-defined "right" tag, so
> >     we can't really guarantee to check it.  So a pointer derived by
> >     any reasonable means by userspace has to be treated as equally
> >     valid.
>
> This is a disputed point :) In my opinion, this is the the most reasonable approach.

Yes, it would be nice if the kernel explicitly promised, ex.
mprotect() over a range of differently tagged pages to be allowed
(i.e. address tag should be unchecked).


>
> Cheers,
> Kevin
>
> > We would need to get some cross-arch buy-in for this, otherwise core
> > maintainers might just refuse to maintain the necessary guarantees.
> >
> >
> >>> 2. Allow tagged addressed passed to memory syscalls (as long as (1) is
> >>> satisfied). Do I understand correctly that this means that I need to
> >>> locate all find_vma() callers outside of mm/ and fix them up as well?
> >> Yes (unless anyone as a better idea or objections to this approach).
> > Also, watch out for code that pokes about inside struct vma directly.
> >
> > I'm wondering, could we define an explicit type, say,
> >
> >       struct user_vaddr {
> >               unsigned long addr;
> >       };
> >
> > to replace the unsigned longs in struct vma the mm API?  This would
> > turn ad-hoc (unsigned long) casts into build breaks.  We could have
> > an explicit conversion functions, say,
> >
> >       struct user_vaddr __user_vaddr_unsafe(void __user *);
> >       void __user *__user_ptr_unsafe(struct user_vaddr);
> >
> > that we robotically insert in all the relevant places to mark
> > unaudited code.
> >
> > This allows us to keep the kernel buildable, while flagging things
> > that will need review.  We would also need to warn the mm folks to
> > reject any new code using these unsafe conversions.
> >
> > Of course, it would be a non-trivial effort...
> >
> >> BTW, I'll be off until the new year, so won't be able to follow up.
> > Cheers
> > ---Dave
> >
> > _______________________________________________
> > linux-arm-kernel mailing list
> > linux-arm-kernel@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>

Catalin Marinas Feb. 12, 2019, 6:02 p.m. UTC | #13

On Mon, Feb 11, 2019 at 12:32:55PM -0800, Evgenii Stepanov wrote:
> On Mon, Feb 11, 2019 at 9:28 AM Kevin Brodsky <kevin.brodsky@arm.com> wrote:
> > On 19/12/2018 12:52, Dave Martin wrote:
> > > Really, the kernel should do the expected thing with all "non-weird"
> > > memory.
> > >
> > > In lieu of a proper definition of "non-weird", I think we should have
> > > some lists of things that are explicitly included, and also excluded:
> > >
> > > OK:
> > >       kernel-allocated process stack
> > >       brk area
> > >       MAP_ANONYMOUS | MAP_PRIVATE
> > >       MAP_PRIVATE mappings of /dev/zero
> > >
> > > Not OK:
> > >       MAP_SHARED
> > >       mmaps of non-memory-like devices
> > >       mmaps of anything that is not a regular file
> > >       the VDSO
> > >       ...
> > >
> > > In general, userspace can tag memory that it "owns", and we do not assume
> > > a transfer of ownership except in the "OK" list above.  Otherwise, it's
> > > the kernel's memory, or the owner is simply not well defined.
> >
> > Agreed on the general idea: a process should be able to pass tagged pointers at the
> > syscall interface, as long as they point to memory privately owned by the process. I
> > think it would be possible to simplify the definition of "non-weird" memory by using
> > only this "OK" list:
> > - mmap() done by the process itself, where either:
> >    * flags = MAP_PRIVATE | MAP_ANONYMOUS
> >    * flags = MAP_PRIVATE and fd refers to a regular file or a well-defined list of
> > device files (like /dev/zero)
> > - brk() done by the process itself
> > - Any memory mapped by the kernel in the new process's address space during execve(),
> > with the same restrictions as above ([vdso]/[vvar] are therefore excluded)

Sounds reasonable.

> > >   * Userspace should set tags at the point of allocation only.
> >
> > Yes, tags are only supposed to be set at the point of either allocation or
> > deallocation/reallocation. However, allocators can in principle be nested, so an
> > allocator could  take a region allocated by malloc() as input and subdivide it
> > (changing tags in the process). That said, this suballocator must not free() that
> > region until all the suballocations themselves have been freed (thereby restoring the
> > tags initially set by malloc()).
> >
> > >   * If you don't know how an object was allocated, you cannot modify the
> > >     tag, period.
> >
> > Agreed, allocators that tag memory can only operate on memory with a well-defined
> > provenance (for instance anonymous mmap() or malloc()).
> >
> > >   * A single C object should be accessed using a single, fixed pointer tag
> > >     throughout its entire lifetime.
> >
> > Agreed. Allocators themselves may need to be excluded though, depending on how they
> > represent their managed memory.
> >
> > >   * Tags can be changed only when there are no outstanding pointers to
> > >     the affected object or region that may be used to access the object
> > >     or region (i.e., if the object were allocated from the C heap and
> > >     is it safe to realloc() it, then it is safe to change the tag; for
> > >     other types of allocation, analogous arguments can be applied).
> >
> > Tags can only be changed at the point of deallocation/reallocation. Pointers to the
> > object become invalid and cannot be used after it has been deallocated; memory
> > tagging allows to catch such invalid usage.

All the above sound well but that's mostly a guideline on what a C
library can do. It doesn't help much with defining the kernel ABI.
Anyway, it's good to clarify the use-cases.

> > >   * When the kernel dereferences a pointer on userspace's behalf, it
> > >     shall behave equivalently to userspace dereferencing the same pointer,
> > >     including use of the same tag (where passed by userspace).
> > >
> > >   * Where the pointer tag affects pointer dereference behaviour (i.e.,
> > >     with hardware memory colouring) the kernel makes no guarantee to
> > >     honour pointer tags correctly for every location a buffer based on a
> > >     pointer passed by userspace to the kernel.
> > >
> > >     (This means for example that for a read(fd, buf, size), we can check
> > >     the tag for a single arbitrary location in *(char (*)[size])buf
> > >     before passing the buffer to get_user_pages().  Hopefully this could
> > >     be done in get_user_pages() itself rather than hunting call sites.
> > >     For userspace, it means that you're on your own if you ask the
> > >     kernel to operate on a buffer than spans multiple, independently-
> > >     allocated objects, or a deliberately striped single object.)
> >
> > I think both points are reasonable. It is very valuable for the kernel to access
> > userspace memory using the user-provided tag, because it enables kernel accesses to
> > be checked in the same way as user accesses, allowing to detect bugs that are
> > potentially hard to find. For instance, if a pointer to an object is passed to the
> > kernel after it has been deallocated, this is invalid and should be detected.
> > However, you are absolutely right that the kernel cannot *guarantee* that such a
> > check is carried out for the entire memory range (or in fact at all); it should be a
> > best-effort approach.
> 
> It would also be valuable to narrow down the set of "relaxed" (i.e.
> not tag-checking) syscalls as reasonably possible. We would want to
> provide tag-checking userspace wrappers for any important calls that
> are not checked in the kernel. Is it correct to assume that anything
> that goes through copy_from_user  / copy_to_user is checked?

I lost track of the context of this thread but if it's just about
relaxing the ABI for hwasan, the kernel has no idea of the compiler
generated structures in user space, so nothing is checked.

If we talk about tags in the context of MTE, than yes, with the current
proposal the tag would be checked by copy_*_user() functions.

> > >   * The kernel shall not extend the lifetime of user pointers in ways
> > >     that are not clear from the specification of the syscall or
> > >     interface to which the pointer is passed (and in any case shall not
> > >     extend pointer lifetimes without good reason).
> > >
> > >     So no clever transparent caching between syscalls, unless it _really_
> > >     is transparent in the presence of tags.
> >
> > Do you have any particular case in mind? If such caching is really valuable, it is
> > always possible to access the object while ignoring the tag. For sure, the
> > user-provided tag can only be used during the syscall handling itself, not
> > asynchronously later on, unless otherwise specified.
> 
> For aio* operations it would be nice if the tag was checked at the
> time of the actual userspace read/write, either instead of or in
> addition to at the time of the system call.

With aio* (and synchronous iovec-based syscalls), the kernel may access
the memory while the corresponding user process is scheduled out. Given
that such access is not done in the context of the user process (and
using the user VA like copy_*_user), the kernel cannot handle potential
tag faults. Moreover, the transfer may be done by DMA and the device
does not understand tags.

I'd like to keep tags as a property of the pointer in a specific virtual
address space. The moment you convert it to a different address space
(e.g. kernel linear map, physical address), the tag property is stripped
and I don't think we should re-build it (and have it checked).

> > >   * For purposes other than dereference, the kernel shall accept any
> > >     legitimately tagged pointer (according to the above rules) as
> > >     identifying the associated memory location.
> > >
> > >     So, mprotect(some_page_aligned_object, ...); is valid irrespective
> > >     of where page_aligned_object() came from.  There is no implicit
> > >     derefence by the kernel here, hence no tag check.
> > >
> > >     The kernel does not guarantee to work correctly if the wrong tag
> > >     is used, but there is not always a well-defined "right" tag, so
> > >     we can't really guarantee to check it.  So a pointer derived by
> > >     any reasonable means by userspace has to be treated as equally
> > >     valid.
> >
> > This is a disputed point :) In my opinion, this is the the most reasonable approach.
> 
> Yes, it would be nice if the kernel explicitly promised, ex.
> mprotect() over a range of differently tagged pages to be allowed
> (i.e. address tag should be unchecked).

I don't think mprotect() over differently tagged pages was ever a
problem. I originally asked that mprotect() and friends do not accept
tagged pointers since these functions deal with memory ranges rather
than dereferencing such pointer (the reason being minimal kernel
changes). However, given how complicated it is to specify an ABI, I came
to the conclusion that a pointer passed to such function should be
allowed to have non-zero top byte. It would be the kernel's
responsibility to strip it out as appropriate.

Dave Martin Feb. 13, 2019, 2:58 p.m. UTC | #14

On Tue, Feb 12, 2019 at 06:02:24PM +0000, Catalin Marinas wrote:
> On Mon, Feb 11, 2019 at 12:32:55PM -0800, Evgenii Stepanov wrote:
> > On Mon, Feb 11, 2019 at 9:28 AM Kevin Brodsky <kevin.brodsky@arm.com> wrote:
> > > On 19/12/2018 12:52, Dave Martin wrote:

[...]

> > > >   * A single C object should be accessed using a single, fixed
> > > >     pointer tag throughout its entire lifetime.
> > >
> > > Agreed.  Allocators themselves may need to be excluded though,
> > > depending on how they represent their managed memory.
> > >
> > > >   * Tags can be changed only when there are no outstanding pointers to
> > > >     the affected object or region that may be used to access the object
> > > >     or region (i.e., if the object were allocated from the C heap and
> > > >     is it safe to realloc() it, then it is safe to change the tag; for
> > > >     other types of allocation, analogous arguments can be applied).
> > >
> > > Tags can only be changed at the point of deallocation/
> > > reallocation.  Pointers to the object become invalid and cannot
> > > be used after it has been deallocated; memory tagging allows to
> > > catch such invalid usage.
>
> All the above sound well but that's mostly a guideline on what a C
> library can do. It doesn't help much with defining the kernel ABI.
> Anyway, it's good to clarify the use-cases.

My aim was to clarify the use case in userspace, since I wasn't directly
involved in that.  The kernel ABI needs to be compatible with the the
use case, but doesn't need to specify must of it.

I'm wondering whether we can piggy-back on existing concepts.

We could say that recolouring memory is safe when and only when
unmapping of the page or removing permissions on the page (via
munmap/mremap/mprotect) would be safe.  Otherwise, the resulting
behaviour of the process is undefined.

Hopefully there are friendly fuzzers testing this kind of thing.

[...]

> > It would also be valuable to narrow down the set of "relaxed" (i.e.
> > not tag-checking) syscalls as reasonably possible. We would want to
> > provide tag-checking userspace wrappers for any important calls that
> > are not checked in the kernel. Is it correct to assume that anything
> > that goes through copy_from_user  / copy_to_user is checked?
> 
> I lost track of the context of this thread but if it's just about
> relaxing the ABI for hwasan, the kernel has no idea of the compiler
> generated structures in user space, so nothing is checked.
> 
> If we talk about tags in the context of MTE, than yes, with the current
> proposal the tag would be checked by copy_*_user() functions.

Also put_user() and friends?

It might be reasonable to do the check in access_ok() and skip it in
__put_user() etc.

(I seem to remember some separate discussion about abolishing
__put_user() and friends though, due to the accident risk they pose.)

> > For aio* operations it would be nice if the tag was checked at the
> > time of the actual userspace read/write, either instead of or in
> > addition to at the time of the system call.
> 
> With aio* (and synchronous iovec-based syscalls), the kernel may access
> the memory while the corresponding user process is scheduled out. Given
> that such access is not done in the context of the user process (and
> using the user VA like copy_*_user), the kernel cannot handle potential
> tag faults. Moreover, the transfer may be done by DMA and the device
> does not understand tags.
> 
> I'd like to keep tags as a property of the pointer in a specific virtual
> address space. The moment you convert it to a different address space
> (e.g. kernel linear map, physical address), the tag property is stripped
> and I don't think we should re-build it (and have it checked).

This is probably reasonable.

Ideally we would check the tag at the point of stripping it off, but
most likely it's going to be rather best-effort.

If memory tagging is essential a debugging feature then this seems
an acceptable compromise.

> > > >   * For purposes other than dereference, the kernel shall accept any
> > > >     legitimately tagged pointer (according to the above rules) as
> > > >     identifying the associated memory location.
> > > >
> > > >     So, mprotect(some_page_aligned_object, ...); is valid irrespective
> > > >     of where page_aligned_object() came from.  There is no implicit
> > > >     derefence by the kernel here, hence no tag check.
> > > >
> > > >     The kernel does not guarantee to work correctly if the wrong tag
> > > >     is used, but there is not always a well-defined "right" tag, so
> > > >     we can't really guarantee to check it.  So a pointer derived by
> > > >     any reasonable means by userspace has to be treated as equally
> > > >     valid.
> > >
> > > This is a disputed point :) In my opinion, this is the the most
> > > reasonable approach.
> > 
> > Yes, it would be nice if the kernel explicitly promised, ex.
> > mprotect() over a range of differently tagged pages to be allowed
> > (i.e. address tag should be unchecked).
> 
> I don't think mprotect() over differently tagged pages was ever a
> problem. I originally asked that mprotect() and friends do not accept
> tagged pointers since these functions deal with memory ranges rather
> than dereferencing such pointer (the reason being minimal kernel
> changes). However, given how complicated it is to specify an ABI, I came
> to the conclusion that a pointer passed to such function should be
> allowed to have non-zero top byte. It would be the kernel's
> responsibility to strip it out as appropriate.

I think that if the page range is all the same colour then it should be
legitimate to pass a matching tag.

But it doesn't seem reasonable for the kernel to require this.  If
free() calls munmap(), the page(s) will contain possibly randomly-
coloured garbage.  There's no correct tag to pass in such a case.

The most obvious solution is just to ignore the tags passed by userspace
to such syscalls.  This would imply that the kernel must explicitly
strip it out, as you suggest.

The number of affected syscalls is relatively small though.

Cheers
---Dave

Kevin Brodsky Feb. 13, 2019, 4:42 p.m. UTC | #15

(+Cc other people with MTE experience: Branislav, Ruben)

On 13/02/2019 14:58, Dave Martin wrote:
> On Tue, Feb 12, 2019 at 06:02:24PM +0000, Catalin Marinas wrote:
>> On Mon, Feb 11, 2019 at 12:32:55PM -0800, Evgenii Stepanov wrote:
>>> On Mon, Feb 11, 2019 at 9:28 AM Kevin Brodsky <kevin.brodsky@arm.com> wrote:
>>>> On 19/12/2018 12:52, Dave Martin wrote:
> [...]
>
>>>>>    * A single C object should be accessed using a single, fixed
>>>>>      pointer tag throughout its entire lifetime.
>>>> Agreed.  Allocators themselves may need to be excluded though,
>>>> depending on how they represent their managed memory.
>>>>
>>>>>    * Tags can be changed only when there are no outstanding pointers to
>>>>>      the affected object or region that may be used to access the object
>>>>>      or region (i.e., if the object were allocated from the C heap and
>>>>>      is it safe to realloc() it, then it is safe to change the tag; for
>>>>>      other types of allocation, analogous arguments can be applied).
>>>> Tags can only be changed at the point of deallocation/
>>>> reallocation.  Pointers to the object become invalid and cannot
>>>> be used after it has been deallocated; memory tagging allows to
>>>> catch such invalid usage.
>> All the above sound well but that's mostly a guideline on what a C
>> library can do. It doesn't help much with defining the kernel ABI.
>> Anyway, it's good to clarify the use-cases.
> My aim was to clarify the use case in userspace, since I wasn't directly
> involved in that.  The kernel ABI needs to be compatible with the the
> use case, but doesn't need to specify must of it.
>
> I'm wondering whether we can piggy-back on existing concepts.
>
> We could say that recolouring memory is safe when and only when
> unmapping of the page or removing permissions on the page (via
> munmap/mremap/mprotect) would be safe.  Otherwise, the resulting
> behaviour of the process is undefined.

Is that a sufficient requirement? I don't think that anything prevents you from using 
mprotect() on say [vvar], but we don't necessarily want to map [vvar] as tagged. I'm 
not sure it's easy to define what "safe" would mean here.

> Hopefully there are friendly fuzzers testing this kind of thing.
>
> [...]
>
>>> It would also be valuable to narrow down the set of "relaxed" (i.e.
>>> not tag-checking) syscalls as reasonably possible. We would want to
>>> provide tag-checking userspace wrappers for any important calls that
>>> are not checked in the kernel. Is it correct to assume that anything
>>> that goes through copy_from_user  / copy_to_user is checked?
>> I lost track of the context of this thread but if it's just about
>> relaxing the ABI for hwasan, the kernel has no idea of the compiler
>> generated structures in user space, so nothing is checked.
>>
>> If we talk about tags in the context of MTE, than yes, with the current
>> proposal the tag would be checked by copy_*_user() functions.
> Also put_user() and friends?
>
> It might be reasonable to do the check in access_ok() and skip it in
> __put_user() etc.
>
> (I seem to remember some separate discussion about abolishing
> __put_user() and friends though, due to the accident risk they pose.)

Keep in mind that with MTE, there is no need to do any explicit check when accessing 
user memory via a user-provided pointer. The tagged user pointer is directly passed 
to copy_*_user() or put_user(). If the load/store causes a tag fault, then it is 
handled just like a page fault (i.e. invoking the fixup handler). As far as I can 
tell, there's no need to do anything special in access_ok() in that case.

[The above applies to precise mode. In imprecise mode, some more work will be needed 
after the load/store to check whether a tag fault happened.]

>
>>> For aio* operations it would be nice if the tag was checked at the
>>> time of the actual userspace read/write, either instead of or in
>>> addition to at the time of the system call.
>> With aio* (and synchronous iovec-based syscalls), the kernel may access
>> the memory while the corresponding user process is scheduled out. Given
>> that such access is not done in the context of the user process (and
>> using the user VA like copy_*_user), the kernel cannot handle potential
>> tag faults. Moreover, the transfer may be done by DMA and the device
>> does not understand tags.
>>
>> I'd like to keep tags as a property of the pointer in a specific virtual
>> address space. The moment you convert it to a different address space
>> (e.g. kernel linear map, physical address), the tag property is stripped
>> and I don't think we should re-build it (and have it checked).
> This is probably reasonable.
>
> Ideally we would check the tag at the point of stripping it off, but
> most likely it's going to be rather best-effort.
>
> If memory tagging is essential a debugging feature then this seems
> an acceptable compromise.

There are many possible ways to deploy MTE, and debugging is just one of them. For 
instance, you may want to turn on heap colouring for some processes in the system, 
including in production.

Regarding those cases where it is impossible to check tags at the point of accessing 
user memory, it is indeed possible to check the memory tags at the point of stripping 
the tag from the user pointer. Given that some MTE use-cases favour performance over 
tag check coverage, the ideal approach would be to make these checks configurable 
(e.g. check one granule, check all of them, or check none). I don't know how feasible 
this is in practice.

Kevin

>
>>>>>    * For purposes other than dereference, the kernel shall accept any
>>>>>      legitimately tagged pointer (according to the above rules) as
>>>>>      identifying the associated memory location.
>>>>>
>>>>>      So, mprotect(some_page_aligned_object, ...); is valid irrespective
>>>>>      of where page_aligned_object() came from.  There is no implicit
>>>>>      derefence by the kernel here, hence no tag check.
>>>>>
>>>>>      The kernel does not guarantee to work correctly if the wrong tag
>>>>>      is used, but there is not always a well-defined "right" tag, so
>>>>>      we can't really guarantee to check it.  So a pointer derived by
>>>>>      any reasonable means by userspace has to be treated as equally
>>>>>      valid.
>>>> This is a disputed point :) In my opinion, this is the the most
>>>> reasonable approach.
>>> Yes, it would be nice if the kernel explicitly promised, ex.
>>> mprotect() over a range of differently tagged pages to be allowed
>>> (i.e. address tag should be unchecked).
>> I don't think mprotect() over differently tagged pages was ever a
>> problem. I originally asked that mprotect() and friends do not accept
>> tagged pointers since these functions deal with memory ranges rather
>> than dereferencing such pointer (the reason being minimal kernel
>> changes). However, given how complicated it is to specify an ABI, I came
>> to the conclusion that a pointer passed to such function should be
>> allowed to have non-zero top byte. It would be the kernel's
>> responsibility to strip it out as appropriate.
> I think that if the page range is all the same colour then it should be
> legitimate to pass a matching tag.
>
> But it doesn't seem reasonable for the kernel to require this.  If
> free() calls munmap(), the page(s) will contain possibly randomly-
> coloured garbage.  There's no correct tag to pass in such a case.
>
> The most obvious solution is just to ignore the tags passed by userspace
> to such syscalls.  This would imply that the kernel must explicitly
> strip it out, as you suggest.
>
> The number of affected syscalls is relatively small though.
>
> Cheers
> ---Dave

Dave Martin Feb. 13, 2019, 5:43 p.m. UTC | #16

On Wed, Feb 13, 2019 at 04:42:11PM +0000, Kevin Brodsky wrote:
> (+Cc other people with MTE experience: Branislav, Ruben)

[...]

> >I'm wondering whether we can piggy-back on existing concepts.
> >
> >We could say that recolouring memory is safe when and only when
> >unmapping of the page or removing permissions on the page (via
> >munmap/mremap/mprotect) would be safe.  Otherwise, the resulting
> >behaviour of the process is undefined.
> 
> Is that a sufficient requirement? I don't think that anything prevents you
> from using mprotect() on say [vvar], but we don't necessarily want to map
> [vvar] as tagged. I'm not sure it's easy to define what "safe" would mean
> here.

I think the origin rules have to apply too: [vvar] is not a regular,
private page but a weird, shared thing mapped for you by the kernel.

Presumably userspace _cannot_ do mprotect(PROT_WRITE) on it.

I'm also assuming that userspace cannot recolour memory in read-only
pages.  That sounds bad if there's no way to prevent it.

[...]

> >It might be reasonable to do the check in access_ok() and skip it in
> >__put_user() etc.
> >
> >(I seem to remember some separate discussion about abolishing
> >__put_user() and friends though, due to the accident risk they pose.)
> 
> Keep in mind that with MTE, there is no need to do any explicit check when
> accessing user memory via a user-provided pointer. The tagged user pointer
> is directly passed to copy_*_user() or put_user(). If the load/store causes
> a tag fault, then it is handled just like a page fault (i.e. invoking the
> fixup handler). As far as I can tell, there's no need to do anything special
> in access_ok() in that case.
> 
> [The above applies to precise mode. In imprecise mode, some more work will
> be needed after the load/store to check whether a tag fault happened.]

Fair enough, I'm a bit hazy on the details as of right now..

[...]

> There are many possible ways to deploy MTE, and debugging is just one of
> them. For instance, you may want to turn on heap colouring for some
> processes in the system, including in production.

To implement enforceable protection, or as a diagnostic tool for when
something goes wrong?

In the latter case it's still OK for the kernel's tag checking not to be
exhaustive.

> Regarding those cases where it is impossible to check tags at the point of
> accessing user memory, it is indeed possible to check the memory tags at the
> point of stripping the tag from the user pointer. Given that some MTE
> use-cases favour performance over tag check coverage, the ideal approach
> would be to make these checks configurable (e.g. check one granule, check
> all of them, or check none). I don't know how feasible this is in practice.

Check all granules of a massive DMA buffer?

That doesn't sounds feasible without explicit support in the hardware to
have the DMA check tags itself as the memory is accessed.  MTE by itself
doesn't provide for this IIUC (at least, it would require support in the
platform, not just the CPU).

We do not want to bake any assumptions into the ABI about whether a
given data transfer may or may not be offloaded to DMA.  That feels
like a slippery slope.

Providing we get the checks for free in put_user/get_user/
copy_{to,from}_user(), those will cover a lot of cases though, for
non-bulk-IO cases.

My assumption has been that at this point in time we are mainly aiming
to support the debug/diagnostic use cases today.

At least, those are the low(ish)-hanging fruit.

Others are better placed than me to comment on the goals here.

Cheers
---Dave

Evgenii Stepanov Feb. 13, 2019, 9:41 p.m. UTC | #17

On Wed, Feb 13, 2019 at 9:43 AM Dave Martin <Dave.Martin@arm.com> wrote:
>
> On Wed, Feb 13, 2019 at 04:42:11PM +0000, Kevin Brodsky wrote:
> > (+Cc other people with MTE experience: Branislav, Ruben)
>
> [...]
>
> > >I'm wondering whether we can piggy-back on existing concepts.
> > >
> > >We could say that recolouring memory is safe when and only when
> > >unmapping of the page or removing permissions on the page (via
> > >munmap/mremap/mprotect) would be safe.  Otherwise, the resulting
> > >behaviour of the process is undefined.
> >
> > Is that a sufficient requirement? I don't think that anything prevents you
> > from using mprotect() on say [vvar], but we don't necessarily want to map
> > [vvar] as tagged. I'm not sure it's easy to define what "safe" would mean
> > here.
>
> I think the origin rules have to apply too: [vvar] is not a regular,
> private page but a weird, shared thing mapped for you by the kernel.
>
> Presumably userspace _cannot_ do mprotect(PROT_WRITE) on it.
>
> I'm also assuming that userspace cannot recolour memory in read-only
> pages.  That sounds bad if there's no way to prevent it.

That sounds like something we would like to do to catch out of bounds
read of .rodata globals.
Another potentially interesting use case for MTE is infinite hardware
watchpoints - that would require trapping reads for individual tagging
granules, include those in read-only binary segment.

>
> [...]
>
> > >It might be reasonable to do the check in access_ok() and skip it in
> > >__put_user() etc.
> > >
> > >(I seem to remember some separate discussion about abolishing
> > >__put_user() and friends though, due to the accident risk they pose.)
> >
> > Keep in mind that with MTE, there is no need to do any explicit check when
> > accessing user memory via a user-provided pointer. The tagged user pointer
> > is directly passed to copy_*_user() or put_user(). If the load/store causes
> > a tag fault, then it is handled just like a page fault (i.e. invoking the
> > fixup handler). As far as I can tell, there's no need to do anything special
> > in access_ok() in that case.
> >
> > [The above applies to precise mode. In imprecise mode, some more work will
> > be needed after the load/store to check whether a tag fault happened.]
>
> Fair enough, I'm a bit hazy on the details as of right now..
>
> [...]
>
> > There are many possible ways to deploy MTE, and debugging is just one of
> > them. For instance, you may want to turn on heap colouring for some
> > processes in the system, including in production.
>
> To implement enforceable protection, or as a diagnostic tool for when
> something goes wrong?
>
> In the latter case it's still OK for the kernel's tag checking not to be
> exhaustive.
>
> > Regarding those cases where it is impossible to check tags at the point of
> > accessing user memory, it is indeed possible to check the memory tags at the
> > point of stripping the tag from the user pointer. Given that some MTE
> > use-cases favour performance over tag check coverage, the ideal approach
> > would be to make these checks configurable (e.g. check one granule, check
> > all of them, or check none). I don't know how feasible this is in practice.
>
> Check all granules of a massive DMA buffer?
>
> That doesn't sounds feasible without explicit support in the hardware to
> have the DMA check tags itself as the memory is accessed.  MTE by itself
> doesn't provide for this IIUC (at least, it would require support in the
> platform, not just the CPU).
>
> We do not want to bake any assumptions into the ABI about whether a
> given data transfer may or may not be offloaded to DMA.  That feels
> like a slippery slope.
>
> Providing we get the checks for free in put_user/get_user/
> copy_{to,from}_user(), those will cover a lot of cases though, for
> non-bulk-IO cases.
>
>
> My assumption has been that at this point in time we are mainly aiming
> to support the debug/diagnostic use cases today.
>
> At least, those are the low(ish)-hanging fruit.
>
> Others are better placed than me to comment on the goals here.
>
> Cheers
> ---Dave

Kevin Brodsky Feb. 14, 2019, 11:22 a.m. UTC | #18

On 13/02/2019 21:41, Evgenii Stepanov wrote:
> On Wed, Feb 13, 2019 at 9:43 AM Dave Martin <Dave.Martin@arm.com> wrote:
>> On Wed, Feb 13, 2019 at 04:42:11PM +0000, Kevin Brodsky wrote:
>>> (+Cc other people with MTE experience: Branislav, Ruben)
>> [...]
>>
>>>> I'm wondering whether we can piggy-back on existing concepts.
>>>>
>>>> We could say that recolouring memory is safe when and only when
>>>> unmapping of the page or removing permissions on the page (via
>>>> munmap/mremap/mprotect) would be safe.  Otherwise, the resulting
>>>> behaviour of the process is undefined.
>>> Is that a sufficient requirement? I don't think that anything prevents you
>>> from using mprotect() on say [vvar], but we don't necessarily want to map
>>> [vvar] as tagged. I'm not sure it's easy to define what "safe" would mean
>>> here.
>> I think the origin rules have to apply too: [vvar] is not a regular,
>> private page but a weird, shared thing mapped for you by the kernel.
>>
>> Presumably userspace _cannot_ do mprotect(PROT_WRITE) on it.
>>
>> I'm also assuming that userspace cannot recolour memory in read-only
>> pages.  That sounds bad if there's no way to prevent it.
> That sounds like something we would like to do to catch out of bounds
> read of .rodata globals.
> Another potentially interesting use case for MTE is infinite hardware
> watchpoints - that would require trapping reads for individual tagging
> granules, include those in read-only binary segment.

I think we should keep this discussion for a later, separate thread. Vincenzo's 
proposal is about allowing userspace to pass tags at the syscall interface. The set 
of mappings allowed to be tagged by userspace (in MTE) should be contained in the set 
of mappings that userspace can pass tagged pointers to (at the syscall interface), 
but they are not necessarily the same. Private read-only mappings are an edge case 
(you can pass tagged pointers to them, the memory may or may not be mapped as tagged, 
but in any case it is not possible to change the memory tags via such mapping).

>
>> [...]
>>
>>>> It might be reasonable to do the check in access_ok() and skip it in
>>>> __put_user() etc.
>>>>
>>>> (I seem to remember some separate discussion about abolishing
>>>> __put_user() and friends though, due to the accident risk they pose.)
>>> Keep in mind that with MTE, there is no need to do any explicit check when
>>> accessing user memory via a user-provided pointer. The tagged user pointer
>>> is directly passed to copy_*_user() or put_user(). If the load/store causes
>>> a tag fault, then it is handled just like a page fault (i.e. invoking the
>>> fixup handler). As far as I can tell, there's no need to do anything special
>>> in access_ok() in that case.
>>>
>>> [The above applies to precise mode. In imprecise mode, some more work will
>>> be needed after the load/store to check whether a tag fault happened.]
>> Fair enough, I'm a bit hazy on the details as of right now..
>>
>> [...]
>>
>>> There are many possible ways to deploy MTE, and debugging is just one of
>>> them. For instance, you may want to turn on heap colouring for some
>>> processes in the system, including in production.
>> To implement enforceable protection, or as a diagnostic tool for when
>> something goes wrong?
>>
>> In the latter case it's still OK for the kernel's tag checking not to be
>> exhaustive.
>>
>>> Regarding those cases where it is impossible to check tags at the point of
>>> accessing user memory, it is indeed possible to check the memory tags at the
>>> point of stripping the tag from the user pointer. Given that some MTE
>>> use-cases favour performance over tag check coverage, the ideal approach
>>> would be to make these checks configurable (e.g. check one granule, check
>>> all of them, or check none). I don't know how feasible this is in practice.
>> Check all granules of a massive DMA buffer?
>>
>> That doesn't sounds feasible without explicit support in the hardware to
>> have the DMA check tags itself as the memory is accessed.  MTE by itself
>> doesn't provide for this IIUC (at least, it would require support in the
>> platform, not just the CPU).
>>
>> We do not want to bake any assumptions into the ABI about whether a
>> given data transfer may or may not be offloaded to DMA.  That feels
>> like a slippery slope.
>>
>> Providing we get the checks for free in put_user/get_user/
>> copy_{to,from}_user(), those will cover a lot of cases though, for
>> non-bulk-IO cases.
>>
>>
>> My assumption has been that at this point in time we are mainly aiming
>> to support the debug/diagnostic use cases today.

MTE can be used both for diagnostics (imprecise mode is especially suitable for 
that), and to halt execution when something wrong is detected. Even in the latter 
case, one cannot expect exhaustive checking from MTE, because the way it works is 
fundamentally statistical; an invalid pointer may by chance have the right tag to 
access the given location. So again, I think that a best-effort approach is 
appropriate when the kernel accesses user memory, in terms of checking that tags match.

More specifically, different use-cases come with different tradeoffs (performance / 
tag check coverage). That's why I am suggesting that in the cases where tag checks 
would need to be done _explicitly_ (before losing the user-provided tag), it would be 
nice to be able to choose how much should be checked. I am not suggesting that always 
checking all the granules by default is sane. Maybe checking just the first granule 
is the right default.

I don't think we need to get to the bottom of this specific aspect at this point. 
This ABI proposal is not about memory tagging, so there is no need to specify how or 
when tag checking is done. As long as this ABI allows tagged pointers, pointing to 
mappings that could be potentially tagged, to be passed to syscalls, I don't think 
further relaxations are needed to enable memory tagging.

Kevin

>>
>> At least, those are the low(ish)-hanging fruit.
>>
>> Others are better placed than me to comment on the goals here.
>>
>> Cheers
>> ---Dave

Szabolcs Nagy Feb. 19, 2019, 6:38 p.m. UTC | #19

On 12/02/2019 18:02, Catalin Marinas wrote:
> On Mon, Feb 11, 2019 at 12:32:55PM -0800, Evgenii Stepanov wrote:
>> On Mon, Feb 11, 2019 at 9:28 AM Kevin Brodsky <kevin.brodsky@arm.com> wrote:
>>> On 19/12/2018 12:52, Dave Martin wrote:
>>>> Really, the kernel should do the expected thing with all "non-weird"
>>>> memory.
>>>>
>>>> In lieu of a proper definition of "non-weird", I think we should have
>>>> some lists of things that are explicitly included, and also excluded:
>>>>
>>>> OK:
>>>>       kernel-allocated process stack
>>>>       brk area
>>>>       MAP_ANONYMOUS | MAP_PRIVATE
>>>>       MAP_PRIVATE mappings of /dev/zero
>>>>
>>>> Not OK:
>>>>       MAP_SHARED
>>>>       mmaps of non-memory-like devices
>>>>       mmaps of anything that is not a regular file
>>>>       the VDSO
>>>>       ...
>>>>
>>>> In general, userspace can tag memory that it "owns", and we do not assume
>>>> a transfer of ownership except in the "OK" list above.  Otherwise, it's
>>>> the kernel's memory, or the owner is simply not well defined.
>>>
>>> Agreed on the general idea: a process should be able to pass tagged pointers at the
>>> syscall interface, as long as they point to memory privately owned by the process. I
>>> think it would be possible to simplify the definition of "non-weird" memory by using
>>> only this "OK" list:
>>> - mmap() done by the process itself, where either:
>>>    * flags = MAP_PRIVATE | MAP_ANONYMOUS
>>>    * flags = MAP_PRIVATE and fd refers to a regular file or a well-defined list of
>>> device files (like /dev/zero)
>>> - brk() done by the process itself
>>> - Any memory mapped by the kernel in the new process's address space during execve(),
>>> with the same restrictions as above ([vdso]/[vvar] are therefore excluded)
> 
> Sounds reasonable.

OK. this non-weird memory definition works for me too.

rule 1: if weird memory pointers are passed to the kernel
with top byte set then the behaviour is undefined.

>>>>   * When the kernel dereferences a pointer on userspace's behalf, it
>>>>     shall behave equivalently to userspace dereferencing the same pointer,
>>>>     including use of the same tag (where passed by userspace).
>>>>
>>>>   * Where the pointer tag affects pointer dereference behaviour (i.e.,
>>>>     with hardware memory colouring) the kernel makes no guarantee to
>>>>     honour pointer tags correctly for every location a buffer based on a
>>>>     pointer passed by userspace to the kernel.
>>>>
>>>>     (This means for example that for a read(fd, buf, size), we can check
>>>>     the tag for a single arbitrary location in *(char (*)[size])buf
>>>>     before passing the buffer to get_user_pages().  Hopefully this could
>>>>     be done in get_user_pages() itself rather than hunting call sites.
>>>>     For userspace, it means that you're on your own if you ask the
>>>>     kernel to operate on a buffer than spans multiple, independently-
>>>>     allocated objects, or a deliberately striped single object.)
>>>
>>> I think both points are reasonable. It is very valuable for the kernel to access
>>> userspace memory using the user-provided tag, because it enables kernel accesses to
>>> be checked in the same way as user accesses, allowing to detect bugs that are
>>> potentially hard to find. For instance, if a pointer to an object is passed to the
>>> kernel after it has been deallocated, this is invalid and should be detected.
>>> However, you are absolutely right that the kernel cannot *guarantee* that such a
>>> check is carried out for the entire memory range (or in fact at all); it should be a
>>> best-effort approach.
>>
>> It would also be valuable to narrow down the set of "relaxed" (i.e.
>> not tag-checking) syscalls as reasonably possible. We would want to
>> provide tag-checking userspace wrappers for any important calls that
>> are not checked in the kernel. Is it correct to assume that anything
>> that goes through copy_from_user  / copy_to_user is checked?
> 
> I lost track of the context of this thread but if it's just about
> relaxing the ABI for hwasan, the kernel has no idea of the compiler
> generated structures in user space, so nothing is checked.
> 
> If we talk about tags in the context of MTE, than yes, with the current
> proposal the tag would be checked by copy_*_user() functions.

rule 2: kernel derefs as if user derefs when non-weird memory
pointers are passed to the kernel.

note that the important bit is what happens on valid pointer
derefs, invalid pointer deref is usually undefined for user
programs, so what happens in case of mte tag failures is
more of a QoI issue than abi i think.

(e.g. EFAULT is not guaranteed by the kernel currently, i can
successfully do write(open("/dev/null",O_WRONLY), 0, 1), or
get a crash when passing invalid pointer to a vdso function,
so userspace should not rely on some strict EFAULT behaviour).

>>>>   * The kernel shall not extend the lifetime of user pointers in ways
>>>>     that are not clear from the specification of the syscall or
>>>>     interface to which the pointer is passed (and in any case shall not
>>>>     extend pointer lifetimes without good reason).
>>>>
>>>>     So no clever transparent caching between syscalls, unless it _really_
>>>>     is transparent in the presence of tags.
>>>
>>> Do you have any particular case in mind? If such caching is really valuable, it is
>>> always possible to access the object while ignoring the tag. For sure, the
>>> user-provided tag can only be used during the syscall handling itself, not
>>> asynchronously later on, unless otherwise specified.
>>
>> For aio* operations it would be nice if the tag was checked at the
>> time of the actual userspace read/write, either instead of or in
>> addition to at the time of the system call.
> 
> With aio* (and synchronous iovec-based syscalls), the kernel may access
> the memory while the corresponding user process is scheduled out. Given
> that such access is not done in the context of the user process (and
> using the user VA like copy_*_user), the kernel cannot handle potential
> tag faults. Moreover, the transfer may be done by DMA and the device
> does not understand tags.
> 
> I'd like to keep tags as a property of the pointer in a specific virtual
> address space. The moment you convert it to a different address space
> (e.g. kernel linear map, physical address), the tag property is stripped
> and I don't think we should re-build it (and have it checked).

OK.

i don't think the new abi needs special rules about
pointer lifetime.

>>>>   * For purposes other than dereference, the kernel shall accept any
>>>>     legitimately tagged pointer (according to the above rules) as
>>>>     identifying the associated memory location.
>>>>
>>>>     So, mprotect(some_page_aligned_object, ...); is valid irrespective
>>>>     of where page_aligned_object() came from.  There is no implicit
>>>>     derefence by the kernel here, hence no tag check.
>>>>
>>>>     The kernel does not guarantee to work correctly if the wrong tag
>>>>     is used, but there is not always a well-defined "right" tag, so
>>>>     we can't really guarantee to check it.  So a pointer derived by
>>>>     any reasonable means by userspace has to be treated as equally
>>>>     valid.
>>>
>>> This is a disputed point :) In my opinion, this is the the most reasonable approach.
>>
>> Yes, it would be nice if the kernel explicitly promised, ex.
>> mprotect() over a range of differently tagged pages to be allowed
>> (i.e. address tag should be unchecked).
> 
> I don't think mprotect() over differently tagged pages was ever a
> problem. I originally asked that mprotect() and friends do not accept
> tagged pointers since these functions deal with memory ranges rather
> than dereferencing such pointer (the reason being minimal kernel
> changes). However, given how complicated it is to specify an ABI, I came
> to the conclusion that a pointer passed to such function should be
> allowed to have non-zero top byte. It would be the kernel's
> responsibility to strip it out as appropriate.

OK.

rule 3: kernel accepts legitimately tagged non-weird memory
pointers and untags them before usage other than deref.

this is relevant if a syscall uses pointers for address range
specification, instead of deref. (mprotect, madvise,...)

i also propose:

rule 4: kernel keeps legitimate tags on non-weird memory
pointers that it returns to the user.

e.g. clone passes stack/arg/tls pointers on without dropping
tags, same for set/get_robust_list. i'm not sure if there
are pointer values observable in /proc etc but those should
keep tags too.

"legitimately tagged" may not always be obvious, but the
illegitimately tagged case can be left unspecified i think,
so dropping tags is ok, but not required if tbi is off and
mte is not used (i.e. tag is illegitimate).

i think these rules work for the cases i care about, a more
tricky question is when/how to check for the new syscall abi
and when/how the TCR_EL1.TBI0 setting may be turned off.
consider the following cases (tb == top byte):

binary 1: user tb = any, syscall tb = 0
  tbi is on, "legacy binary"

binary 2: user tb = any, syscall tb = any
  tbi is on, "new binary using tb"
  for backward compat it needs to check for new syscall abi.

binary 3: user tb = 0, syscall tb = 0
  tbi can be off, "new binary",
  binary is marked to indicate unused tb,
  kernel may turn tbi off: additional pac bits.

binary 4: user tb = mte, syscall tb = mte
  like binary 3, but with mte, "new binary using mte"
  does it have to check for new syscall abi?
  or MTE HWCAP would imply it?
  (is it possible to use mte without new syscall abi?)

in userspace we want most binaries to be like binary 3 and 4
eventually, i.e. marked as not-relying-on-tbi, if a dso is
loaded that is unmarked (legacy or new tb user), then either
the load fails (e.g. if mte is already used? or can we turn
mte off at runtime?) or tbi has to be enabled (prctl? does
this work with pac? or multi-threads?).

as for checking the new syscall abi: i don't see much semantic
difference between AT_HWCAP and AT_FLAGS (either way, the user
has to check a feature flag before using the feature of the
underlying system and it does not matter much if it's a syscall
abi feature or cpu feature), but i don't see anything wrong
with AT_FLAGS if the kernel prefers that.

the discussion here was mostly about binary 2, but for
me the open question is if we can make binary 3/4 work.
(which requires some elf binary marking, that is recognised
by the kernel and dynamic loader, and efficient handling of
the TBI0 bit, ..if it's not possible, then i don't see how
mte will be deployed).

and i guess on the kernel side the open question is if the
rules 1/2/3/4 can be made to work in corner cases e.g. when
pointers embedded into structs are passed down in ioctl.

Catalin Marinas Feb. 25, 2019, 4:57 p.m. UTC | #20

Hi Szabolcs,

Thanks for looking into this. Comments below.

On Tue, Feb 19, 2019 at 06:38:31PM +0000, Szabolcs Nagy wrote:
> i think these rules work for the cases i care about, a more
> tricky question is when/how to check for the new syscall abi
> and when/how the TCR_EL1.TBI0 setting may be turned off.

I don't think turning TBI0 off is critical (it's handy for PAC with
52-bit VA but then it's short-lived if you want more security features
like MTE).

> consider the following cases (tb == top byte):
> 
> binary 1: user tb = any, syscall tb = 0
>   tbi is on, "legacy binary"
> 
> binary 2: user tb = any, syscall tb = any
>   tbi is on, "new binary using tb"
>   for backward compat it needs to check for new syscall abi.
> 
> binary 3: user tb = 0, syscall tb = 0
>   tbi can be off, "new binary",
>   binary is marked to indicate unused tb,
>   kernel may turn tbi off: additional pac bits.
> 
> binary 4: user tb = mte, syscall tb = mte
>   like binary 3, but with mte, "new binary using mte"
>   does it have to check for new syscall abi?
>   or MTE HWCAP would imply it?
>   (is it possible to use mte without new syscall abi?)

I think MTE HWCAP should imply it.

> in userspace we want most binaries to be like binary 3 and 4
> eventually, i.e. marked as not-relying-on-tbi, if a dso is
> loaded that is unmarked (legacy or new tb user), then either
> the load fails (e.g. if mte is already used? or can we turn
> mte off at runtime?) or tbi has to be enabled (prctl? does
> this work with pac? or multi-threads?).

We could enable it via prctl. That's the plan for MTE as well (in
addition maybe to some ELF flag).

> as for checking the new syscall abi: i don't see much semantic
> difference between AT_HWCAP and AT_FLAGS (either way, the user
> has to check a feature flag before using the feature of the
> underlying system and it does not matter much if it's a syscall
> abi feature or cpu feature), but i don't see anything wrong
> with AT_FLAGS if the kernel prefers that.

The AT_FLAGS is aimed at capturing binary 2 case above, i.e. the
relaxation of the syscall ABI to accept tb = any. The MTE support will
have its own AT_HWCAP, likely in addition to AT_FLAGS. Arguably,
AT_FLAGS is either redundant here if MTE implies it (and no harm in
keeping it around) or the meaning is different: a tb != 0 may be checked
by the kernel against the allocation tag (i.e. get_user() could fail,
the tag is not entirely ignored).

> the discussion here was mostly about binary 2,

That's because passing tb != 0 into the syscall ABI is the main blocker
here that needs clearing out before merging the MTE support. There is,
of course, a variation of binary 1 for MTE:

binary 5: user tb = mte, syscall tb = 0

but this requires a lot of C lib changes to support properly.

> but for
> me the open question is if we can make binary 3/4 work.
> (which requires some elf binary marking, that is recognised
> by the kernel and dynamic loader, and efficient handling of
> the TBI0 bit, ..if it's not possible, then i don't see how
> mte will be deployed).

If we ignore binary 3, we can keep TBI0 = 1 permanently, whether we have
MTE or not.

> and i guess on the kernel side the open question is if the
> rules 1/2/3/4 can be made to work in corner cases e.g. when
> pointers embedded into structs are passed down in ioctl.

We've been trying to track these down since last summer and we came to
the conclusion that it should be (mostly) fine for the non-weird memory
described above.

Szabolcs Nagy Feb. 25, 2019, 6:02 p.m. UTC | #21

On 25/02/2019 16:57, Catalin Marinas wrote:
> On Tue, Feb 19, 2019 at 06:38:31PM +0000, Szabolcs Nagy wrote:
>> i think these rules work for the cases i care about, a more
>> tricky question is when/how to check for the new syscall abi
>> and when/how the TCR_EL1.TBI0 setting may be turned off.
> 
> I don't think turning TBI0 off is critical (it's handy for PAC with
> 52-bit VA but then it's short-lived if you want more security features
> like MTE).

yes, i made a mistake assuming TBI0 off is
required for (or at least compatible with) MTE.

if TBI0 needs to be on for MTE then some of my
analysis is wrong, and i expect TBI0 to be on
in the foreseeable future.

>> consider the following cases (tb == top byte):
>>
>> binary 1: user tb = any, syscall tb = 0
>>   tbi is on, "legacy binary"
>>
>> binary 2: user tb = any, syscall tb = any
>>   tbi is on, "new binary using tb"
>>   for backward compat it needs to check for new syscall abi.
>>
>> binary 3: user tb = 0, syscall tb = 0
>>   tbi can be off, "new binary",
>>   binary is marked to indicate unused tb,
>>   kernel may turn tbi off: additional pac bits.
>>
>> binary 4: user tb = mte, syscall tb = mte
>>   like binary 3, but with mte, "new binary using mte"

so this should be "like binary 2, but with mte".

>>   does it have to check for new syscall abi?
>>   or MTE HWCAP would imply it?
>>   (is it possible to use mte without new syscall abi?)
> 
> I think MTE HWCAP should imply it.
> 
>> in userspace we want most binaries to be like binary 3 and 4
>> eventually, i.e. marked as not-relying-on-tbi, if a dso is
>> loaded that is unmarked (legacy or new tb user), then either
>> the load fails (e.g. if mte is already used? or can we turn
>> mte off at runtime?) or tbi has to be enabled (prctl? does
>> this work with pac? or multi-threads?).
> 
> We could enable it via prctl. That's the plan for MTE as well (in
> addition maybe to some ELF flag).
> 
>> as for checking the new syscall abi: i don't see much semantic
>> difference between AT_HWCAP and AT_FLAGS (either way, the user
>> has to check a feature flag before using the feature of the
>> underlying system and it does not matter much if it's a syscall
>> abi feature or cpu feature), but i don't see anything wrong
>> with AT_FLAGS if the kernel prefers that.
> 
> The AT_FLAGS is aimed at capturing binary 2 case above, i.e. the
> relaxation of the syscall ABI to accept tb = any. The MTE support will
> have its own AT_HWCAP, likely in addition to AT_FLAGS. Arguably,
> AT_FLAGS is either redundant here if MTE implies it (and no harm in
> keeping it around) or the meaning is different: a tb != 0 may be checked
> by the kernel against the allocation tag (i.e. get_user() could fail,
> the tag is not entirely ignored).
> 
>> the discussion here was mostly about binary 2,
> 
> That's because passing tb != 0 into the syscall ABI is the main blocker
> here that needs clearing out before merging the MTE support. There is,
> of course, a variation of binary 1 for MTE:
> 
> binary 5: user tb = mte, syscall tb = 0
> 
> but this requires a lot of C lib changes to support properly.

yes, i don't think we want to do that.

but it's ok to have both syscall tbi AT_FLAGS and MTE HWCAP.

>> but for
>> me the open question is if we can make binary 3/4 work.
>> (which requires some elf binary marking, that is recognised
>> by the kernel and dynamic loader, and efficient handling of
>> the TBI0 bit, ..if it's not possible, then i don't see how
>> mte will be deployed).
> 
> If we ignore binary 3, we can keep TBI0 = 1 permanently, whether we have
> MTE or not.
> 
>> and i guess on the kernel side the open question is if the
>> rules 1/2/3/4 can be made to work in corner cases e.g. when
>> pointers embedded into structs are passed down in ioctl.
> 
> We've been trying to track these down since last summer and we came to
> the conclusion that it should be (mostly) fine for the non-weird memory
> described above.

i think an interesting case is when userspace passes
a pointer to the kernel and later gets it back,
which is why i proposed rule 4 (kernel has to keep
the tag then).

but i wonder what's the right thing to do for sp
(user can malloc thread/sigalt/makecontext stack
which will be mte tagged in practice with mte on)
does tagged sp work? should userspace untag the
stack memory before setting it up as a stack?
(but then user pointers to that allocation may get
broken..)

Kevin Brodsky Feb. 26, 2019, 5:30 p.m. UTC | #22

On 25/02/2019 18:02, Szabolcs Nagy wrote:
> On 25/02/2019 16:57, Catalin Marinas wrote:
>> On Tue, Feb 19, 2019 at 06:38:31PM +0000, Szabolcs Nagy wrote:
>>> i think these rules work for the cases i care about, a more
>>> tricky question is when/how to check for the new syscall abi
>>> and when/how the TCR_EL1.TBI0 setting may be turned off.
>> I don't think turning TBI0 off is critical (it's handy for PAC with
>> 52-bit VA but then it's short-lived if you want more security features
>> like MTE).
> yes, i made a mistake assuming TBI0 off is
> required for (or at least compatible with) MTE.
>
> if TBI0 needs to be on for MTE then some of my
> analysis is wrong, and i expect TBI0 to be on
> in the foreseeable future.
>
>>> consider the following cases (tb == top byte):
>>>
>>> binary 1: user tb = any, syscall tb = 0
>>>    tbi is on, "legacy binary"
>>>
>>> binary 2: user tb = any, syscall tb = any
>>>    tbi is on, "new binary using tb"
>>>    for backward compat it needs to check for new syscall abi.
>>>
>>> binary 3: user tb = 0, syscall tb = 0
>>>    tbi can be off, "new binary",
>>>    binary is marked to indicate unused tb,
>>>    kernel may turn tbi off: additional pac bits.
>>>
>>> binary 4: user tb = mte, syscall tb = mte
>>>    like binary 3, but with mte, "new binary using mte"
> so this should be "like binary 2, but with mte".
>
>>>    does it have to check for new syscall abi?
>>>    or MTE HWCAP would imply it?
>>>    (is it possible to use mte without new syscall abi?)
>> I think MTE HWCAP should imply it.
>>
>>> in userspace we want most binaries to be like binary 3 and 4
>>> eventually, i.e. marked as not-relying-on-tbi, if a dso is
>>> loaded that is unmarked (legacy or new tb user), then either
>>> the load fails (e.g. if mte is already used? or can we turn
>>> mte off at runtime?) or tbi has to be enabled (prctl? does
>>> this work with pac? or multi-threads?).
>> We could enable it via prctl. That's the plan for MTE as well (in
>> addition maybe to some ELF flag).
>>
>>> as for checking the new syscall abi: i don't see much semantic
>>> difference between AT_HWCAP and AT_FLAGS (either way, the user
>>> has to check a feature flag before using the feature of the
>>> underlying system and it does not matter much if it's a syscall
>>> abi feature or cpu feature), but i don't see anything wrong
>>> with AT_FLAGS if the kernel prefers that.
>> The AT_FLAGS is aimed at capturing binary 2 case above, i.e. the
>> relaxation of the syscall ABI to accept tb = any. The MTE support will
>> have its own AT_HWCAP, likely in addition to AT_FLAGS. Arguably,
>> AT_FLAGS is either redundant here if MTE implies it (and no harm in
>> keeping it around) or the meaning is different: a tb != 0 may be checked
>> by the kernel against the allocation tag (i.e. get_user() could fail,
>> the tag is not entirely ignored).
>>
>>> the discussion here was mostly about binary 2,
>> That's because passing tb != 0 into the syscall ABI is the main blocker
>> here that needs clearing out before merging the MTE support. There is,
>> of course, a variation of binary 1 for MTE:
>>
>> binary 5: user tb = mte, syscall tb = 0
>>
>> but this requires a lot of C lib changes to support properly.
> yes, i don't think we want to do that.
>
> but it's ok to have both syscall tbi AT_FLAGS and MTE HWCAP.
>
>>> but for
>>> me the open question is if we can make binary 3/4 work.
>>> (which requires some elf binary marking, that is recognised
>>> by the kernel and dynamic loader, and efficient handling of
>>> the TBI0 bit, ..if it's not possible, then i don't see how
>>> mte will be deployed).
>> If we ignore binary 3, we can keep TBI0 = 1 permanently, whether we have
>> MTE or not.
>>
>>> and i guess on the kernel side the open question is if the
>>> rules 1/2/3/4 can be made to work in corner cases e.g. when
>>> pointers embedded into structs are passed down in ioctl.
>> We've been trying to track these down since last summer and we came to
>> the conclusion that it should be (mostly) fine for the non-weird memory
>> described above.
> i think an interesting case is when userspace passes
> a pointer to the kernel and later gets it back,
> which is why i proposed rule 4 (kernel has to keep
> the tag then).
>
> but i wonder what's the right thing to do for sp
> (user can malloc thread/sigalt/makecontext stack
> which will be mte tagged in practice with mte on)
> does tagged sp work? should userspace untag the
> stack memory before setting it up as a stack?
> (but then user pointers to that allocation may get
> broken..)

Tagged SP does work, and it is actually a good idea (it avoids using the default tag 
for the stack). It would be quite easy for the kernel to tag the initial SP and the 
stack on execve(). For other stacks, it is up to userspace, as you say, and would be 
made easier by making it possible to choose how a mapping should be tagged by the 
kernel via a new mmap() flag. Some software that makes too many assumptions on the 
address of stack variables will be disturbed by a tagged SP, but this should be 
fairly rare.

In any case, I don't think this impacts this ABI proposal (beyond the fact that 
passing tagged pointers to the stack needs to be allowed).

Kevin

[v9,0/8] arm64: untag user pointers passed to the kernel

Message

Comments