[v15,00/17] arm64: untag user pointers passed to the kernel

Message ID	cover.1557160186.git.andreyknvl@google.com (mailing list archive)
Headers	show Return-Path: <linux-kselftest-owner@kernel.org> Date: Mon, 6 May 2019 18:30:46 +0200 Message-Id: <cover.1557160186.git.andreyknvl@google.com> Mime-Version: 1.0 Subject: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel From: Andrey Konovalov <andreyknvl@google.com> To: linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org, linux-rdma@vger.kernel.org, linux-media@vger.kernel.org, kvm@vger.kernel.org, linux-kselftest@vger.kernel.org Cc: Catalin Marinas <catalin.marinas@arm.com>, Vincenzo Frascino <vincenzo.frascino@arm.com>, Will Deacon <will.deacon@arm.com>, Mark Rutland <mark.rutland@arm.com>, Andrew Morton <akpm@linux-foundation.org>, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, Kees Cook <keescook@chromium.org>, Yishai Hadas <yishaih@mellanox.com>, Felix Kuehling <Felix.Kuehling@amd.com>, Alexander Deucher <Alexander.Deucher@amd.com>, Christian Koenig <Christian.Koenig@amd.com>, Mauro Carvalho Chehab <mchehab@kernel.org>, Jens Wiklander <jens.wiklander@linaro.org>, Alex Williamson <alex.williamson@redhat.com>, Leon Romanovsky <leon@kernel.org>, Dmitry Vyukov <dvyukov@google.com>, Kostya Serebryany <kcc@google.com>, Evgeniy Stepanov <eugenis@google.com>, Lee Smith <Lee.Smith@arm.com>, Ramana Radhakrishnan <Ramana.Radhakrishnan@arm.com>, Jacob Bramley <Jacob.Bramley@arm.com>, Ruben Ayrapetyan <Ruben.Ayrapetyan@arm.com>, Robin Murphy <robin.murphy@arm.com>, Luc Van Oostenryck <luc.vanoostenryck@gmail.com>, Dave Martin <Dave.Martin@arm.com>, Kevin Brodsky <kevin.brodsky@arm.com>, Szabolcs Nagy <Szabolcs.Nagy@arm.com>, Andrey Konovalov <andreyknvl@google.com> Content-Type: text/plain; charset="UTF-8" Sender: linux-kselftest-owner@vger.kernel.org Precedence: bulk
Series	arm64: untag user pointers passed to the kernel \| expand [v15,00/17] arm64: untag user pointers passed to the kernel [v15,01/17] uaccess: add untagged_addr definition for other arches [v15,02/17] arm64: untag user pointers in access_ok and __uaccess_mask_ptr [v15,03/17] lib, arm64: untag user pointers in strn*_user [v15,04/17] mm: add ksys_ wrappers to memory syscalls [v15,05/17] arms64: untag user pointers passed to memory syscalls [v15,06/17] mm: untag user pointers in do_pages_move [v15,07/17] mm, arm64: untag user pointers in mm/gup.c [v15,08/17] mm, arm64: untag user pointers in get_vaddr_frames [v15,09/17] fs, arm64: untag user pointers in copy_mount_options [v15,10/17] fs, arm64: untag user pointers in fs/userfaultfd.c [v15,11/17] drm/amdgpu, arm64: untag user pointers [v15,12/17] drm/radeon, arm64: untag user pointers in radeon_gem_userptr_ioctl [v15,13/17] IB, arm64: untag user pointers in ib_uverbs_(re)reg_mr() [v15,14/17] media/v4l2-core, arm64: untag user pointers in videobuf_dma_contig_user_get [v15,15/17] tee, arm64: untag user pointers in tee_shm_register [v15,16/17] vfio/type1, arm64: untag user pointers in vaddr_get_pfn [v15,17/17] selftests, arm64: add a selftest for passing tagged pointers to kernel

Andrey Konovalov May 6, 2019, 4:30 p.m. UTC

=== Overview

arm64 has a feature called Top Byte Ignore, which allows to embed pointer
tags into the top byte of each pointer. Userspace programs (such as
HWASan, a memory debugging tool [1]) might use this feature and pass
tagged user pointers to the kernel through syscalls or other interfaces.

Right now the kernel is already able to handle user faults with tagged
pointers, due to these patches:

1. 81cddd65 ("arm64: traps: fix userspace cache maintenance emulation on a
             tagged pointer")
2. 7dcd9dd8 ("arm64: hw_breakpoint: fix watchpoint matching for tagged
	      pointers")
3. 276e9327 ("arm64: entry: improve data abort handling of tagged
	      pointers")

This patchset extends tagged pointer support to syscall arguments.

As per the proposed ABI change [3], tagged pointers are only allowed to be
passed to syscalls when they point to memory ranges obtained by anonymous
mmap() or sbrk() (see the patchset [3] for more details).

For non-memory syscalls this is done by untaging user pointers when the
kernel performs pointer checking to find out whether the pointer comes
from userspace (most notably in access_ok). The untagging is done only
when the pointer is being checked, the tag is preserved as the pointer
makes its way through the kernel and stays tagged when the kernel
dereferences the pointer when perfoming user memory accesses.

Memory syscalls (mmap, mprotect, etc.) don't do user memory accesses but
rather deal with memory ranges, and untagged pointers are better suited to
describe memory ranges internally. Thus for memory syscalls we untag
pointers completely when they enter the kernel.

=== Other approaches

One of the alternative approaches to untagging that was considered is to
completely strip the pointer tag as the pointer enters the kernel with
some kind of a syscall wrapper, but that won't work with the countless
number of different ioctl calls. With this approach we would need a custom
wrapper for each ioctl variation, which doesn't seem practical.

An alternative approach to untagging pointers in memory syscalls prologues
is to inspead allow tagged pointers to be passed to find_vma() (and other
vma related functions) and untag them there. Unfortunately, a lot of
find_vma() callers then compare or subtract the returned vma start and end
fields against the pointer that was being searched. Thus this approach
would still require changing all find_vma() callers.

=== Testing

The following testing approaches has been taken to find potential issues
with user pointer untagging:

1. Static testing (with sparse [2] and separately with a custom static
   analyzer based on Clang) to track casts of __user pointers to integer
   types to find places where untagging needs to be done.

2. Static testing with grep to find parts of the kernel that call
   find_vma() (and other similar functions) or directly compare against
   vm_start/vm_end fields of vma.

3. Static testing with grep to find parts of the kernel that compare
   user pointers with TASK_SIZE or other similar consts and macros.

4. Dynamic testing: adding BUG_ON(has_tag(addr)) to find_vma() and running
   a modified syzkaller version that passes tagged pointers to the kernel.

Based on the results of the testing the requried patches have been added
to the patchset.

=== Notes

This patchset is meant to be merged together with "arm64 relaxed ABI" [3].

This patchset is a prerequisite for ARM's memory tagging hardware feature
support [4].

This patchset has been merged into the Pixel 2 & 3 kernel trees and is
now being used to enable testing of Pixel phones with HWASan.

Thanks!

[1] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html

[2] https://github.com/lucvoo/sparse-dev/commit/5f960cb10f56ec2017c128ef9d16060e0145f292

[3] https://lkml.org/lkml/2019/3/18/819

[4] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a

Changes in v15:
- Removed unnecessary untagging from radeon_ttm_tt_set_userptr().
- Removed unnecessary untagging from amdgpu_ttm_tt_set_userptr().
- Moved untagging to validate_range() in userfaultfd code.
- Moved untagging to ib_uverbs_(re)reg_mr() from mlx4_get_umem_mr().
- Rebased onto 5.1.

Changes in v14:
- Moved untagging for most memory syscalls to an arm64 specific
  implementation, instead of doing that in the common code.
- Dropped "net, arm64: untag user pointers in tcp_zerocopy_receive", since
  the provided user pointers don't come from an anonymous map and thus are
  not covered by this ABI relaxation.
- Dropped "kernel, arm64: untag user pointers in prctl_set_mm*".
- Moved untagging from __check_mem_type() to tee_shm_register().
- Updated untagging for the amdgpu and radeon drivers to cover the MMU
  notifier, as suggested by Felix.
- Since this ABI relaxation doesn't actually allow tagged instruction
  pointers, dropped the following patches:
- Dropped "tracing, arm64: untag user pointers in seq_print_user_ip".
- Dropped "uprobes, arm64: untag user pointers in find_active_uprobe".
- Dropped "bpf, arm64: untag user pointers in stack_map_get_build_id_offset".
- Rebased onto 5.1-rc7 (37624b58).

Changes in v13:
- Simplified untagging in tcp_zerocopy_receive().
- Looked at find_vma() callers in drivers/, which allowed to identify a
  few other places where untagging is needed.
- Added patch "mm, arm64: untag user pointers in get_vaddr_frames".
- Added patch "drm/amdgpu, arm64: untag user pointers in
  amdgpu_ttm_tt_get_user_pages".
- Added patch "drm/radeon, arm64: untag user pointers in
  radeon_ttm_tt_pin_userptr".
- Added patch "IB/mlx4, arm64: untag user pointers in mlx4_get_umem_mr".
- Added patch "media/v4l2-core, arm64: untag user pointers in
  videobuf_dma_contig_user_get".
- Added patch "tee/optee, arm64: untag user pointers in check_mem_type".
- Added patch "vfio/type1, arm64: untag user pointers".

Changes in v12:
- Changed untagging in tcp_zerocopy_receive() to also untag zc->address.
- Fixed untagging in prctl_set_mm* to only untag pointers for vma lookups
  and validity checks, but leave them as is for actual user space accesses.
- Updated the link to the v2 of the "arm64 relaxed ABI" patchset [3].
- Dropped the documentation patch, as the "arm64 relaxed ABI" patchset [3]
  handles that.

Changes in v11:
- Added "uprobes, arm64: untag user pointers in find_active_uprobe" patch.
- Added "bpf, arm64: untag user pointers in stack_map_get_build_id_offset"
  patch.
- Fixed "tracing, arm64: untag user pointers in seq_print_user_ip" to
  correctly perform subtration with a tagged addr.
- Moved untagged_addr() from SYSCALL_DEFINE3(mprotect) and
  SYSCALL_DEFINE4(pkey_mprotect) to do_mprotect_pkey().
- Moved untagged_addr() definition for other arches from
  include/linux/memory.h to include/linux/mm.h.
- Changed untagging in strn*_user() to perform userspace accesses through
  tagged pointers.
- Updated the documentation to mention that passing tagged pointers to
  memory syscalls is allowed.
- Updated the test to use malloc'ed memory instead of stack memory.

Changes in v10:
- Added "mm, arm64: untag user pointers passed to memory syscalls" back.
- New patch "fs, arm64: untag user pointers in fs/userfaultfd.c".
- New patch "net, arm64: untag user pointers in tcp_zerocopy_receive".
- New patch "kernel, arm64: untag user pointers in prctl_set_mm*".
- New patch "tracing, arm64: untag user pointers in seq_print_user_ip".

Changes in v9:
- Rebased onto 4.20-rc6.
- Used u64 instead of __u64 in type casts in the untagged_addr macro for
  arm64.
- Added braces around (addr) in the untagged_addr macro for other arches.

Changes in v8:
- Rebased onto 65102238 (4.20-rc1).
- Added a note to the cover letter on why syscall wrappers/shims that untag
  user pointers won't work.
- Added a note to the cover letter that this patchset has been merged into
  the Pixel 2 kernel tree.
- Documentation fixes, in particular added a list of syscalls that don't
  support tagged user pointers.

Changes in v7:
- Rebased onto 17b57b18 (4.19-rc6).
- Dropped the "arm64: untag user address in __do_user_fault" patch, since
  the existing patches already handle user faults properly.
- Dropped the "usb, arm64: untag user addresses in devio" patch, since the
  passed pointer must come from a vma and therefore be untagged.
- Dropped the "arm64: annotate user pointers casts detected by sparse"
  patch (see the discussion to the replies of the v6 of this patchset).
- Added more context to the cover letter.
- Updated Documentation/arm64/tagged-pointers.txt.

Changes in v6:
- Added annotations for user pointer casts found by sparse.
- Rebased onto 050cdc6c (4.19-rc1+).

Changes in v5:
- Added 3 new patches that add untagging to places found with static
  analysis.
- Rebased onto 44c929e1 (4.18-rc8).

Changes in v4:
- Added a selftest for checking that passing tagged pointers to the
  kernel succeeds.
- Rebased onto 81e97f013 (4.18-rc1+).

Changes in v3:
- Rebased onto e5c51f30 (4.17-rc6+).
- Added linux-arch@ to the list of recipients.

Changes in v2:
- Rebased onto 2d618bdf (4.17-rc3+).
- Removed excessive untagging in gup.c.
- Removed untagging pointers returned from __uaccess_mask_ptr.

Changes in v1:
- Rebased onto 4.17-rc1.

Changes in RFC v2:
- Added "#ifndef untagged_addr..." fallback in linux/uaccess.h instead of
  defining it for each arch individually.
- Updated Documentation/arm64/tagged-pointers.txt.
- Dropped "mm, arm64: untag user addresses in memory syscalls".
- Rebased onto 3eb2ce82 (4.16-rc7).

Signed-off-by: Andrey Konovalov <andreyknvl@google.com>

Andrey Konovalov (17):
  uaccess: add untagged_addr definition for other arches
  arm64: untag user pointers in access_ok and __uaccess_mask_ptr
  lib, arm64: untag user pointers in strn*_user
  mm: add ksys_ wrappers to memory syscalls
  arms64: untag user pointers passed to memory syscalls
  mm: untag user pointers in do_pages_move
  mm, arm64: untag user pointers in mm/gup.c
  mm, arm64: untag user pointers in get_vaddr_frames
  fs, arm64: untag user pointers in copy_mount_options
  fs, arm64: untag user pointers in fs/userfaultfd.c
  drm/amdgpu, arm64: untag user pointers
  drm/radeon, arm64: untag user pointers in radeon_gem_userptr_ioctl
  IB, arm64: untag user pointers in ib_uverbs_(re)reg_mr()
  media/v4l2-core, arm64: untag user pointers in
    videobuf_dma_contig_user_get
  tee, arm64: untag user pointers in tee_shm_register
  vfio/type1, arm64: untag user pointers in vaddr_get_pfn
  selftests, arm64: add a selftest for passing tagged pointers to kernel

 arch/arm64/include/asm/uaccess.h              |  10 +-
 arch/arm64/kernel/sys.c                       | 128 ++++++++++++++++-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c       |   2 +
 drivers/gpu/drm/radeon/radeon_gem.c           |   2 +
 drivers/infiniband/core/uverbs_cmd.c          |   4 +
 drivers/media/v4l2-core/videobuf-dma-contig.c |   9 +-
 drivers/tee/tee_shm.c                         |   1 +
 drivers/vfio/vfio_iommu_type1.c               |   2 +
 fs/namespace.c                                |   2 +-
 fs/userfaultfd.c                              |  22 +--
 include/linux/mm.h                            |   4 +
 include/linux/syscalls.h                      |  22 +++
 ipc/shm.c                                     |   7 +-
 lib/strncpy_from_user.c                       |   3 +-
 lib/strnlen_user.c                            |   3 +-
 mm/frame_vector.c                             |   2 +
 mm/gup.c                                      |   4 +
 mm/madvise.c                                  | 129 +++++++++---------
 mm/mempolicy.c                                |  21 ++-
 mm/migrate.c                                  |   1 +
 mm/mincore.c                                  |  57 ++++----
 mm/mlock.c                                    |  20 ++-
 mm/mmap.c                                     |  30 +++-
 mm/mprotect.c                                 |   6 +-
 mm/mremap.c                                   |  27 ++--
 mm/msync.c                                    |  35 +++--
 tools/testing/selftests/arm64/.gitignore      |   1 +
 tools/testing/selftests/arm64/Makefile        |  11 ++
 .../testing/selftests/arm64/run_tags_test.sh  |  12 ++
 tools/testing/selftests/arm64/tags_test.c     |  21 +++
 31 files changed, 436 insertions(+), 164 deletions(-)
 create mode 100644 tools/testing/selftests/arm64/.gitignore
 create mode 100644 tools/testing/selftests/arm64/Makefile
 create mode 100755 tools/testing/selftests/arm64/run_tags_test.sh
 create mode 100644 tools/testing/selftests/arm64/tags_test.c

Catalin Marinas May 17, 2019, 2:49 p.m. UTC | #1

Hi Andrey,

On Mon, May 06, 2019 at 06:30:46PM +0200, Andrey Konovalov wrote:
> One of the alternative approaches to untagging that was considered is to
> completely strip the pointer tag as the pointer enters the kernel with
> some kind of a syscall wrapper, but that won't work with the countless
> number of different ioctl calls. With this approach we would need a custom
> wrapper for each ioctl variation, which doesn't seem practical.

The more I look at this problem, the less convinced I am that we can
solve it in a way that results in a stable ABI covering ioctls(). While
for the Android kernel codebase it could be simpler as you don't upgrade
the kernel version every 2.5 months, for the mainline kernel this
doesn't scale. Any run-time checks are relatively limited in terms of
drivers covered. Better static checking would be nice as a long term
solution but we didn't get anywhere with the discussion last year.

IMO (RFC for now), I see two ways forward:

1. Make this a user space problem and do not allow tagged pointers into
   the syscall ABI. A libc wrapper would have to convert structures,
   parameters before passing them into the kernel. Note that we can
   still support the hardware MTE in the kernel by enabling tagged
   memory ranges, saving/restoring tags etc. but not allowing tagged
   addresses at the syscall boundary.

2. Similar shim to the above libc wrapper but inside the kernel
   (arch/arm64 only; most pointer arguments could be covered with an
   __SC_CAST similar to the s390 one). There are two differences from
   what we've discussed in the past:

   a) this is an opt-in by the user which would have to explicitly call
      prctl(). If it returns -ENOTSUPP etc., the user won't be allowed
      to pass tagged pointers to the kernel. This would probably be the
      responsibility of the C lib to make sure it doesn't tag heap
      allocations. If the user did not opt-in, the syscalls are routed
      through the normal path (no untagging address shim).

   b) ioctl() and other blacklisted syscalls (prctl) will not accept
      tagged pointers (to be documented in Vicenzo's ABI patches).

It doesn't solve the problems we are trying to address but 2.a saves us
from blindly relaxing the ABI without knowing how to easily assess new
code being merged (over 500K lines between kernel versions). Existing
applications (who don't opt-in) won't inadvertently start using the new
ABI which could risk becoming de-facto ABI that we need to support on
the long run.

Option 1 wouldn't solve the ioctl() problem either and while it makes
things simpler for the kernel, I am aware that it's slightly more
complicated in user space (but I really don't mind if you prefer option
1 ;)).

The tagged pointers (whether hwasan or MTE) should ideally be a
transparent feature for the application writer but I don't think we can
solve it entirely and make it seamless for the multitude of ioctls().
I'd say you only opt in to such feature if you know what you are doing
and the user code takes care of specific cases like ioctl(), hence the
prctl() proposal even for the hwasan.

Comments welcomed.

Evgenii Stepanov May 20, 2019, 11:53 p.m. UTC | #2

On Fri, May 17, 2019 at 7:49 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> Hi Andrey,
>
> On Mon, May 06, 2019 at 06:30:46PM +0200, Andrey Konovalov wrote:
> > One of the alternative approaches to untagging that was considered is to
> > completely strip the pointer tag as the pointer enters the kernel with
> > some kind of a syscall wrapper, but that won't work with the countless
> > number of different ioctl calls. With this approach we would need a custom
> > wrapper for each ioctl variation, which doesn't seem practical.
>
> The more I look at this problem, the less convinced I am that we can
> solve it in a way that results in a stable ABI covering ioctls(). While
> for the Android kernel codebase it could be simpler as you don't upgrade
> the kernel version every 2.5 months, for the mainline kernel this
> doesn't scale. Any run-time checks are relatively limited in terms of
> drivers covered. Better static checking would be nice as a long term
> solution but we didn't get anywhere with the discussion last year.
>
> IMO (RFC for now), I see two ways forward:
>
> 1. Make this a user space problem and do not allow tagged pointers into
>    the syscall ABI. A libc wrapper would have to convert structures,
>    parameters before passing them into the kernel. Note that we can
>    still support the hardware MTE in the kernel by enabling tagged
>    memory ranges, saving/restoring tags etc. but not allowing tagged
>    addresses at the syscall boundary.
>
> 2. Similar shim to the above libc wrapper but inside the kernel
>    (arch/arm64 only; most pointer arguments could be covered with an
>    __SC_CAST similar to the s390 one). There are two differences from
>    what we've discussed in the past:
>
>    a) this is an opt-in by the user which would have to explicitly call
>       prctl(). If it returns -ENOTSUPP etc., the user won't be allowed
>       to pass tagged pointers to the kernel. This would probably be the
>       responsibility of the C lib to make sure it doesn't tag heap
>       allocations. If the user did not opt-in, the syscalls are routed
>       through the normal path (no untagging address shim).
>
>    b) ioctl() and other blacklisted syscalls (prctl) will not accept
>       tagged pointers (to be documented in Vicenzo's ABI patches).
>
> It doesn't solve the problems we are trying to address but 2.a saves us
> from blindly relaxing the ABI without knowing how to easily assess new
> code being merged (over 500K lines between kernel versions). Existing
> applications (who don't opt-in) won't inadvertently start using the new
> ABI which could risk becoming de-facto ABI that we need to support on
> the long run.
>
> Option 1 wouldn't solve the ioctl() problem either and while it makes
> things simpler for the kernel, I am aware that it's slightly more
> complicated in user space (but I really don't mind if you prefer option
> 1 ;)).
>
> The tagged pointers (whether hwasan or MTE) should ideally be a
> transparent feature for the application writer but I don't think we can
> solve it entirely and make it seamless for the multitude of ioctls().
> I'd say you only opt in to such feature if you know what you are doing
> and the user code takes care of specific cases like ioctl(), hence the
> prctl() proposal even for the hwasan.
>
> Comments welcomed.

Any userspace shim approach is problematic for Android because of the
apps that use raw system calls. AFAIK, all apps written in Go are in
that camp - I'm not sure how common they are, but getting them all
recompiled is probably not realistic.

The way I see it, a patch that breaks handling of tagged pointers is
not that different from, say, a patch that adds a wild pointer
dereference. Both are bugs; the difference is that (a) the former
breaks a relatively uncommon target and (b) it's arguably an easier
mistake to make. If MTE adoption goes well, (a) will not be the case
for long.

This is a bit of a chicken-and-egg problem. In a world where memory
allocators on one or several popular platforms generate pointers with
non-zero tags, any such breakage will be caught in testing.
Unfortunately to reach that state we need the kernel to start
accepting tagged pointers first, and then hold on for a couple of
years until userspace catches up.

Perhaps we can start by whitelisting ioctls by driver?

Catalin Marinas May 21, 2019, 6:29 p.m. UTC | #3

On Mon, May 20, 2019 at 04:53:07PM -0700, Evgenii Stepanov wrote:
> On Fri, May 17, 2019 at 7:49 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > IMO (RFC for now), I see two ways forward:
> >
> > 1. Make this a user space problem and do not allow tagged pointers into
> >    the syscall ABI. A libc wrapper would have to convert structures,
> >    parameters before passing them into the kernel. Note that we can
> >    still support the hardware MTE in the kernel by enabling tagged
> >    memory ranges, saving/restoring tags etc. but not allowing tagged
> >    addresses at the syscall boundary.
> >
> > 2. Similar shim to the above libc wrapper but inside the kernel
> >    (arch/arm64 only; most pointer arguments could be covered with an
> >    __SC_CAST similar to the s390 one). There are two differences from
> >    what we've discussed in the past:
> >
> >    a) this is an opt-in by the user which would have to explicitly call
> >       prctl(). If it returns -ENOTSUPP etc., the user won't be allowed
> >       to pass tagged pointers to the kernel. This would probably be the
> >       responsibility of the C lib to make sure it doesn't tag heap
> >       allocations. If the user did not opt-in, the syscalls are routed
> >       through the normal path (no untagging address shim).
> >
> >    b) ioctl() and other blacklisted syscalls (prctl) will not accept
> >       tagged pointers (to be documented in Vicenzo's ABI patches).
[...]
> Any userspace shim approach is problematic for Android because of the
> apps that use raw system calls. AFAIK, all apps written in Go are in
> that camp - I'm not sure how common they are, but getting them all
> recompiled is probably not realistic.

That's a fair point (I wasn't expecting it would get much traction
anyway ;)). OTOH, it allows upstreaming of the MTE patches while we
continue the discussions around TBI.

> The way I see it, a patch that breaks handling of tagged pointers is
> not that different from, say, a patch that adds a wild pointer
> dereference. Both are bugs; the difference is that (a) the former
> breaks a relatively uncommon target and (b) it's arguably an easier
> mistake to make. If MTE adoption goes well, (a) will not be the case
> for long.

It's also the fact such patch would go unnoticed for a long time until
someone exercises that code path. And when they do, the user would be
pretty much in the dark trying to figure what what went wrong, why a
SIGSEGV or -EFAULT happened. What's worse, we can't even say we fixed
all the places where it matters in the current kernel codebase (ignoring
future patches).

I think we should revisit the static checking discussions we had last
year. Run-time checking (even with compiler instrumentation and
syzkaller fuzzing) would only cover the code paths specific to a Linux
or Android installation.

> This is a bit of a chicken-and-egg problem. In a world where memory
> allocators on one or several popular platforms generate pointers with
> non-zero tags, any such breakage will be caught in testing.
> Unfortunately to reach that state we need the kernel to start
> accepting tagged pointers first, and then hold on for a couple of
> years until userspace catches up.

Would the kernel also catch up with providing a stable ABI? Because we
have two moving targets.

On one hand, you have Android or some Linux distro that stick to a
stable kernel version for some time, so they have better chance of
clearing most of the problems. On the other hand, we have mainline
kernel that gets over 500K lines every release. As maintainer, I can't
rely on my testing alone as this is on a limited number of platforms. So
my concern is that every kernel release has a significant chance of
breaking the ABI, unless we have a better way of identifying potential
issues.

> Perhaps we can start by whitelisting ioctls by driver?

This was also raised by Ruben in private but without a (static) tool to
to check, manually going through all the drivers doesn't scale. It's
very likely that most drivers don't care, just a get_user/put_user is
already handled by these patches. Searching for find_vma() was
identifying one such use-case but is this sufficient? Are there other
cases we need to explicitly untag a pointer?

The other point I'd like feedback on is 2.a above. I see _some_ value
into having the user opt-in to this relaxed ABI rather than blinding
exposing it to all applications. Dave suggested (in private) a new
personality (e.g. PER_LINUX_TBI) inherited by children. It would be the
responsibility of the C library to check the current personality bits
and only tag pointers on allocation *if* the kernel allowed it. The
kernel could provide the AT_FLAGS bit as in Vincenzo's patches if the
personality was set but can't set it retrospectively if the user called
sys_personality. By default, /sbin/init would not have this personality
and libc would not tag pointers, so we can guarantee that your distro
boots normally with a new kernel version. We could have an envp that
gets caught by /sbin/init so you can pass it on the kernel command line
(or a dynamic loader at run-time). But the default should be the current
ABI behaviour.

We can enforce the current behaviour by having access_ok() check the
personality or a TIF flag but we may relax this enforcement at some
point in the future as we learn more about the implications of TBI.

Thanks.

Jason Gunthorpe May 21, 2019, 6:48 p.m. UTC | #4

On Fri, May 17, 2019 at 03:49:31PM +0100, Catalin Marinas wrote:

> The tagged pointers (whether hwasan or MTE) should ideally be a
> transparent feature for the application writer but I don't think we can
> solve it entirely and make it seamless for the multitude of ioctls().
> I'd say you only opt in to such feature if you know what you are doing
> and the user code takes care of specific cases like ioctl(), hence the
> prctl() proposal even for the hwasan.

I'm not sure such a dire view is warrented.. 

The ioctl situation is not so bad, other than a few special cases,
most drivers just take a 'void __user *' and pass it as an argument to
some function that accepts a 'void __user *'. sparse et al verify
this. 

As long as the core functions do the right thing the drivers will be
OK.

The only place things get dicy is if someone casts to unsigned long
(ie for vma work) but I think that reflects that our driver facing
APIs for VMAs are compatible with static analysis (ie I have no
earthly idea why get_user_pages() accepts an unsigned long), not that
this is too hard.

Jason

Kees Cook May 22, 2019, 12:04 a.m. UTC | #5

On Tue, May 21, 2019 at 07:29:33PM +0100, Catalin Marinas wrote:
> On Mon, May 20, 2019 at 04:53:07PM -0700, Evgenii Stepanov wrote:
> > On Fri, May 17, 2019 at 7:49 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > IMO (RFC for now), I see two ways forward:
> > > [...]
> > > 2. Similar shim to the above libc wrapper but inside the kernel
> > >    (arch/arm64 only; most pointer arguments could be covered with an
> > >    __SC_CAST similar to the s390 one). There are two differences from
> > >    what we've discussed in the past:
> > >
> > >    a) this is an opt-in by the user which would have to explicitly call
> > >       prctl(). If it returns -ENOTSUPP etc., the user won't be allowed
> > >       to pass tagged pointers to the kernel. This would probably be the
> > >       responsibility of the C lib to make sure it doesn't tag heap
> > >       allocations. If the user did not opt-in, the syscalls are routed
> > >       through the normal path (no untagging address shim).
> > >
> > >    b) ioctl() and other blacklisted syscalls (prctl) will not accept
> > >       tagged pointers (to be documented in Vicenzo's ABI patches).
> >
> > The way I see it, a patch that breaks handling of tagged pointers is
> > not that different from, say, a patch that adds a wild pointer
> > dereference. Both are bugs; the difference is that (a) the former
> > breaks a relatively uncommon target and (b) it's arguably an easier
> > mistake to make. If MTE adoption goes well, (a) will not be the case
> > for long.
> 
> It's also the fact such patch would go unnoticed for a long time until
> someone exercises that code path. And when they do, the user would be
> pretty much in the dark trying to figure what what went wrong, why a
> SIGSEGV or -EFAULT happened. What's worse, we can't even say we fixed
> all the places where it matters in the current kernel codebase (ignoring
> future patches).

So, looking forward a bit, this isn't going to be an ARM-specific issue
for long. In fact, I think we shouldn't have arm-specific syscall wrappers
in this series: I think untagged_addr() should likely be added at the
top-level and have it be a no-op for other architectures. So given this
becoming a kernel-wide multi-architecture issue (under the assumption
that x86, RISC-V, and others will gain similar TBI or MTE things),
we should solve it in a way that we can re-use.

We need something that is going to work everywhere. And it needs to be
supported by the kernel for the simple reason that the kernel needs to
do MTE checks during copy_from_user(): having that information stripped
means we lose any userspace-assigned MTE protections if they get handled
by the kernel, which is a total non-starter, IMO.

As an aside: I think Sparc ADI support in Linux actually side-stepped
this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must
be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI
for kernel code.") I think this was a mistake we should not repeat for
arm64 (we do seem to be at least in agreement about this, I think).

[1] https://lore.kernel.org/patchwork/patch/654481/

> > This is a bit of a chicken-and-egg problem. In a world where memory
> > allocators on one or several popular platforms generate pointers with
> > non-zero tags, any such breakage will be caught in testing.
> > Unfortunately to reach that state we need the kernel to start
> > accepting tagged pointers first, and then hold on for a couple of
> > years until userspace catches up.
> 
> Would the kernel also catch up with providing a stable ABI? Because we
> have two moving targets.
> 
> On one hand, you have Android or some Linux distro that stick to a
> stable kernel version for some time, so they have better chance of
> clearing most of the problems. On the other hand, we have mainline
> kernel that gets over 500K lines every release. As maintainer, I can't
> rely on my testing alone as this is on a limited number of platforms. So
> my concern is that every kernel release has a significant chance of
> breaking the ABI, unless we have a better way of identifying potential
> issues.

I just want to make sure I fully understand your concern about this
being an ABI break, and I work best with examples. The closest situation
I can see would be:

- some program has no idea about MTE
- malloc() starts returning MTE-tagged addresses
- program doesn't break from that change
- program uses some syscall that is missing untagged_addr() and fails
- kernel has now broken userspace that used to work

The trouble I see with this is that it is largely theoretical and
requires part of userspace to collude to start using a new CPU feature
that tickles a bug in the kernel. As I understand the golden rule,
this is a bug in the kernel (a missed ioctl() or such) to be fixed,
not a global breaking of some userspace behavior.

I feel like I'm missing something about this being seen as an ABI
break. The kernel already fails on userspace addresses that have high
bits set -- are there things that _depend_ on this failure to operate?

Catalin Marinas May 22, 2019, 10:11 a.m. UTC | #6

Hi Kees,

Thanks for joining the thread ;).

On Tue, May 21, 2019 at 05:04:39PM -0700, Kees Cook wrote:
> On Tue, May 21, 2019 at 07:29:33PM +0100, Catalin Marinas wrote:
> > On Mon, May 20, 2019 at 04:53:07PM -0700, Evgenii Stepanov wrote:
> > > On Fri, May 17, 2019 at 7:49 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > > IMO (RFC for now), I see two ways forward:
> > > > [...]
> > > > 2. Similar shim to the above libc wrapper but inside the kernel
> > > >    (arch/arm64 only; most pointer arguments could be covered with an
> > > >    __SC_CAST similar to the s390 one). There are two differences from
> > > >    what we've discussed in the past:
> > > >
> > > >    a) this is an opt-in by the user which would have to explicitly call
> > > >       prctl(). If it returns -ENOTSUPP etc., the user won't be allowed
> > > >       to pass tagged pointers to the kernel. This would probably be the
> > > >       responsibility of the C lib to make sure it doesn't tag heap
> > > >       allocations. If the user did not opt-in, the syscalls are routed
> > > >       through the normal path (no untagging address shim).
> > > >
> > > >    b) ioctl() and other blacklisted syscalls (prctl) will not accept
> > > >       tagged pointers (to be documented in Vicenzo's ABI patches).
> > >
> > > The way I see it, a patch that breaks handling of tagged pointers is
> > > not that different from, say, a patch that adds a wild pointer
> > > dereference. Both are bugs; the difference is that (a) the former
> > > breaks a relatively uncommon target and (b) it's arguably an easier
> > > mistake to make. If MTE adoption goes well, (a) will not be the case
> > > for long.
> > 
> > It's also the fact such patch would go unnoticed for a long time until
> > someone exercises that code path. And when they do, the user would be
> > pretty much in the dark trying to figure what what went wrong, why a
> > SIGSEGV or -EFAULT happened. What's worse, we can't even say we fixed
> > all the places where it matters in the current kernel codebase (ignoring
> > future patches).
> 
> So, looking forward a bit, this isn't going to be an ARM-specific issue
> for long.

I do hope so.

> In fact, I think we shouldn't have arm-specific syscall wrappers
> in this series: I think untagged_addr() should likely be added at the
> top-level and have it be a no-op for other architectures.

That's what the current patchset does, so we have this as a starting
point. Kostya raised another potential issue with the syscall wrappers:
with MTE the kernel will be forced to enable the match-all (wildcard)
pointers for user space accesses since copy_from_user() would only get a
0 tag. So it has wider implications than just uaccess routines not
checking the colour.

> So given this becoming a kernel-wide multi-architecture issue (under
> the assumption that x86, RISC-V, and others will gain similar TBI or
> MTE things), we should solve it in a way that we can re-use.

Can we do any better to aid the untagged_addr() placement (e.g. better
type annotations, better static analysis)? We have to distinguish
between user pointers that may be dereferenced by the kernel (I think
almost fully covered with this patchset) and user addresses represented
as ulong that may:

a) be converted to a user pointer and dereferenced; I think that's the
   case for many overloaded ulong/u64 arguments

b) used for address space management, rbtree look-ups etc. where the tag
   is no longer relevant and it even gets in the way

We tried last year to identify void __user * casts to unsigned long
using sparse on the assumption that pointers can be tagged while ulong
is about address space management and needs to lose such tag. I think we
could have pushed this further. For example, get_user_pages() takes an
unsigned long but it is perfectly capable of untagging the address
itself. Shall we change its first argument to void __user * (together
with all its callers)?

find_vma(), OTOH, could untag the address but it doesn't help since
vm_start/end don't have such information (that's more about the content
or type that the user decided) and the callers check against it.

Are there any other places where this matters? These patches tracked
down find_vma() as some heuristics but we may need better static
analysis to identify other cases.

> We need something that is going to work everywhere. And it needs to be
> supported by the kernel for the simple reason that the kernel needs to
> do MTE checks during copy_from_user(): having that information stripped
> means we lose any userspace-assigned MTE protections if they get handled
> by the kernel, which is a total non-starter, IMO.

Such feedback is welcomed ;).

> As an aside: I think Sparc ADI support in Linux actually side-stepped
> this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must
> be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI
> for kernel code.") I think this was a mistake we should not repeat for
> arm64 (we do seem to be at least in agreement about this, I think).
> 
> [1] https://lore.kernel.org/patchwork/patch/654481/

I tried to drag the SPARC guys into this discussion but without much
success.

> > > This is a bit of a chicken-and-egg problem. In a world where memory
> > > allocators on one or several popular platforms generate pointers with
> > > non-zero tags, any such breakage will be caught in testing.
> > > Unfortunately to reach that state we need the kernel to start
> > > accepting tagged pointers first, and then hold on for a couple of
> > > years until userspace catches up.
> > 
> > Would the kernel also catch up with providing a stable ABI? Because we
> > have two moving targets.
> > 
> > On one hand, you have Android or some Linux distro that stick to a
> > stable kernel version for some time, so they have better chance of
> > clearing most of the problems. On the other hand, we have mainline
> > kernel that gets over 500K lines every release. As maintainer, I can't
> > rely on my testing alone as this is on a limited number of platforms. So
> > my concern is that every kernel release has a significant chance of
> > breaking the ABI, unless we have a better way of identifying potential
> > issues.
> 
> I just want to make sure I fully understand your concern about this
> being an ABI break, and I work best with examples. The closest situation
> I can see would be:
> 
> - some program has no idea about MTE

Apart from some libraries like libc (and maybe those that handle
specific device ioctls), I think most programs should have no idea about
MTE. I wouldn't expect programmers to have to change their app just
because we have a new feature that colours heap allocations.

> - malloc() starts returning MTE-tagged addresses
> - program doesn't break from that change
> - program uses some syscall that is missing untagged_addr() and fails
> - kernel has now broken userspace that used to work

That's one aspect though probably more of a case of plugging in a new
device (graphics card, network etc.) and the ioctl to the new device
doesn't work.

The other is that, assuming we reach a point where the kernel entirely
supports this relaxed ABI, can we guarantee that it won't break in the
future. Let's say some subsequent kernel change (some refactoring)
misses out an untagged_addr(). This renders a previously TBI/MTE-capable
syscall unusable. Can we rely only on testing?

> The trouble I see with this is that it is largely theoretical and
> requires part of userspace to collude to start using a new CPU feature
> that tickles a bug in the kernel. As I understand the golden rule,
> this is a bug in the kernel (a missed ioctl() or such) to be fixed,
> not a global breaking of some userspace behavior.

Yes, we should follow the rule that it's a kernel bug but it doesn't
help the user that a newly installed kernel causes user space to no
longer reach a prompt. Hence the proposal of an opt-in via personality
(for MTE we would need an explicit opt-in by the user anyway since the
top byte is no longer ignored but checked against the allocation tag).

> I feel like I'm missing something about this being seen as an ABI
> break. The kernel already fails on userspace addresses that have high
> bits set -- are there things that _depend_ on this failure to operate?

It's about providing a relaxed ABI which allows non-zero top byte and
breaking it later inadvertently without having something better in place
to analyse the kernel changes.

Thanks.

Dave Martin May 22, 2019, 1:49 p.m. UTC | #7

On Tue, May 21, 2019 at 03:48:56PM -0300, Jason Gunthorpe wrote:
> On Fri, May 17, 2019 at 03:49:31PM +0100, Catalin Marinas wrote:
> 
> > The tagged pointers (whether hwasan or MTE) should ideally be a
> > transparent feature for the application writer but I don't think we can
> > solve it entirely and make it seamless for the multitude of ioctls().
> > I'd say you only opt in to such feature if you know what you are doing
> > and the user code takes care of specific cases like ioctl(), hence the
> > prctl() proposal even for the hwasan.
> 
> I'm not sure such a dire view is warrented.. 
> 
> The ioctl situation is not so bad, other than a few special cases,
> most drivers just take a 'void __user *' and pass it as an argument to
> some function that accepts a 'void __user *'. sparse et al verify
> this. 
> 
> As long as the core functions do the right thing the drivers will be
> OK.
> 
> The only place things get dicy is if someone casts to unsigned long
> (ie for vma work) but I think that reflects that our driver facing
> APIs for VMAs are compatible with static analysis (ie I have no
> earthly idea why get_user_pages() accepts an unsigned long), not that
> this is too hard.

If multiple people will care about this, perhaps we should try to
annotate types more explicitly in SYSCALL_DEFINEx() and ABI data
structures.

For example, we could have a couple of mutually exclusive modifiers

T __object *
T __vaddr * (or U __vaddr)

In the first case the pointer points to an object (in the C sense)
that the call may dereference but not use for any other purpose.

In the latter case the pointer (or other type) is a virtual address
that the call does not dereference but my do other things with.

Also

U __really(T)

to tell static analysers the real type of pointers smuggled through
UAPI disguised as other types (*cough* KVM, etc.)

We could gradually make sparse more strict about the presence of
annotations and allowed conversions, add get/put_user() variants
that demand explicit annotation, etc.

find_vma() wouldn't work with a __object pointer, for example.  A
get_user_pages_for_dereference() might be needed for __object pointers
(embodying a promise from the caller that only the object will be
dereferenced within the mapped pages).

Thoughts?

This kind of thing would need widespread buy-in in order to be viable.

Cheers
---Dave

enh May 22, 2019, 3:30 p.m. UTC | #8

On Wed, May 22, 2019 at 3:11 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> Hi Kees,
>
> Thanks for joining the thread ;).
>
> On Tue, May 21, 2019 at 05:04:39PM -0700, Kees Cook wrote:
> > On Tue, May 21, 2019 at 07:29:33PM +0100, Catalin Marinas wrote:
> > > On Mon, May 20, 2019 at 04:53:07PM -0700, Evgenii Stepanov wrote:
> > > > On Fri, May 17, 2019 at 7:49 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > > > IMO (RFC for now), I see two ways forward:
> > > > > [...]
> > > > > 2. Similar shim to the above libc wrapper but inside the kernel
> > > > >    (arch/arm64 only; most pointer arguments could be covered with an
> > > > >    __SC_CAST similar to the s390 one). There are two differences from
> > > > >    what we've discussed in the past:
> > > > >
> > > > >    a) this is an opt-in by the user which would have to explicitly call
> > > > >       prctl(). If it returns -ENOTSUPP etc., the user won't be allowed
> > > > >       to pass tagged pointers to the kernel. This would probably be the
> > > > >       responsibility of the C lib to make sure it doesn't tag heap
> > > > >       allocations. If the user did not opt-in, the syscalls are routed
> > > > >       through the normal path (no untagging address shim).
> > > > >
> > > > >    b) ioctl() and other blacklisted syscalls (prctl) will not accept
> > > > >       tagged pointers (to be documented in Vicenzo's ABI patches).
> > > >
> > > > The way I see it, a patch that breaks handling of tagged pointers is
> > > > not that different from, say, a patch that adds a wild pointer
> > > > dereference. Both are bugs; the difference is that (a) the former
> > > > breaks a relatively uncommon target and (b) it's arguably an easier
> > > > mistake to make. If MTE adoption goes well, (a) will not be the case
> > > > for long.
> > >
> > > It's also the fact such patch would go unnoticed for a long time until
> > > someone exercises that code path. And when they do, the user would be
> > > pretty much in the dark trying to figure what what went wrong, why a
> > > SIGSEGV or -EFAULT happened. What's worse, we can't even say we fixed
> > > all the places where it matters in the current kernel codebase (ignoring
> > > future patches).
> >
> > So, looking forward a bit, this isn't going to be an ARM-specific issue
> > for long.
>
> I do hope so.
>
> > In fact, I think we shouldn't have arm-specific syscall wrappers
> > in this series: I think untagged_addr() should likely be added at the
> > top-level and have it be a no-op for other architectures.
>
> That's what the current patchset does, so we have this as a starting
> point. Kostya raised another potential issue with the syscall wrappers:
> with MTE the kernel will be forced to enable the match-all (wildcard)
> pointers for user space accesses since copy_from_user() would only get a
> 0 tag. So it has wider implications than just uaccess routines not
> checking the colour.
>
> > So given this becoming a kernel-wide multi-architecture issue (under
> > the assumption that x86, RISC-V, and others will gain similar TBI or
> > MTE things), we should solve it in a way that we can re-use.
>
> Can we do any better to aid the untagged_addr() placement (e.g. better
> type annotations, better static analysis)? We have to distinguish
> between user pointers that may be dereferenced by the kernel (I think
> almost fully covered with this patchset) and user addresses represented
> as ulong that may:
>
> a) be converted to a user pointer and dereferenced; I think that's the
>    case for many overloaded ulong/u64 arguments
>
> b) used for address space management, rbtree look-ups etc. where the tag
>    is no longer relevant and it even gets in the way
>
> We tried last year to identify void __user * casts to unsigned long
> using sparse on the assumption that pointers can be tagged while ulong
> is about address space management and needs to lose such tag. I think we
> could have pushed this further. For example, get_user_pages() takes an
> unsigned long but it is perfectly capable of untagging the address
> itself. Shall we change its first argument to void __user * (together
> with all its callers)?
>
> find_vma(), OTOH, could untag the address but it doesn't help since
> vm_start/end don't have such information (that's more about the content
> or type that the user decided) and the callers check against it.
>
> Are there any other places where this matters? These patches tracked
> down find_vma() as some heuristics but we may need better static
> analysis to identify other cases.
>
> > We need something that is going to work everywhere. And it needs to be
> > supported by the kernel for the simple reason that the kernel needs to
> > do MTE checks during copy_from_user(): having that information stripped
> > means we lose any userspace-assigned MTE protections if they get handled
> > by the kernel, which is a total non-starter, IMO.
>
> Such feedback is welcomed ;).
>
> > As an aside: I think Sparc ADI support in Linux actually side-stepped
> > this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must
> > be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI
> > for kernel code.") I think this was a mistake we should not repeat for
> > arm64 (we do seem to be at least in agreement about this, I think).
> >
> > [1] https://lore.kernel.org/patchwork/patch/654481/
>
> I tried to drag the SPARC guys into this discussion but without much
> success.
>
> > > > This is a bit of a chicken-and-egg problem. In a world where memory
> > > > allocators on one or several popular platforms generate pointers with
> > > > non-zero tags, any such breakage will be caught in testing.
> > > > Unfortunately to reach that state we need the kernel to start
> > > > accepting tagged pointers first, and then hold on for a couple of
> > > > years until userspace catches up.
> > >
> > > Would the kernel also catch up with providing a stable ABI? Because we
> > > have two moving targets.
> > >
> > > On one hand, you have Android or some Linux distro that stick to a
> > > stable kernel version for some time, so they have better chance of
> > > clearing most of the problems. On the other hand, we have mainline
> > > kernel that gets over 500K lines every release. As maintainer, I can't
> > > rely on my testing alone as this is on a limited number of platforms. So
> > > my concern is that every kernel release has a significant chance of
> > > breaking the ABI, unless we have a better way of identifying potential
> > > issues.
> >
> > I just want to make sure I fully understand your concern about this
> > being an ABI break, and I work best with examples. The closest situation
> > I can see would be:
> >
> > - some program has no idea about MTE
>
> Apart from some libraries like libc (and maybe those that handle
> specific device ioctls), I think most programs should have no idea about
> MTE. I wouldn't expect programmers to have to change their app just
> because we have a new feature that colours heap allocations.

obviously i'm biased as a libc maintainer, but...

i don't think it helps to move this to libc --- now you just have an
extra dependency where to have a guaranteed working system you need to
update your kernel and libc together. (or at least update your libc to
understand new ioctls etc _before_ you can update your kernel.)

> > - malloc() starts returning MTE-tagged addresses
> > - program doesn't break from that change
> > - program uses some syscall that is missing untagged_addr() and fails
> > - kernel has now broken userspace that used to work
>
> That's one aspect though probably more of a case of plugging in a new
> device (graphics card, network etc.) and the ioctl to the new device
> doesn't work.
>
> The other is that, assuming we reach a point where the kernel entirely
> supports this relaxed ABI, can we guarantee that it won't break in the
> future. Let's say some subsequent kernel change (some refactoring)
> misses out an untagged_addr(). This renders a previously TBI/MTE-capable
> syscall unusable. Can we rely only on testing?
>
> > The trouble I see with this is that it is largely theoretical and
> > requires part of userspace to collude to start using a new CPU feature
> > that tickles a bug in the kernel. As I understand the golden rule,
> > this is a bug in the kernel (a missed ioctl() or such) to be fixed,
> > not a global breaking of some userspace behavior.
>
> Yes, we should follow the rule that it's a kernel bug but it doesn't
> help the user that a newly installed kernel causes user space to no
> longer reach a prompt. Hence the proposal of an opt-in via personality
> (for MTE we would need an explicit opt-in by the user anyway since the
> top byte is no longer ignored but checked against the allocation tag).

but realistically would this actually get used in this way? or would
any given system either be MTE or non-MTE. in which case a kernel
configuration option would seem to make more sense. (because either
way, the hypothetical user basically needs to recompile the kernel to
get back on their feet. or all of userspace.)

i'm not sure i see this new way for a kernel update to break my system
and need to be fixed forward/rolled back as any different from any of
the existing ways in which this can happen :-) as an end-user i have
to rely on whoever's sending me software updates to test adequately
enough that they find the problems. as an end user, there isn't any
difference between "my phone rebooted when i tried to take a photo
because of a kernel/driver leak", say, and "my phone rebooted when i
tried to take a photo because of missing untagging of a pointer passed
via ioctl".

i suspect you and i have very different people in mind when we say "user" :-)

> > I feel like I'm missing something about this being seen as an ABI
> > break. The kernel already fails on userspace addresses that have high
> > bits set -- are there things that _depend_ on this failure to operate?
>
> It's about providing a relaxed ABI which allows non-zero top byte and
> breaking it later inadvertently without having something better in place
> to analyse the kernel changes.
>
> Thanks.
>
> --
> Catalin

Catalin Marinas May 22, 2019, 4:35 p.m. UTC | #9

On Wed, May 22, 2019 at 08:30:21AM -0700, enh wrote:
> On Wed, May 22, 2019 at 3:11 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Tue, May 21, 2019 at 05:04:39PM -0700, Kees Cook wrote:
> > > I just want to make sure I fully understand your concern about this
> > > being an ABI break, and I work best with examples. The closest situation
> > > I can see would be:
> > >
> > > - some program has no idea about MTE
> >
> > Apart from some libraries like libc (and maybe those that handle
> > specific device ioctls), I think most programs should have no idea about
> > MTE. I wouldn't expect programmers to have to change their app just
> > because we have a new feature that colours heap allocations.
> 
> obviously i'm biased as a libc maintainer, but...
> 
> i don't think it helps to move this to libc --- now you just have an
> extra dependency where to have a guaranteed working system you need to
> update your kernel and libc together. (or at least update your libc to
> understand new ioctls etc _before_ you can update your kernel.)

That's not what I meant (or I misunderstood you). If we have a relaxed
ABI in the kernel and a libc that returns tagged pointers on malloc() I
wouldn't expect the programmer to do anything different in the
application code like explicit untagging. Basically the program would
continue to run unmodified irrespective of whether you use an old libc
without tagged pointers or a new one which tags heap allocations.

What I do expect is that the libc checks for the presence of the relaxed
ABI, currently proposed as an AT_FLAGS bit (for MTE we'd have a
HWCAP_MTE), and only tag the malloc() pointers if the kernel supports
the relaxed ABI. As you said, you shouldn't expect that the C library
and kernel are upgraded together, so they should be able to work in any
new/old version combination.

> > > The trouble I see with this is that it is largely theoretical and
> > > requires part of userspace to collude to start using a new CPU feature
> > > that tickles a bug in the kernel. As I understand the golden rule,
> > > this is a bug in the kernel (a missed ioctl() or such) to be fixed,
> > > not a global breaking of some userspace behavior.
> >
> > Yes, we should follow the rule that it's a kernel bug but it doesn't
> > help the user that a newly installed kernel causes user space to no
> > longer reach a prompt. Hence the proposal of an opt-in via personality
> > (for MTE we would need an explicit opt-in by the user anyway since the
> > top byte is no longer ignored but checked against the allocation tag).
> 
> but realistically would this actually get used in this way? or would
> any given system either be MTE or non-MTE. in which case a kernel
> configuration option would seem to make more sense. (because either
> way, the hypothetical user basically needs to recompile the kernel to
> get back on their feet. or all of userspace.)

The two hard requirements I have for supporting any new hardware feature
in Linux are (1) a single kernel image binary continues to run on old
hardware while making use of the new feature if available and (2) old
user space continues to run on new hardware while new user space can
take advantage of the new feature.

The distro user space usually has a hard requirement that it continues
to run on (certain) old hardware. We can't enforce this in the kernel
but we offer the option to user space developers of checking feature
availability through HWCAP bits.

The Android story may be different as you have more control about which
kernel configurations are deployed on specific SoCs. I'm looking more
from a Linux distro angle where you just get an off-the-shelf OS image
and install it on your hardware, either taking advantage of new features
or just not using them if the software was not updated. Or, if updated
software is installed on old hardware, it would just run.

For MTE, we just can't enable it by default since there are applications
who use the top byte of a pointer and expect it to be ignored rather
than failing with a mismatched tag. Just think of a hwasan compiled
binary where TBI is expected to work and you try to run it with MTE
turned on.

I would also expect the C library or dynamic loader to check for the
presence of a HWCAP_MTE bit before starting to tag memory allocations,
otherwise it would get SIGILL on the first MTE instruction it tries to
execute.

> i'm not sure i see this new way for a kernel update to break my system
> and need to be fixed forward/rolled back as any different from any of
> the existing ways in which this can happen :-) as an end-user i have
> to rely on whoever's sending me software updates to test adequately
> enough that they find the problems. as an end user, there isn't any
> difference between "my phone rebooted when i tried to take a photo
> because of a kernel/driver leak", say, and "my phone rebooted when i
> tried to take a photo because of missing untagging of a pointer passed
> via ioctl".
> 
> i suspect you and i have very different people in mind when we say "user" :-)

Indeed, I think we have different users in mind. I didn't mean the end
user who doesn't really care which C library version it's running on
their phone but rather advanced users (not necessarily kernel
developers) that prefer to build their own kernels with every release.
We could extend this to kernel developers who don't have time to track
down why a new kernel triggers lots of SIGSEGVs during boot.

enh May 22, 2019, 4:58 p.m. UTC | #10

On Wed, May 22, 2019 at 9:35 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Wed, May 22, 2019 at 08:30:21AM -0700, enh wrote:
> > On Wed, May 22, 2019 at 3:11 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > On Tue, May 21, 2019 at 05:04:39PM -0700, Kees Cook wrote:
> > > > I just want to make sure I fully understand your concern about this
> > > > being an ABI break, and I work best with examples. The closest situation
> > > > I can see would be:
> > > >
> > > > - some program has no idea about MTE
> > >
> > > Apart from some libraries like libc (and maybe those that handle
> > > specific device ioctls), I think most programs should have no idea about
> > > MTE. I wouldn't expect programmers to have to change their app just
> > > because we have a new feature that colours heap allocations.
> >
> > obviously i'm biased as a libc maintainer, but...
> >
> > i don't think it helps to move this to libc --- now you just have an
> > extra dependency where to have a guaranteed working system you need to
> > update your kernel and libc together. (or at least update your libc to
> > understand new ioctls etc _before_ you can update your kernel.)
>
> That's not what I meant (or I misunderstood you). If we have a relaxed
> ABI in the kernel and a libc that returns tagged pointers on malloc() I
> wouldn't expect the programmer to do anything different in the
> application code like explicit untagging. Basically the program would
> continue to run unmodified irrespective of whether you use an old libc
> without tagged pointers or a new one which tags heap allocations.
>
> What I do expect is that the libc checks for the presence of the relaxed
> ABI, currently proposed as an AT_FLAGS bit (for MTE we'd have a
> HWCAP_MTE), and only tag the malloc() pointers if the kernel supports
> the relaxed ABI. As you said, you shouldn't expect that the C library
> and kernel are upgraded together, so they should be able to work in any
> new/old version combination.

yes, that part makes sense. i do think we'd use the AT_FLAGS bit, for
exactly this.

i was questioning the argument about the ioctl issues, and saying that
from my perspective, untagging bugs are not really any different than
any other kind of kernel bug.

> > > > The trouble I see with this is that it is largely theoretical and
> > > > requires part of userspace to collude to start using a new CPU feature
> > > > that tickles a bug in the kernel. As I understand the golden rule,
> > > > this is a bug in the kernel (a missed ioctl() or such) to be fixed,
> > > > not a global breaking of some userspace behavior.
> > >
> > > Yes, we should follow the rule that it's a kernel bug but it doesn't
> > > help the user that a newly installed kernel causes user space to no
> > > longer reach a prompt. Hence the proposal of an opt-in via personality
> > > (for MTE we would need an explicit opt-in by the user anyway since the
> > > top byte is no longer ignored but checked against the allocation tag).
> >
> > but realistically would this actually get used in this way? or would
> > any given system either be MTE or non-MTE. in which case a kernel
> > configuration option would seem to make more sense. (because either
> > way, the hypothetical user basically needs to recompile the kernel to
> > get back on their feet. or all of userspace.)
>
> The two hard requirements I have for supporting any new hardware feature
> in Linux are (1) a single kernel image binary continues to run on old
> hardware while making use of the new feature if available and (2) old
> user space continues to run on new hardware while new user space can
> take advantage of the new feature.
>
> The distro user space usually has a hard requirement that it continues
> to run on (certain) old hardware. We can't enforce this in the kernel
> but we offer the option to user space developers of checking feature
> availability through HWCAP bits.
>
> The Android story may be different as you have more control about which
> kernel configurations are deployed on specific SoCs. I'm looking more
> from a Linux distro angle where you just get an off-the-shelf OS image
> and install it on your hardware, either taking advantage of new features
> or just not using them if the software was not updated. Or, if updated
> software is installed on old hardware, it would just run.
>
> For MTE, we just can't enable it by default since there are applications
> who use the top byte of a pointer and expect it to be ignored rather
> than failing with a mismatched tag. Just think of a hwasan compiled
> binary where TBI is expected to work and you try to run it with MTE
> turned on.
>
> I would also expect the C library or dynamic loader to check for the
> presence of a HWCAP_MTE bit before starting to tag memory allocations,
> otherwise it would get SIGILL on the first MTE instruction it tries to
> execute.

(a bit off-topic, but i thought the MTE instructions were encoded in
the no-op space, to avoid this?)

> > i'm not sure i see this new way for a kernel update to break my system
> > and need to be fixed forward/rolled back as any different from any of
> > the existing ways in which this can happen :-) as an end-user i have
> > to rely on whoever's sending me software updates to test adequately
> > enough that they find the problems. as an end user, there isn't any
> > difference between "my phone rebooted when i tried to take a photo
> > because of a kernel/driver leak", say, and "my phone rebooted when i
> > tried to take a photo because of missing untagging of a pointer passed
> > via ioctl".
> >
> > i suspect you and i have very different people in mind when we say "user" :-)
>
> Indeed, I think we have different users in mind. I didn't mean the end
> user who doesn't really care which C library version it's running on
> their phone but rather advanced users (not necessarily kernel
> developers) that prefer to build their own kernels with every release.
> We could extend this to kernel developers who don't have time to track
> down why a new kernel triggers lots of SIGSEGVs during boot.

i still don't see how this isn't just a regular testing/CI issue, the
same as any other kind of kernel bug. it's already the case that i can
get a bad kernel...

> --
> Catalin

Kees Cook May 22, 2019, 7:21 p.m. UTC | #11

On Wed, May 22, 2019 at 08:30:21AM -0700, enh wrote:
> On Wed, May 22, 2019 at 3:11 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Tue, May 21, 2019 at 05:04:39PM -0700, Kees Cook wrote:
> > > I just want to make sure I fully understand your concern about this
> > > being an ABI break, and I work best with examples. The closest situation
> > > I can see would be:
> > >
> > > - some program has no idea about MTE
> >
> > Apart from some libraries like libc (and maybe those that handle
> > specific device ioctls), I think most programs should have no idea about
> > MTE. I wouldn't expect programmers to have to change their app just
> > because we have a new feature that colours heap allocations.

Right -- things should Just Work from the application perspective.

> obviously i'm biased as a libc maintainer, but...
> 
> i don't think it helps to move this to libc --- now you just have an
> extra dependency where to have a guaranteed working system you need to
> update your kernel and libc together. (or at least update your libc to
> understand new ioctls etc _before_ you can update your kernel.)

I think (hope?) we've all agreed that we shouldn't pass this off to
userspace. At the very least, it reduces the utility of MTE, and at worst
it complicates userspace when this is clearly a kernel/architecture issue.

> 
> > > - malloc() starts returning MTE-tagged addresses
> > > - program doesn't break from that change
> > > - program uses some syscall that is missing untagged_addr() and fails
> > > - kernel has now broken userspace that used to work
> >
> > That's one aspect though probably more of a case of plugging in a new
> > device (graphics card, network etc.) and the ioctl to the new device
> > doesn't work.

I think MTE will likely be rather like NX/PXN and SMAP/PAN: there will
be glitches, and we can disable stuff either via CONFIG or (as is more
common now) via a kernel commandline with untagged_addr() containing a
static branch, etc. But I actually don't think we need to go this route
(see below...)

> > The other is that, assuming we reach a point where the kernel entirely
> > supports this relaxed ABI, can we guarantee that it won't break in the
> > future. Let's say some subsequent kernel change (some refactoring)
> > misses out an untagged_addr(). This renders a previously TBI/MTE-capable
> > syscall unusable. Can we rely only on testing?
> >
> > > The trouble I see with this is that it is largely theoretical and
> > > requires part of userspace to collude to start using a new CPU feature
> > > that tickles a bug in the kernel. As I understand the golden rule,
> > > this is a bug in the kernel (a missed ioctl() or such) to be fixed,
> > > not a global breaking of some userspace behavior.
> >
> > Yes, we should follow the rule that it's a kernel bug but it doesn't
> > help the user that a newly installed kernel causes user space to no
> > longer reach a prompt. Hence the proposal of an opt-in via personality
> > (for MTE we would need an explicit opt-in by the user anyway since the
> > top byte is no longer ignored but checked against the allocation tag).
> 
> but realistically would this actually get used in this way? or would
> any given system either be MTE or non-MTE. in which case a kernel
> configuration option would seem to make more sense. (because either
> way, the hypothetical user basically needs to recompile the kernel to
> get back on their feet. or all of userspace.)

Right: the point is to design things so that we do our best to not break
userspace that is using the new feature (which I think this series has
done well). But supporting MTE/TBI is just like supporting PAN: if someone
refactors a driver and swaps a copy_from_user() to a memcpy(), it's going
to break under PAN. There will be the same long tail of these bugs like
any other, but my sense is that they are small and rare. But I agree:
they're going to be pretty weird bugs to track down. The final result,
however, will be excellent annotation in the kernel for where userspace
addresses get used and people make assumptions about them.

The sooner we get the series landed and gain QEMU support (or real
hardware), the faster we can hammer out these missed corner-cases.
What's the timeline for either of those things, BTW?

> > > I feel like I'm missing something about this being seen as an ABI
> > > break. The kernel already fails on userspace addresses that have high
> > > bits set -- are there things that _depend_ on this failure to operate?
> >
> > It's about providing a relaxed ABI which allows non-zero top byte and
> > breaking it later inadvertently without having something better in place
> > to analyse the kernel changes.

It sounds like the question is how to switch a process in or out of this
ABI (but I don't think that's the real issue: I think it's just a matter
of whether or not a process uses tags at all). Doing it at the prctl()
level doesn't make sense to me, except maybe to detect MTE support or
something. ("Should I tag allocations?") And that state is controlled
by the kernel: the kernel does it or it doesn't.

If a process wants to not tag, that's also up to the allocator where
it can decide not to ask the kernel, and just not tag. Nothing breaks in
userspace if a process is NOT tagging and untagged_addr() exists or is
missing. This, I think, is the core way this doesn't trip over the
golden rule: an old system image will run fine (because it's not
tagging). A *new* system may encounter bugs with tagging because it's a
new feature: this is The Way Of Things. But we don't break old userspace
because old userspace isn't using tags.

So the agreement appears to be between the kernel and the allocator.
Kernel says "I support this" or not. Telling the allocator to not tag if
something breaks sounds like an entirely userspace decision, yes?

enh May 22, 2019, 8:15 p.m. UTC | #12

On Wed, May 22, 2019 at 12:21 PM Kees Cook <keescook@chromium.org> wrote:
>
> On Wed, May 22, 2019 at 08:30:21AM -0700, enh wrote:
> > On Wed, May 22, 2019 at 3:11 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > On Tue, May 21, 2019 at 05:04:39PM -0700, Kees Cook wrote:
> > > > I just want to make sure I fully understand your concern about this
> > > > being an ABI break, and I work best with examples. The closest situation
> > > > I can see would be:
> > > >
> > > > - some program has no idea about MTE
> > >
> > > Apart from some libraries like libc (and maybe those that handle
> > > specific device ioctls), I think most programs should have no idea about
> > > MTE. I wouldn't expect programmers to have to change their app just
> > > because we have a new feature that colours heap allocations.
>
> Right -- things should Just Work from the application perspective.
>
> > obviously i'm biased as a libc maintainer, but...
> >
> > i don't think it helps to move this to libc --- now you just have an
> > extra dependency where to have a guaranteed working system you need to
> > update your kernel and libc together. (or at least update your libc to
> > understand new ioctls etc _before_ you can update your kernel.)
>
> I think (hope?) we've all agreed that we shouldn't pass this off to
> userspace. At the very least, it reduces the utility of MTE, and at worst
> it complicates userspace when this is clearly a kernel/architecture issue.
>
> >
> > > > - malloc() starts returning MTE-tagged addresses
> > > > - program doesn't break from that change
> > > > - program uses some syscall that is missing untagged_addr() and fails
> > > > - kernel has now broken userspace that used to work
> > >
> > > That's one aspect though probably more of a case of plugging in a new
> > > device (graphics card, network etc.) and the ioctl to the new device
> > > doesn't work.
>
> I think MTE will likely be rather like NX/PXN and SMAP/PAN: there will
> be glitches, and we can disable stuff either via CONFIG or (as is more
> common now) via a kernel commandline with untagged_addr() containing a
> static branch, etc. But I actually don't think we need to go this route
> (see below...)
>
> > > The other is that, assuming we reach a point where the kernel entirely
> > > supports this relaxed ABI, can we guarantee that it won't break in the
> > > future. Let's say some subsequent kernel change (some refactoring)
> > > misses out an untagged_addr(). This renders a previously TBI/MTE-capable
> > > syscall unusable. Can we rely only on testing?
> > >
> > > > The trouble I see with this is that it is largely theoretical and
> > > > requires part of userspace to collude to start using a new CPU feature
> > > > that tickles a bug in the kernel. As I understand the golden rule,
> > > > this is a bug in the kernel (a missed ioctl() or such) to be fixed,
> > > > not a global breaking of some userspace behavior.
> > >
> > > Yes, we should follow the rule that it's a kernel bug but it doesn't
> > > help the user that a newly installed kernel causes user space to no
> > > longer reach a prompt. Hence the proposal of an opt-in via personality
> > > (for MTE we would need an explicit opt-in by the user anyway since the
> > > top byte is no longer ignored but checked against the allocation tag).
> >
> > but realistically would this actually get used in this way? or would
> > any given system either be MTE or non-MTE. in which case a kernel
> > configuration option would seem to make more sense. (because either
> > way, the hypothetical user basically needs to recompile the kernel to
> > get back on their feet. or all of userspace.)
>
> Right: the point is to design things so that we do our best to not break
> userspace that is using the new feature (which I think this series has
> done well). But supporting MTE/TBI is just like supporting PAN: if someone
> refactors a driver and swaps a copy_from_user() to a memcpy(), it's going
> to break under PAN. There will be the same long tail of these bugs like
> any other, but my sense is that they are small and rare. But I agree:
> they're going to be pretty weird bugs to track down. The final result,
> however, will be excellent annotation in the kernel for where userspace
> addresses get used and people make assumptions about them.
>
> The sooner we get the series landed and gain QEMU support (or real
> hardware), the faster we can hammer out these missed corner-cases.
> What's the timeline for either of those things, BTW?
>
> > > > I feel like I'm missing something about this being seen as an ABI
> > > > break. The kernel already fails on userspace addresses that have high
> > > > bits set -- are there things that _depend_ on this failure to operate?
> > >
> > > It's about providing a relaxed ABI which allows non-zero top byte and
> > > breaking it later inadvertently without having something better in place
> > > to analyse the kernel changes.
>
> It sounds like the question is how to switch a process in or out of this
> ABI (but I don't think that's the real issue: I think it's just a matter
> of whether or not a process uses tags at all). Doing it at the prctl()
> level doesn't make sense to me, except maybe to detect MTE support or
> something. ("Should I tag allocations?") And that state is controlled
> by the kernel: the kernel does it or it doesn't.
>
> If a process wants to not tag, that's also up to the allocator where
> it can decide not to ask the kernel, and just not tag. Nothing breaks in
> userspace if a process is NOT tagging and untagged_addr() exists or is
> missing. This, I think, is the core way this doesn't trip over the
> golden rule: an old system image will run fine (because it's not
> tagging). A *new* system may encounter bugs with tagging because it's a
> new feature: this is The Way Of Things. But we don't break old userspace
> because old userspace isn't using tags.
>
> So the agreement appears to be between the kernel and the allocator.
> Kernel says "I support this" or not. Telling the allocator to not tag if
> something breaks sounds like an entirely userspace decision, yes?

sgtm, and the AT_FLAGS suggestion sounds fine for our needs in that regard.

> --
> Kees Cook

Kees Cook May 22, 2019, 8:47 p.m. UTC | #13

On Wed, May 22, 2019 at 05:35:27PM +0100, Catalin Marinas wrote:
> The two hard requirements I have for supporting any new hardware feature
> in Linux are (1) a single kernel image binary continues to run on old
> hardware while making use of the new feature if available and (2) old
> user space continues to run on new hardware while new user space can
> take advantage of the new feature.

Agreed! And I think the series meets these requirements, yes?

> For MTE, we just can't enable it by default since there are applications
> who use the top byte of a pointer and expect it to be ignored rather
> than failing with a mismatched tag. Just think of a hwasan compiled
> binary where TBI is expected to work and you try to run it with MTE
> turned on.

Ah! Okay, here's the use-case I wasn't thinking of: the concern is TBI
conflicting with MTE. And anything that starts using TBI suddenly can't
run in the future because it's being interpreted as MTE bits? (Is that
the ABI concern? I feel like we got into the weeds about ioctl()s and
one-off bugs...)

So there needs to be some way to let the kernel know which of three
things it should be doing:
1- leaving userspace addresses as-is (present)
2- wiping the top bits before using (this series)
3- wiping the top bits for most things, but retaining them for MTE as
   needed (the future)

I expect MTE to be the "default" in the future. Once a system's libc has
grown support for it, everything will be trying to use MTE. TBI will be
the special case (but TBI is effectively a prerequisite).

AFAICT, the only difference I see between 2 and 3 will be the tag handling
in usercopy (all other places will continue to ignore the top bits). Is
that accurate?

Is "1" a per-process state we want to keep? (I assume not, but rather it
is available via no TBI/MTE CONFIG or a boot-time option, if at all?)

To choose between "2" and "3", it seems we need a per-process flag to
opt into TBI (and out of MTE). For userspace, how would a future binary
choose TBI over MTE? If it's a library issue, we can't use an ELF bit,
since the choice may be "late" after ELF load (this implies the need
for a prctl().) If it's binary-only ("built with HWKASan") then an ELF
bit seems sufficient. And without the marking, I'd expect the kernel to
enforce MTE when there are high bits.

> I would also expect the C library or dynamic loader to check for the
> presence of a HWCAP_MTE bit before starting to tag memory allocations,
> otherwise it would get SIGILL on the first MTE instruction it tries to
> execute.

I've got the same question as Elliot: aren't MTE instructions just NOP
to older CPUs? I.e. if the CPU (or kernel) don't support it, it just
gets entirely ignored: checking is only needed to satisfy curiosity
or behavioral expectations.

To me, the conflict seems to be using TBI in the face of expecting MTE to
be the default state of the future. (But the internal changes needed
for TBI -- this series -- is a prereq for MTE.)

Evgenii Stepanov May 22, 2019, 11:03 p.m. UTC | #14

On Wed, May 22, 2019 at 1:47 PM Kees Cook <keescook@chromium.org> wrote:
>
> On Wed, May 22, 2019 at 05:35:27PM +0100, Catalin Marinas wrote:
> > The two hard requirements I have for supporting any new hardware feature
> > in Linux are (1) a single kernel image binary continues to run on old
> > hardware while making use of the new feature if available and (2) old
> > user space continues to run on new hardware while new user space can
> > take advantage of the new feature.
>
> Agreed! And I think the series meets these requirements, yes?
>
> > For MTE, we just can't enable it by default since there are applications
> > who use the top byte of a pointer and expect it to be ignored rather
> > than failing with a mismatched tag. Just think of a hwasan compiled
> > binary where TBI is expected to work and you try to run it with MTE
> > turned on.
>
> Ah! Okay, here's the use-case I wasn't thinking of: the concern is TBI
> conflicting with MTE. And anything that starts using TBI suddenly can't
> run in the future because it's being interpreted as MTE bits? (Is that
> the ABI concern? I feel like we got into the weeds about ioctl()s and
> one-off bugs...)
>
> So there needs to be some way to let the kernel know which of three
> things it should be doing:
> 1- leaving userspace addresses as-is (present)
> 2- wiping the top bits before using (this series)
> 3- wiping the top bits for most things, but retaining them for MTE as
>    needed (the future)
>
> I expect MTE to be the "default" in the future. Once a system's libc has
> grown support for it, everything will be trying to use MTE. TBI will be
> the special case (but TBI is effectively a prerequisite).
>
> AFAICT, the only difference I see between 2 and 3 will be the tag handling
> in usercopy (all other places will continue to ignore the top bits). Is
> that accurate?
>
> Is "1" a per-process state we want to keep? (I assume not, but rather it
> is available via no TBI/MTE CONFIG or a boot-time option, if at all?)
>
> To choose between "2" and "3", it seems we need a per-process flag to
> opt into TBI (and out of MTE). For userspace, how would a future binary
> choose TBI over MTE? If it's a library issue, we can't use an ELF bit,
> since the choice may be "late" after ELF load (this implies the need
> for a prctl().) If it's binary-only ("built with HWKASan") then an ELF
> bit seems sufficient. And without the marking, I'd expect the kernel to
> enforce MTE when there are high bits.
>
> > I would also expect the C library or dynamic loader to check for the
> > presence of a HWCAP_MTE bit before starting to tag memory allocations,
> > otherwise it would get SIGILL on the first MTE instruction it tries to
> > execute.
>
> I've got the same question as Elliot: aren't MTE instructions just NOP
> to older CPUs? I.e. if the CPU (or kernel) don't support it, it just
> gets entirely ignored: checking is only needed to satisfy curiosity
> or behavioral expectations.

MTE instructions are not NOP. Most of them have side effects (changing
register values, zeroing memory).
This only matters for stack tagging, though. Heap tagging is a runtime
decision in the allocator.

If an image needs to run on old hardware, it will have to do heap tagging only.

> To me, the conflict seems to be using TBI in the face of expecting MTE to
> be the default state of the future. (But the internal changes needed
> for TBI -- this series -- is a prereq for MTE.)
>
> --
> Kees Cook

enh May 22, 2019, 11:09 p.m. UTC | #15

On Wed, May 22, 2019 at 4:03 PM Evgenii Stepanov <eugenis@google.com> wrote:
>
> On Wed, May 22, 2019 at 1:47 PM Kees Cook <keescook@chromium.org> wrote:
> >
> > On Wed, May 22, 2019 at 05:35:27PM +0100, Catalin Marinas wrote:
> > > The two hard requirements I have for supporting any new hardware feature
> > > in Linux are (1) a single kernel image binary continues to run on old
> > > hardware while making use of the new feature if available and (2) old
> > > user space continues to run on new hardware while new user space can
> > > take advantage of the new feature.
> >
> > Agreed! And I think the series meets these requirements, yes?
> >
> > > For MTE, we just can't enable it by default since there are applications
> > > who use the top byte of a pointer and expect it to be ignored rather
> > > than failing with a mismatched tag. Just think of a hwasan compiled
> > > binary where TBI is expected to work and you try to run it with MTE
> > > turned on.
> >
> > Ah! Okay, here's the use-case I wasn't thinking of: the concern is TBI
> > conflicting with MTE. And anything that starts using TBI suddenly can't
> > run in the future because it's being interpreted as MTE bits? (Is that
> > the ABI concern? I feel like we got into the weeds about ioctl()s and
> > one-off bugs...)
> >
> > So there needs to be some way to let the kernel know which of three
> > things it should be doing:
> > 1- leaving userspace addresses as-is (present)
> > 2- wiping the top bits before using (this series)
> > 3- wiping the top bits for most things, but retaining them for MTE as
> >    needed (the future)
> >
> > I expect MTE to be the "default" in the future. Once a system's libc has
> > grown support for it, everything will be trying to use MTE. TBI will be
> > the special case (but TBI is effectively a prerequisite).
> >
> > AFAICT, the only difference I see between 2 and 3 will be the tag handling
> > in usercopy (all other places will continue to ignore the top bits). Is
> > that accurate?
> >
> > Is "1" a per-process state we want to keep? (I assume not, but rather it
> > is available via no TBI/MTE CONFIG or a boot-time option, if at all?)
> >
> > To choose between "2" and "3", it seems we need a per-process flag to
> > opt into TBI (and out of MTE). For userspace, how would a future binary
> > choose TBI over MTE? If it's a library issue, we can't use an ELF bit,
> > since the choice may be "late" after ELF load (this implies the need
> > for a prctl().) If it's binary-only ("built with HWKASan") then an ELF
> > bit seems sufficient. And without the marking, I'd expect the kernel to
> > enforce MTE when there are high bits.
> >
> > > I would also expect the C library or dynamic loader to check for the
> > > presence of a HWCAP_MTE bit before starting to tag memory allocations,
> > > otherwise it would get SIGILL on the first MTE instruction it tries to
> > > execute.
> >
> > I've got the same question as Elliot: aren't MTE instructions just NOP
> > to older CPUs? I.e. if the CPU (or kernel) don't support it, it just
> > gets entirely ignored: checking is only needed to satisfy curiosity
> > or behavioral expectations.
>
> MTE instructions are not NOP. Most of them have side effects (changing
> register values, zeroing memory).

no, i meant "they're encoded in a space that was previously no-ops, so
running on MTE code on old hardware doesn't cause SIGILL".

> This only matters for stack tagging, though. Heap tagging is a runtime
> decision in the allocator.
>
> If an image needs to run on old hardware, it will have to do heap tagging only.
>
> > To me, the conflict seems to be using TBI in the face of expecting MTE to
> > be the default state of the future. (But the internal changes needed
> > for TBI -- this series -- is a prereq for MTE.)
> >
> > --
> > Kees Cook

Jason Gunthorpe May 23, 2019, 12:20 a.m. UTC | #16

On Wed, May 22, 2019 at 02:49:28PM +0100, Dave Martin wrote:
> On Tue, May 21, 2019 at 03:48:56PM -0300, Jason Gunthorpe wrote:
> > On Fri, May 17, 2019 at 03:49:31PM +0100, Catalin Marinas wrote:
> > 
> > > The tagged pointers (whether hwasan or MTE) should ideally be a
> > > transparent feature for the application writer but I don't think we can
> > > solve it entirely and make it seamless for the multitude of ioctls().
> > > I'd say you only opt in to such feature if you know what you are doing
> > > and the user code takes care of specific cases like ioctl(), hence the
> > > prctl() proposal even for the hwasan.
> > 
> > I'm not sure such a dire view is warrented.. 
> > 
> > The ioctl situation is not so bad, other than a few special cases,
> > most drivers just take a 'void __user *' and pass it as an argument to
> > some function that accepts a 'void __user *'. sparse et al verify
> > this. 
> > 
> > As long as the core functions do the right thing the drivers will be
> > OK.
> > 
> > The only place things get dicy is if someone casts to unsigned long
> > (ie for vma work) but I think that reflects that our driver facing
> > APIs for VMAs are compatible with static analysis (ie I have no
> > earthly idea why get_user_pages() accepts an unsigned long), not that
> > this is too hard.
> 
> If multiple people will care about this, perhaps we should try to
> annotate types more explicitly in SYSCALL_DEFINEx() and ABI data
> structures.
> 
> For example, we could have a couple of mutually exclusive modifiers
> 
> T __object *
> T __vaddr * (or U __vaddr)
> 
> In the first case the pointer points to an object (in the C sense)
> that the call may dereference but not use for any other purpose.

How would you use these two differently?

So far the kernel has worked that __user should tag any pointer that
is from userspace and then you can't do anything with it until you
transform it into a kernel something

> to tell static analysers the real type of pointers smuggled through
> UAPI disguised as other types (*cough* KVM, etc.)

Yes, that would help alot, we often have to pass pointers through a
u64 in the uAPI, and there is no static checker support to make sure
they are run through the u64_to_user_ptr() helper.

Jason

Catalin Marinas May 23, 2019, 7:34 a.m. UTC | #17

On Wed, May 22, 2019 at 04:09:31PM -0700, enh wrote:
> On Wed, May 22, 2019 at 4:03 PM Evgenii Stepanov <eugenis@google.com> wrote:
> > On Wed, May 22, 2019 at 1:47 PM Kees Cook <keescook@chromium.org> wrote:
> > > On Wed, May 22, 2019 at 05:35:27PM +0100, Catalin Marinas wrote:
> > > > I would also expect the C library or dynamic loader to check for the
> > > > presence of a HWCAP_MTE bit before starting to tag memory allocations,
> > > > otherwise it would get SIGILL on the first MTE instruction it tries to
> > > > execute.
> > >
> > > I've got the same question as Elliot: aren't MTE instructions just NOP
> > > to older CPUs? I.e. if the CPU (or kernel) don't support it, it just
> > > gets entirely ignored: checking is only needed to satisfy curiosity
> > > or behavioral expectations.
> >
> > MTE instructions are not NOP. Most of them have side effects (changing
> > register values, zeroing memory).
> 
> no, i meant "they're encoded in a space that was previously no-ops, so
> running on MTE code on old hardware doesn't cause SIGILL".

It does result in SIGILL, there wasn't enough encoding left in the NOP
space for old/current CPU implementations (in hindsight, we should have
reserved a bigger NOP space).

As Evgenii said, the libc needs to be careful when tagging the heap as
it would cause a SIGILL if the HWCAP_MTE is not set. The standard
application doesn't need to be recompiled as it would not issue MTE
colouring instructions, just standard LDR/STR.

Stack tagging is problematic if you want to colour each frame
individually, the function prologue would need the non-NOP MTE
instructions. The best we can do here is just having the (thread) stacks
of different colours.

Dave Martin May 23, 2019, 10:42 a.m. UTC | #18

On Wed, May 22, 2019 at 09:20:52PM -0300, Jason Gunthorpe wrote:
> On Wed, May 22, 2019 at 02:49:28PM +0100, Dave Martin wrote:
> > On Tue, May 21, 2019 at 03:48:56PM -0300, Jason Gunthorpe wrote:
> > > On Fri, May 17, 2019 at 03:49:31PM +0100, Catalin Marinas wrote:
> > > 
> > > > The tagged pointers (whether hwasan or MTE) should ideally be a
> > > > transparent feature for the application writer but I don't think we can
> > > > solve it entirely and make it seamless for the multitude of ioctls().
> > > > I'd say you only opt in to such feature if you know what you are doing
> > > > and the user code takes care of specific cases like ioctl(), hence the
> > > > prctl() proposal even for the hwasan.
> > > 
> > > I'm not sure such a dire view is warrented.. 
> > > 
> > > The ioctl situation is not so bad, other than a few special cases,
> > > most drivers just take a 'void __user *' and pass it as an argument to
> > > some function that accepts a 'void __user *'. sparse et al verify
> > > this. 
> > > 
> > > As long as the core functions do the right thing the drivers will be
> > > OK.
> > > 
> > > The only place things get dicy is if someone casts to unsigned long
> > > (ie for vma work) but I think that reflects that our driver facing
> > > APIs for VMAs are compatible with static analysis (ie I have no
> > > earthly idea why get_user_pages() accepts an unsigned long), not that
> > > this is too hard.
> > 
> > If multiple people will care about this, perhaps we should try to
> > annotate types more explicitly in SYSCALL_DEFINEx() and ABI data
> > structures.
> > 
> > For example, we could have a couple of mutually exclusive modifiers
> > 
> > T __object *
> > T __vaddr * (or U __vaddr)
> > 
> > In the first case the pointer points to an object (in the C sense)
> > that the call may dereference but not use for any other purpose.
> 
> How would you use these two differently?
> 
> So far the kernel has worked that __user should tag any pointer that
> is from userspace and then you can't do anything with it until you
> transform it into a kernel something

Ultimately it would be good to disallow casting __object pointers execpt
to compatible __object pointer types, and to make get_user etc. demand
__object.

__vaddr pointers / addresses would be freely castable, but not to
__object and so would not be dereferenceable even indirectly.

Or that's the general idea.  Figuring out a sane set of rules that we
could actually check / enforce would require a bit of work.

(Whether the __vaddr base type is a pointer or an integer type is
probably moot, due to the restrictions we would place on the use of
these anyway.)

> > to tell static analysers the real type of pointers smuggled through
> > UAPI disguised as other types (*cough* KVM, etc.)
> 
> Yes, that would help alot, we often have to pass pointers through a
> u64 in the uAPI, and there is no static checker support to make sure
> they are run through the u64_to_user_ptr() helper.

Agreed.

Cheers
---Dave

Catalin Marinas May 23, 2019, 2:44 p.m. UTC | #19

On Wed, May 22, 2019 at 01:47:36PM -0700, Kees Cook wrote:
> On Wed, May 22, 2019 at 05:35:27PM +0100, Catalin Marinas wrote:
> > The two hard requirements I have for supporting any new hardware feature
> > in Linux are (1) a single kernel image binary continues to run on old
> > hardware while making use of the new feature if available and (2) old
> > user space continues to run on new hardware while new user space can
> > take advantage of the new feature.
> 
> Agreed! And I think the series meets these requirements, yes?

Yes. I mentioned this just to make sure people don't expect different
kernel builds for different hardware features.

There is also the obvious requirement which I didn't mention: new user
space continues to run on new/subsequent kernel versions. That's one of
the points of contention for this series (ignoring MTE) with the
maintainers having to guarantee this without much effort. IOW, do the
500K+ new lines in a subsequent kernel version break any user space out
there? I'm only talking about the relaxed TBI ABI. Are the usual LTP,
syskaller sufficient? Better static analysis would definitely help.

> > For MTE, we just can't enable it by default since there are applications
> > who use the top byte of a pointer and expect it to be ignored rather
> > than failing with a mismatched tag. Just think of a hwasan compiled
> > binary where TBI is expected to work and you try to run it with MTE
> > turned on.
> 
> Ah! Okay, here's the use-case I wasn't thinking of: the concern is TBI
> conflicting with MTE. And anything that starts using TBI suddenly can't
> run in the future because it's being interpreted as MTE bits? (Is that
> the ABI concern?

That's another aspect to figure out when we add the MTE support. I don't
think we'd be able to do this without an explicit opt-in by the user.

Or, if we ever want MTE to be turned on by default (i.e. tag checking),
even if everything is tagged with 0, we have to disallow TBI for user
and this includes hwasan. There were a small number of programs using
the TBI (I think some JavaScript compilers tried this). But now we are
bringing in the hwasan support and this can be a large user base. Shall
we add an ELF note for such binaries that use TBI/hwasan?

This series is still required for MTE but we may decide not to relax the
ABI blindly, therefore the opt-in (prctl) or personality idea.

> I feel like we got into the weeds about ioctl()s and one-off bugs...)

This needs solving as well. Most driver developers won't know why
untagged_addr() is needed unless we have more rigorous types or type
annotations and a tool to check them (we should probably revive the old
sparse thread).

> So there needs to be some way to let the kernel know which of three
> things it should be doing:
> 1- leaving userspace addresses as-is (present)
> 2- wiping the top bits before using (this series)

(I'd say tolerating rather than wiping since get_user still uses the tag
in the current series)

The current series does not allow any choice between 1 and 2, the
default ABI basically becomes option 2.

> 3- wiping the top bits for most things, but retaining them for MTE as
>    needed (the future)

2 and 3 are not entirely compatible as a tagged pointer may be checked
against the memory colour by the hardware. So you can't have hwasan
binary with MTE enabled.

> I expect MTE to be the "default" in the future. Once a system's libc has
> grown support for it, everything will be trying to use MTE. TBI will be
> the special case (but TBI is effectively a prerequisite).

The kernel handling of tagged pointers is indeed a prerequisite. The ABI
distinction between the above 2 and 3 needs to be solved.

> AFAICT, the only difference I see between 2 and 3 will be the tag handling
> in usercopy (all other places will continue to ignore the top bits). Is
> that accurate?

Yes, mostly (for the kernel). If MTE is enabled by default for a hwasan
binary, it will SEGFAULT (either in user space or in kernel uaccess).
How does the kernel choose between 2 and 3?

> Is "1" a per-process state we want to keep? (I assume not, but rather it
> is available via no TBI/MTE CONFIG or a boot-time option, if at all?)

Possibly, though not necessarily per process. For testing or if
something goes wrong during boot, a command line option with a static
label would do. The AT_FLAGS bit needs to be checked by user space. My
preference would be per-process.

> To choose between "2" and "3", it seems we need a per-process flag to
> opt into TBI (and out of MTE).

Or leave option 2 the default and get it to opt in to MTE.

> For userspace, how would a future binary choose TBI over MTE? If it's
> a library issue, we can't use an ELF bit, since the choice may be
> "late" after ELF load (this implies the need for a prctl().) If it's
> binary-only ("built with HWKASan") then an ELF bit seems sufficient.
> And without the marking, I'd expect the kernel to enforce MTE when
> there are high bits.

The current plan is that a future binary issues a prctl(), after
checking the HWCAP_MTE bit (as I replied to Elliot, the MTE instructions
are not in the current NOP space). I'd expect this to be done by the
libc or dynamic loader under the assumption that the binaries it loads
do _not_ use the top pointer byte for anything else. With hwasan
compiled objects this gets more confusing (any ELF note to identify
them?).

(there is also the risk of existing applications using TBI already but
I'm not aware of any still using this feature other than hwasan)

Catalin Marinas May 23, 2019, 3:08 p.m. UTC | #20

On Wed, May 22, 2019 at 12:21:27PM -0700, Kees Cook wrote:
> If a process wants to not tag, that's also up to the allocator where
> it can decide not to ask the kernel, and just not tag. Nothing breaks in
> userspace if a process is NOT tagging and untagged_addr() exists or is
> missing. This, I think, is the core way this doesn't trip over the
> golden rule: an old system image will run fine (because it's not
> tagging). A *new* system may encounter bugs with tagging because it's a
> new feature: this is The Way Of Things. But we don't break old userspace
> because old userspace isn't using tags.

With this series and hwasan binaries, at some point in the future they
will be considered "old userspace" and they do use pointer tags which
expect to be ignored by both the hardware and the kernel. MTE breaks
this assumption.

Catalin Marinas May 23, 2019, 3:21 p.m. UTC | #21

On Wed, May 22, 2019 at 09:58:22AM -0700, enh wrote:
> i was questioning the argument about the ioctl issues, and saying that
> from my perspective, untagging bugs are not really any different than
> any other kind of kernel bug.

Once this series gets in, they are indeed just kernel bugs. What I want
is an easier way to identify them, ideally before they trigger in the
field.

> i still don't see how this isn't just a regular testing/CI issue, the
> same as any other kind of kernel bug. it's already the case that i can
> get a bad kernel...

The testing would have a smaller code coverage in terms of drivers,
filesystems than something like a static checker (though one does not
exclude the other).

enh May 23, 2019, 3:44 p.m. UTC | #22

On Thu, May 23, 2019 at 7:45 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Wed, May 22, 2019 at 01:47:36PM -0700, Kees Cook wrote:
> > On Wed, May 22, 2019 at 05:35:27PM +0100, Catalin Marinas wrote:
> > > The two hard requirements I have for supporting any new hardware feature
> > > in Linux are (1) a single kernel image binary continues to run on old
> > > hardware while making use of the new feature if available and (2) old
> > > user space continues to run on new hardware while new user space can
> > > take advantage of the new feature.
> >
> > Agreed! And I think the series meets these requirements, yes?
>
> Yes. I mentioned this just to make sure people don't expect different
> kernel builds for different hardware features.
>
> There is also the obvious requirement which I didn't mention: new user
> space continues to run on new/subsequent kernel versions. That's one of
> the points of contention for this series (ignoring MTE) with the
> maintainers having to guarantee this without much effort. IOW, do the
> 500K+ new lines in a subsequent kernel version break any user space out
> there? I'm only talking about the relaxed TBI ABI. Are the usual LTP,
> syskaller sufficient? Better static analysis would definitely help.
>
> > > For MTE, we just can't enable it by default since there are applications
> > > who use the top byte of a pointer and expect it to be ignored rather
> > > than failing with a mismatched tag. Just think of a hwasan compiled
> > > binary where TBI is expected to work and you try to run it with MTE
> > > turned on.
> >
> > Ah! Okay, here's the use-case I wasn't thinking of: the concern is TBI
> > conflicting with MTE. And anything that starts using TBI suddenly can't
> > run in the future because it's being interpreted as MTE bits? (Is that
> > the ABI concern?
>
> That's another aspect to figure out when we add the MTE support. I don't
> think we'd be able to do this without an explicit opt-in by the user.
>
> Or, if we ever want MTE to be turned on by default (i.e. tag checking),
> even if everything is tagged with 0, we have to disallow TBI for user
> and this includes hwasan. There were a small number of programs using
> the TBI (I think some JavaScript compilers tried this). But now we are
> bringing in the hwasan support and this can be a large user base. Shall
> we add an ELF note for such binaries that use TBI/hwasan?
>
> This series is still required for MTE but we may decide not to relax the
> ABI blindly, therefore the opt-in (prctl) or personality idea.
>
> > I feel like we got into the weeds about ioctl()s and one-off bugs...)
>
> This needs solving as well. Most driver developers won't know why
> untagged_addr() is needed unless we have more rigorous types or type
> annotations and a tool to check them (we should probably revive the old
> sparse thread).
>
> > So there needs to be some way to let the kernel know which of three
> > things it should be doing:
> > 1- leaving userspace addresses as-is (present)
> > 2- wiping the top bits before using (this series)
>
> (I'd say tolerating rather than wiping since get_user still uses the tag
> in the current series)
>
> The current series does not allow any choice between 1 and 2, the
> default ABI basically becomes option 2.
>
> > 3- wiping the top bits for most things, but retaining them for MTE as
> >    needed (the future)
>
> 2 and 3 are not entirely compatible as a tagged pointer may be checked
> against the memory colour by the hardware. So you can't have hwasan
> binary with MTE enabled.
>
> > I expect MTE to be the "default" in the future. Once a system's libc has
> > grown support for it, everything will be trying to use MTE. TBI will be
> > the special case (but TBI is effectively a prerequisite).
>
> The kernel handling of tagged pointers is indeed a prerequisite. The ABI
> distinction between the above 2 and 3 needs to be solved.
>
> > AFAICT, the only difference I see between 2 and 3 will be the tag handling
> > in usercopy (all other places will continue to ignore the top bits). Is
> > that accurate?
>
> Yes, mostly (for the kernel). If MTE is enabled by default for a hwasan
> binary, it will SEGFAULT (either in user space or in kernel uaccess).
> How does the kernel choose between 2 and 3?
>
> > Is "1" a per-process state we want to keep? (I assume not, but rather it
> > is available via no TBI/MTE CONFIG or a boot-time option, if at all?)
>
> Possibly, though not necessarily per process. For testing or if
> something goes wrong during boot, a command line option with a static
> label would do. The AT_FLAGS bit needs to be checked by user space. My
> preference would be per-process.
>
> > To choose between "2" and "3", it seems we need a per-process flag to
> > opt into TBI (and out of MTE).
>
> Or leave option 2 the default and get it to opt in to MTE.
>
> > For userspace, how would a future binary choose TBI over MTE? If it's
> > a library issue, we can't use an ELF bit, since the choice may be
> > "late" after ELF load (this implies the need for a prctl().) If it's
> > binary-only ("built with HWKASan") then an ELF bit seems sufficient.
> > And without the marking, I'd expect the kernel to enforce MTE when
> > there are high bits.
>
> The current plan is that a future binary issues a prctl(), after
> checking the HWCAP_MTE bit (as I replied to Elliot, the MTE instructions
> are not in the current NOP space). I'd expect this to be done by the
> libc or dynamic loader under the assumption that the binaries it loads
> do _not_ use the top pointer byte for anything else.

yeah, it sounds like to support hwasan and MTE, the dynamic linker
will need to not use either itself.

> With hwasan
> compiled objects this gets more confusing (any ELF note to identify
> them?).

no, at the moment code that wants to know checks for the presence of
__hwasan_init. (and bionic doesn't actually look at any ELF notes
right now.) but we can always add something if we need to.

> (there is also the risk of existing applications using TBI already but
> I'm not aware of any still using this feature other than hwasan)
>
> --
> Catalin

Kees Cook May 23, 2019, 4:38 p.m. UTC | #23

On Thu, May 23, 2019 at 03:44:49PM +0100, Catalin Marinas wrote:
> There is also the obvious requirement which I didn't mention: new user
> space continues to run on new/subsequent kernel versions. That's one of
> the points of contention for this series (ignoring MTE) with the
> maintainers having to guarantee this without much effort. IOW, do the
> 500K+ new lines in a subsequent kernel version break any user space out
> there? I'm only talking about the relaxed TBI ABI. Are the usual LTP,
> syskaller sufficient? Better static analysis would definitely help.

We can't have perfect coverage of people actively (or accidentally)
working to trick static analyzers (and the compiler) into "forgetting"
about a __user annotation. We can certainly improve analysis (I see
the other sub-thread on that), but I'd like that work not to block
this series.

What on this front would you be comfortable with? Given it's a new
feature isn't it sufficient to have a CONFIG (and/or boot option)?

> Or, if we ever want MTE to be turned on by default (i.e. tag checking),
> even if everything is tagged with 0, we have to disallow TBI for user
> and this includes hwasan. There were a small number of programs using
> the TBI (I think some JavaScript compilers tried this). But now we are
> bringing in the hwasan support and this can be a large user base. Shall
> we add an ELF note for such binaries that use TBI/hwasan?

Just to be clear, you say "disallow TBI for user" -- you mean a
particular process, yes? i.e. there is no architectural limitation that
says once we're using MTE nothing can switch to TBI. i.e. a process is
either doing MTE or TBI (or nothing, but that's the same as TBI).

> This needs solving as well. Most driver developers won't know why
> untagged_addr() is needed unless we have more rigorous types or type
> annotations and a tool to check them (we should probably revive the old
> sparse thread).

This seems like a parallel concern: we can do that separately from this
series. Without landing it, is it much harder for people to test it,
look for bugs, help with types/annotations, etc.

> > So there needs to be some way to let the kernel know which of three
> > things it should be doing:
> > 1- leaving userspace addresses as-is (present)
> > 2- wiping the top bits before using (this series)
> 
> (I'd say tolerating rather than wiping since get_user still uses the tag
> in the current series)
> 
> The current series does not allow any choice between 1 and 2, the
> default ABI basically becomes option 2.

What about testing tools that intentionally insert high bits for syscalls
and are _expecting_ them to fail? It seems the TBI series will break them?
In that case, do we need to opt into TBI as well?

> > 3- wiping the top bits for most things, but retaining them for MTE as
> >    needed (the future)
> 
> 2 and 3 are not entirely compatible as a tagged pointer may be checked
> against the memory colour by the hardware. So you can't have hwasan
> binary with MTE enabled.

Right: a process must be either MTE or TBI, not both.

> > I expect MTE to be the "default" in the future. Once a system's libc has
> > grown support for it, everything will be trying to use MTE. TBI will be
> > the special case (but TBI is effectively a prerequisite).
> 
> The kernel handling of tagged pointers is indeed a prerequisite. The ABI
> distinction between the above 2 and 3 needs to be solved.

Does that need solving now or when the MTE series appears? As there is
no reason to distinguish between "normal" and "TBI", that doesn't seem
to need solving at this point?

> > AFAICT, the only difference I see between 2 and 3 will be the tag handling
> > in usercopy (all other places will continue to ignore the top bits). Is
> > that accurate?
> 
> Yes, mostly (for the kernel). If MTE is enabled by default for a hwasan
> binary, it will SEGFAULT (either in user space or in kernel uaccess).
> How does the kernel choose between 2 and 3?

Right -- that was my question as well.

> > Is "1" a per-process state we want to keep? (I assume not, but rather it
> > is available via no TBI/MTE CONFIG or a boot-time option, if at all?)
> 
> Possibly, though not necessarily per process. For testing or if
> something goes wrong during boot, a command line option with a static
> label would do. The AT_FLAGS bit needs to be checked by user space. My
> preference would be per-process.

I would agree.

> > To choose between "2" and "3", it seems we need a per-process flag to
> > opt into TBI (and out of MTE).
> 
> Or leave option 2 the default and get it to opt in to MTE.

Given that MTE has to "start" at some point in the binary lifetime, I'm
fine with opting into MTE. I do expect, though, this will feel redundant
in a couple years as everything will immediately opt-in. But, okay, this
is therefore an issue for the MTE series.

> The current plan is that a future binary issues a prctl(), after
> checking the HWCAP_MTE bit (as I replied to Elliot, the MTE instructions
> are not in the current NOP space). I'd expect this to be done by the
> libc or dynamic loader under the assumption that the binaries it loads
> do _not_ use the top pointer byte for anything else. With hwasan
> compiled objects this gets more confusing (any ELF note to identify
> them?).

Okay, sounds fine.

> (there is also the risk of existing applications using TBI already but
> I'm not aware of any still using this feature other than hwasan)

Correct.


Alright, the tl;dr appears to be:
- you want more assurances that we can find __user stripping in the
  kernel more easily. (But this seems like a parallel problem.)
- we might need to opt in to TBI with a prctl()
- all other concerns are for the future MTE series (though it sounds
  like HWCAP_MTE and a prctl() solve those issues too).

Is this accurate? What do you see as the blockers for this series at
this point?

Catalin Marinas May 23, 2019, 4:57 p.m. UTC | #24

On Thu, May 23, 2019 at 11:42:57AM +0100, Dave P Martin wrote:
> On Wed, May 22, 2019 at 09:20:52PM -0300, Jason Gunthorpe wrote:
> > On Wed, May 22, 2019 at 02:49:28PM +0100, Dave Martin wrote:
> > > If multiple people will care about this, perhaps we should try to
> > > annotate types more explicitly in SYSCALL_DEFINEx() and ABI data
> > > structures.
> > > 
> > > For example, we could have a couple of mutually exclusive modifiers
> > > 
> > > T __object *
> > > T __vaddr * (or U __vaddr)
> > > 
> > > In the first case the pointer points to an object (in the C sense)
> > > that the call may dereference but not use for any other purpose.
> > 
> > How would you use these two differently?
> > 
> > So far the kernel has worked that __user should tag any pointer that
> > is from userspace and then you can't do anything with it until you
> > transform it into a kernel something
> 
> Ultimately it would be good to disallow casting __object pointers execpt
> to compatible __object pointer types, and to make get_user etc. demand
> __object.
> 
> __vaddr pointers / addresses would be freely castable, but not to
> __object and so would not be dereferenceable even indirectly.

I think it gets too complicated and there are ambiguous cases that we
may not be able to distinguish. For example copy_from_user() may be used
to copy a user data structure into the kernel, hence __object would
work, while the same function may be used to copy opaque data to a file,
so __vaddr may be a better option (unless I misunderstood your
proposal).

We currently have T __user * and I think it's a good starting point. The
prior attempt [1] was shut down because it was just hiding the cast
using __force. We'd need to work through those cases again and rather
start changing the function prototypes to avoid unnecessary casting in
the callers (e.g. get_user_pages(void __user *) or come up with a new
type) while changing the explicit casting to a macro where it needs to
be obvious that we are converting a user pointer, potentially typed
(tagged), to an untyped address range. We may need a user_ptr_to_ulong()
macro or similar (it seems that we have a u64_to_user_ptr, wasn't aware
of it).

It may actually not be far from what you suggested but I'd keep the
current T __user * to denote possible dereference.

[1] https://lore.kernel.org/lkml/5d54526e5ff2e5ad63d0dfdd9ab17cf359afa4f2.1535629099.git.andreyknvl@google.com/

Catalin Marinas May 23, 2019, 5 p.m. UTC | #25

On Thu, May 23, 2019 at 08:44:12AM -0700, enh wrote:
> On Thu, May 23, 2019 at 7:45 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Wed, May 22, 2019 at 01:47:36PM -0700, Kees Cook wrote:
> > > For userspace, how would a future binary choose TBI over MTE? If it's
> > > a library issue, we can't use an ELF bit, since the choice may be
> > > "late" after ELF load (this implies the need for a prctl().) If it's
> > > binary-only ("built with HWKASan") then an ELF bit seems sufficient.
> > > And without the marking, I'd expect the kernel to enforce MTE when
> > > there are high bits.
> >
> > The current plan is that a future binary issues a prctl(), after
> > checking the HWCAP_MTE bit (as I replied to Elliot, the MTE instructions
> > are not in the current NOP space). I'd expect this to be done by the
> > libc or dynamic loader under the assumption that the binaries it loads
> > do _not_ use the top pointer byte for anything else.
> 
> yeah, it sounds like to support hwasan and MTE, the dynamic linker
> will need to not use either itself.
> 
> > With hwasan compiled objects this gets more confusing (any ELF note
> > to identify them?).
> 
> no, at the moment code that wants to know checks for the presence of
> __hwasan_init. (and bionic doesn't actually look at any ELF notes
> right now.) but we can always add something if we need to.

It's a userspace decision to make. In the kernel, we are proposing that
bionic calls a prctl() to enable MTE explicitly. It could first check
for the presence of __hwasan_init.

Catalin Marinas May 23, 2019, 5:43 p.m. UTC | #26

On Thu, May 23, 2019 at 09:38:19AM -0700, Kees Cook wrote:
> On Thu, May 23, 2019 at 03:44:49PM +0100, Catalin Marinas wrote:
> > There is also the obvious requirement which I didn't mention: new user
> > space continues to run on new/subsequent kernel versions. That's one of
> > the points of contention for this series (ignoring MTE) with the
> > maintainers having to guarantee this without much effort. IOW, do the
> > 500K+ new lines in a subsequent kernel version break any user space out
> > there? I'm only talking about the relaxed TBI ABI. Are the usual LTP,
> > syskaller sufficient? Better static analysis would definitely help.
> 
> We can't have perfect coverage of people actively (or accidentally)
> working to trick static analyzers (and the compiler) into "forgetting"
> about a __user annotation. We can certainly improve analysis (I see
> the other sub-thread on that), but I'd like that work not to block
> this series.

I don't want to block this series either as it's a prerequisite for MTE
but at the moment I'm not confident we tracked down all the places where
things can break and what the future effort will be to analyse new
kernel changes.

We've made lots of progress since March last year (I think when the
first version was posted) by both identifying new places in the kernel
that need untagging and limiting the ranges that user space can tag
(e.g. an mmap() on a device should not allow tagged pointers). Can we
say we are done or more investigation is needed?

> What on this front would you be comfortable with? Given it's a new
> feature isn't it sufficient to have a CONFIG (and/or boot option)?

I'd rather avoid re-building kernels. A boot option would do, unless we
see value in a per-process (inherited) personality or prctl. The
per-process option has the slight advantage that I can boot my machine
normally while being able to run LTP with a relaxed ABI (and a new libc
which tags pointers). I admit I don't have a very strong argument on a
per-process opt-in.

Another option would be a sysctl control together with a cmdline option.

> > Or, if we ever want MTE to be turned on by default (i.e. tag checking),
> > even if everything is tagged with 0, we have to disallow TBI for user
> > and this includes hwasan. There were a small number of programs using
> > the TBI (I think some JavaScript compilers tried this). But now we are
> > bringing in the hwasan support and this can be a large user base. Shall
> > we add an ELF note for such binaries that use TBI/hwasan?
> 
> Just to be clear, you say "disallow TBI for user" -- you mean a
> particular process, yes? i.e. there is no architectural limitation that
> says once we're using MTE nothing can switch to TBI. i.e. a process is
> either doing MTE or TBI (or nothing, but that's the same as TBI).

I may have answered this below. The architecture does not allow TBI
(i.e. faults) if MTE is enabled for a process and address range.

> > > So there needs to be some way to let the kernel know which of three
> > > things it should be doing:
> > > 1- leaving userspace addresses as-is (present)
> > > 2- wiping the top bits before using (this series)
> > 
> > (I'd say tolerating rather than wiping since get_user still uses the tag
> > in the current series)
> > 
> > The current series does not allow any choice between 1 and 2, the
> > default ABI basically becomes option 2.
> 
> What about testing tools that intentionally insert high bits for syscalls
> and are _expecting_ them to fail? It seems the TBI series will break them?
> In that case, do we need to opt into TBI as well?

If there are such tools, then we may need a per-process control. It's
basically an ABI change.

> > > 3- wiping the top bits for most things, but retaining them for MTE as
> > >    needed (the future)
> > 
> > 2 and 3 are not entirely compatible as a tagged pointer may be checked
> > against the memory colour by the hardware. So you can't have hwasan
> > binary with MTE enabled.
> 
> Right: a process must be either MTE or TBI, not both.

Indeed.

> > > I expect MTE to be the "default" in the future. Once a system's libc has
> > > grown support for it, everything will be trying to use MTE. TBI will be
> > > the special case (but TBI is effectively a prerequisite).
> > 
> > The kernel handling of tagged pointers is indeed a prerequisite. The ABI
> > distinction between the above 2 and 3 needs to be solved.
> 
> Does that need solving now or when the MTE series appears? As there is
> no reason to distinguish between "normal" and "TBI", that doesn't seem
> to need solving at this point?

We don't need to solve 2 vs 3 as long as option 3 is an explicit opt-in
(prctl) by the user. Otherwise the kernel cannot guess whether the user
is using TBI or not (and if it does and still opts in to MTE, we can say
it's a user problem ;)).

> > > To choose between "2" and "3", it seems we need a per-process flag to
> > > opt into TBI (and out of MTE).
> > 
> > Or leave option 2 the default and get it to opt in to MTE.
> 
> Given that MTE has to "start" at some point in the binary lifetime, I'm
> fine with opting into MTE. I do expect, though, this will feel redundant
> in a couple years as everything will immediately opt-in. But, okay, this
> is therefore an issue for the MTE series.

Unless Google phases out hwasan soon ;), they may live in parallel for a
while, so a choice of TBI vs MTE.

> Alright, the tl;dr appears to be:
> - you want more assurances that we can find __user stripping in the
>   kernel more easily. (But this seems like a parallel problem.)

Yes, and that we found all (most) cases now. The reason I don't see it
as a parallel problem is that, as maintainer, I promise an ABI to user
and I'd rather stick to it. I don't want, for example, ncurses to stop
working because of some ioctl() rejecting tagged pointers.

If it's just the occasional one-off bug I'm fine to deal with it. But
no-one convinced me yet that this is the case.

As for the generic driver code (filesystems or other subsystems),
without some clear direction for developers, together with static
checking/sparse, on how user pointers are cast to longs (one example),
it would become my responsibility to identify and fix them up with any
kernel release. This series is not providing such guidance, just adding
untagged_addr() in some places that we think matter.

> - we might need to opt in to TBI with a prctl()

Yes, although still up for discussion.

> - all other concerns are for the future MTE series (though it sounds
>   like HWCAP_MTE and a prctl() solve those issues too).

Indeed.

> Is this accurate? What do you see as the blockers for this series at
> this point?

Well, see my comment on your first bullet point above ;).

And thanks for summarising it.

Khalid Aziz May 23, 2019, 5:51 p.m. UTC | #27

On 5/21/19 6:04 PM, Kees Cook wrote:
> As an aside: I think Sparc ADI support in Linux actually side-stepped
> this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must
> be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI
> for kernel code.") I think this was a mistake we should not repeat for
> arm64 (we do seem to be at least in agreement about this, I think).
> 
> [1] https://lore.kernel.org/patchwork/patch/654481/

That is a very early version of the sparc ADI patch. Support for tagged
addresses in syscalls was added in later versions and is in the patch
that is in the kernel.

That part "Kernel does not enable ADI for kernel code." is correct. It
is a possible enhancement for future.

--
Khalid

Catalin Marinas May 23, 2019, 8:11 p.m. UTC | #28

Hi Khalid,

On Thu, May 23, 2019 at 11:51:40AM -0600, Khalid Aziz wrote:
> On 5/21/19 6:04 PM, Kees Cook wrote:
> > As an aside: I think Sparc ADI support in Linux actually side-stepped
> > this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must
> > be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI
> > for kernel code.") I think this was a mistake we should not repeat for
> > arm64 (we do seem to be at least in agreement about this, I think).
> > 
> > [1] https://lore.kernel.org/patchwork/patch/654481/
> 
> That is a very early version of the sparc ADI patch. Support for tagged
> addresses in syscalls was added in later versions and is in the patch
> that is in the kernel.

I tried to figure out but I'm not familiar with the sparc port. How did
you solve the tagged address going into various syscall implementations
in the kernel (e.g. sys_write)? Is the tag removed on kernel entry or it
ends up deeper in the core code?

Thanks.

Kees Cook May 23, 2019, 9:31 p.m. UTC | #29

On Thu, May 23, 2019 at 06:43:46PM +0100, Catalin Marinas wrote:
> On Thu, May 23, 2019 at 09:38:19AM -0700, Kees Cook wrote:
> > What on this front would you be comfortable with? Given it's a new
> > feature isn't it sufficient to have a CONFIG (and/or boot option)?
> 
> I'd rather avoid re-building kernels. A boot option would do, unless we
> see value in a per-process (inherited) personality or prctl. The

I think I've convinced myself that the normal<->TBI ABI control should
be a boot parameter. More below...

> > What about testing tools that intentionally insert high bits for syscalls
> > and are _expecting_ them to fail? It seems the TBI series will break them?
> > In that case, do we need to opt into TBI as well?
> 
> If there are such tools, then we may need a per-process control. It's
> basically an ABI change.

syzkaller already attempts to randomly inject non-canonical and
0xFFFF....FFFF addresses for user pointers in syscalls in an effort to
find bugs like CVE-2017-5123 where waitid() via unchecked put_user() was
able to write directly to kernel memory[1].

It seems that using TBI by default and not allowing a switch back to
"normal" ABI without a reboot actually means that userspace cannot inject
kernel pointers into syscalls any more, since they'll get universally
stripped now. Is my understanding correct, here? i.e. exploiting
CVE-2017-5123 would be impossible under TBI?

If so, then I think we should commit to the TBI ABI and have a boot
flag to disable it, but NOT have a process flag, as that would allow
attackers to bypass the masking. The only flag should be "TBI or MTE".

If so, can I get top byte masking for other architectures too? Like,
just to strip high bits off userspace addresses? ;)

(Oh, in looking I see this is implemented with sign-extension... why
not just a mask? So it'll either be valid userspace address or forced
into the non-canonical range?)

[1] https://salls.github.io/Linux-Kernel-CVE-2017-5123/

> > Alright, the tl;dr appears to be:
> > - you want more assurances that we can find __user stripping in the
> >   kernel more easily. (But this seems like a parallel problem.)
> 
> Yes, and that we found all (most) cases now. The reason I don't see it
> as a parallel problem is that, as maintainer, I promise an ABI to user
> and I'd rather stick to it. I don't want, for example, ncurses to stop
> working because of some ioctl() rejecting tagged pointers.

But this is what I don't understand: it would need to be ncurses _using
TBI_, that would stop working (having started to work before, but then
regress due to a newly added one-off bug). Regular ncurses will be fine
because it's not using TBI. So The Golden Rule isn't violated, and by
definition, it's a specific regression caused by some bug (since TBI
would have had to have worked _before_ in the situation to be considered
a regression now). Which describes the normal path for kernel
development... add feature, find corner cases where it doesn't work,
fix them, encounter new regressions, fix those, repeat forever.

> If it's just the occasional one-off bug I'm fine to deal with it. But
> no-one convinced me yet that this is the case.

You believe there still to be some systemic cases that haven't been
found yet? And even if so -- isn't it better to work on that
incrementally?

> As for the generic driver code (filesystems or other subsystems),
> without some clear direction for developers, together with static
> checking/sparse, on how user pointers are cast to longs (one example),
> it would become my responsibility to identify and fix them up with any
> kernel release. This series is not providing such guidance, just adding
> untagged_addr() in some places that we think matter.

What about adding a nice bit of .rst documentation that describes the
situation and shows how to use untagged_addr(). This is the kind of
kernel-wide change that "everyone" needs to know about, and shouldn't
be the arch maintainer's sole responsibility to fix.

> > - we might need to opt in to TBI with a prctl()
> 
> Yes, although still up for discussion.

I think I've talked myself out of it. I say boot param only! :)

So what do you say to these next steps:

- change untagged_addr() to use a static branch that is controlled with
  a boot parameter.
- add, say, Documentation/core-api/user-addresses.rst to describe
  proper care and handling of user space pointers with untagged_addr(),
  with examples based on all the cases seen so far in this series.
- continue work to improve static analysis.

Thanks for wading through this with me! :)

Khalid Aziz May 23, 2019, 9:42 p.m. UTC | #30

On 5/23/19 2:11 PM, Catalin Marinas wrote:
> Hi Khalid,
> 
> On Thu, May 23, 2019 at 11:51:40AM -0600, Khalid Aziz wrote:
>> On 5/21/19 6:04 PM, Kees Cook wrote:
>>> As an aside: I think Sparc ADI support in Linux actually side-stepped
>>> this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must
>>> be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI
>>> for kernel code.") I think this was a mistake we should not repeat for
>>> arm64 (we do seem to be at least in agreement about this, I think).
>>>
>>> [1] https://lore.kernel.org/patchwork/patch/654481/
>>
>> That is a very early version of the sparc ADI patch. Support for tagged
>> addresses in syscalls was added in later versions and is in the patch
>> that is in the kernel.
> 
> I tried to figure out but I'm not familiar with the sparc port. How did
> you solve the tagged address going into various syscall implementations
> in the kernel (e.g. sys_write)? Is the tag removed on kernel entry or it
> ends up deeper in the core code?

Tag is not removed from the user addresses. Kernel passes tagged
addresses to copy_from_user and copy_to_user. MMU checks the tag
embedded in the address when kernel accesses userspace addresses. This
maintains the ADI integrity even when userspace attempts to access any
userspace addresses through system calls.

On sparc, access_ok() is defined as:

#define access_ok(addr, size) __access_ok((unsigned long)(addr), size)
#define __access_ok(addr, size) (__user_ok((addr) & get_fs().seg, (size)))
#define __user_ok(addr, size) ({ (void)(size); (addr) < STACK_TOP; })

STACK_TOP for M7 processor (which is the first sparc processor to
support ADI) is 0xfff8000000000000UL. Tagged addresses pass the
access_ok() check fine. Any tag mismatches that happen during kernel
access to userspace addresses are handled by do_mcd_err().

--
Khalid

Khalid Aziz May 23, 2019, 9:49 p.m. UTC | #31

On 5/23/19 2:11 PM, Catalin Marinas wrote:
> Hi Khalid,
> 
> On Thu, May 23, 2019 at 11:51:40AM -0600, Khalid Aziz wrote:
>> On 5/21/19 6:04 PM, Kees Cook wrote:
>>> As an aside: I think Sparc ADI support in Linux actually side-stepped
>>> this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must
>>> be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI
>>> for kernel code.") I think this was a mistake we should not repeat for
>>> arm64 (we do seem to be at least in agreement about this, I think).
>>>
>>> [1] https://lore.kernel.org/patchwork/patch/654481/
>>
>> That is a very early version of the sparc ADI patch. Support for tagged
>> addresses in syscalls was added in later versions and is in the patch
>> that is in the kernel.
> 
> I tried to figure out but I'm not familiar with the sparc port. How did
> you solve the tagged address going into various syscall implementations
> in the kernel (e.g. sys_write)? Is the tag removed on kernel entry or it
> ends up deeper in the core code?
> 

Another spot I should point out in ADI patch - Tags are not stored in
VMAs and IOMMU does not support ADI tags on M7. ADI tags are stripped
before userspace addresses are passed to IOMMU in the following snippet
from the patch:

diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 5335ba3c850e..357b6047653a 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -201,6 +202,24 @@ int __get_user_pages_fast(unsigned long start, int
nr_pages
, int write,
        pgd_t *pgdp;
        int nr = 0;

+#ifdef CONFIG_SPARC64
+       if (adi_capable()) {
+               long addr = start;
+
+               /* If userspace has passed a versioned address, kernel
+                * will not find it in the VMAs since it does not store
+                * the version tags in the list of VMAs. Storing version
+                * tags in list of VMAs is impractical since they can be
+                * changed any time from userspace without dropping into
+                * kernel. Any address search in VMAs will be done with
+                * non-versioned addresses. Ensure the ADI version bits
+                * are dropped here by sign extending the last bit before
+                * ADI bits. IOMMU does not implement version tags.
+                */
+               addr = (addr << (long)adi_nbits()) >> (long)adi_nbits();
+               start = addr;
+       }
+#endif
        start &= PAGE_MASK;
        addr = start;
        len = (unsigned long) nr_pages << PAGE_SHIFT;


--
Khalid

Catalin Marinas May 24, 2019, 10:11 a.m. UTC | #32

On Thu, May 23, 2019 at 03:49:05PM -0600, Khalid Aziz wrote:
> On 5/23/19 2:11 PM, Catalin Marinas wrote:
> > On Thu, May 23, 2019 at 11:51:40AM -0600, Khalid Aziz wrote:
> >> On 5/21/19 6:04 PM, Kees Cook wrote:
> >>> As an aside: I think Sparc ADI support in Linux actually side-stepped
> >>> this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must
> >>> be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI
> >>> for kernel code.") I think this was a mistake we should not repeat for
> >>> arm64 (we do seem to be at least in agreement about this, I think).
> >>>
> >>> [1] https://lore.kernel.org/patchwork/patch/654481/
> >>
> >> That is a very early version of the sparc ADI patch. Support for tagged
> >> addresses in syscalls was added in later versions and is in the patch
> >> that is in the kernel.
> > 
> > I tried to figure out but I'm not familiar with the sparc port. How did
> > you solve the tagged address going into various syscall implementations
> > in the kernel (e.g. sys_write)? Is the tag removed on kernel entry or it
> > ends up deeper in the core code?
> 
> Another spot I should point out in ADI patch - Tags are not stored in
> VMAs and IOMMU does not support ADI tags on M7. ADI tags are stripped
> before userspace addresses are passed to IOMMU in the following snippet
> from the patch:
> 
> diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
> index 5335ba3c850e..357b6047653a 100644
> --- a/arch/sparc/mm/gup.c
> +++ b/arch/sparc/mm/gup.c
> @@ -201,6 +202,24 @@ int __get_user_pages_fast(unsigned long start, int
> nr_pages
> , int write,
>         pgd_t *pgdp;
>         int nr = 0;
> 
> +#ifdef CONFIG_SPARC64
> +       if (adi_capable()) {
> +               long addr = start;
> +
> +               /* If userspace has passed a versioned address, kernel
> +                * will not find it in the VMAs since it does not store
> +                * the version tags in the list of VMAs. Storing version
> +                * tags in list of VMAs is impractical since they can be
> +                * changed any time from userspace without dropping into
> +                * kernel. Any address search in VMAs will be done with
> +                * non-versioned addresses. Ensure the ADI version bits
> +                * are dropped here by sign extending the last bit before
> +                * ADI bits. IOMMU does not implement version tags.
> +                */
> +               addr = (addr << (long)adi_nbits()) >> (long)adi_nbits();
> +               start = addr;
> +       }
> +#endif
>         start &= PAGE_MASK;
>         addr = start;
>         len = (unsigned long) nr_pages << PAGE_SHIFT;

Thanks Khalid. I missed that sparc does not enable HAVE_GENERIC_GUP, so
you fix this case here. If we add the generic untagged_addr() macro in
the generic code, I think sparc can start making use of it rather than
open-coding the shifts.

There are a few other other places where tags can leak and the core code
would get confused (for example, madvise()). I presume your user space
doesn't exercise them. On arm64 we plan to just allow the C library to
tag any new memory allocation, so those core code paths would need to be
covered.

And similarly, devices, IOMMU, any DMA would ignore tags.

Catalin Marinas May 24, 2019, 11:20 a.m. UTC | #33

On Thu, May 23, 2019 at 02:31:16PM -0700, Kees Cook wrote:
> On Thu, May 23, 2019 at 06:43:46PM +0100, Catalin Marinas wrote:
> > On Thu, May 23, 2019 at 09:38:19AM -0700, Kees Cook wrote:
> > > What about testing tools that intentionally insert high bits for syscalls
> > > and are _expecting_ them to fail? It seems the TBI series will break them?
> > > In that case, do we need to opt into TBI as well?
> > 
> > If there are such tools, then we may need a per-process control. It's
> > basically an ABI change.
> 
> syzkaller already attempts to randomly inject non-canonical and
> 0xFFFF....FFFF addresses for user pointers in syscalls in an effort to
> find bugs like CVE-2017-5123 where waitid() via unchecked put_user() was
> able to write directly to kernel memory[1].
> 
> It seems that using TBI by default and not allowing a switch back to
> "normal" ABI without a reboot actually means that userspace cannot inject
> kernel pointers into syscalls any more, since they'll get universally
> stripped now. Is my understanding correct, here? i.e. exploiting
> CVE-2017-5123 would be impossible under TBI?

Unless the kernel is also using TBI (khwasan), in which case masking out
the top byte wouldn't help. Anyway, as per this discussion, we want the
tagged pointer to remain intact all the way to put_user(), so nothing
gets masked out. I don't think this would have helped with the waitid()
bug.

> If so, then I think we should commit to the TBI ABI and have a boot
> flag to disable it, but NOT have a process flag, as that would allow
> attackers to bypass the masking. The only flag should be "TBI or MTE".
> 
> If so, can I get top byte masking for other architectures too? Like,
> just to strip high bits off userspace addresses? ;)

But you didn't like my option 2 shim proposal which strips the tag on
kernel entry because it lowers the value of MTE ;).

> (Oh, in looking I see this is implemented with sign-extension... why
> not just a mask? So it'll either be valid userspace address or forced
> into the non-canonical range?)

The TTBR0/1 selection on memory accesses is done based on bit 63 if TBI
is disabled and bit 55 when enabled. Sign-extension allows us to use the
same macro for both user and kernel tagged pointers. With MTE tag 0
would be match-all for TTBR0 and 0xff for TTBR1 (so that we don't modify
the virtual address space of the kernel; I need to check the latest spec
to be sure). Note that the VA space for both user and kernel is limited
to 52-bit architecturally so, on access, bits 55..52 must be the same, 0
or 1, otherwise you get a fault.

Since the syzkaller tests would also need to set bits 55-52 (actually 48
for kernel addresses, we haven't merged the 52-bit kernel VA patches
yet) to hit a valid kernel address, I don't think ignoring the top byte
makes much difference to the expected failure scenario.

> > > Alright, the tl;dr appears to be:
> > > - you want more assurances that we can find __user stripping in the
> > >   kernel more easily. (But this seems like a parallel problem.)
> > 
> > Yes, and that we found all (most) cases now. The reason I don't see it
> > as a parallel problem is that, as maintainer, I promise an ABI to user
> > and I'd rather stick to it. I don't want, for example, ncurses to stop
> > working because of some ioctl() rejecting tagged pointers.
> 
> But this is what I don't understand: it would need to be ncurses _using
> TBI_, that would stop working (having started to work before, but then
> regress due to a newly added one-off bug). Regular ncurses will be fine
> because it's not using TBI. So The Golden Rule isn't violated,

Once we introduced TBI and the libc starts tagging heap allocations,
this becomes the new "regular" user space behaviour (i.e. using TBI). So
a new bug would break the golden rule. It could also be an old bug that
went unnoticed (i.e. you changed the graphics card and its driver gets
confused by tagged pointers coming from user-space).

> and by definition, it's a specific regression caused by some bug
> (since TBI would have had to have worked _before_ in the situation to
> be considered a regression now). Which describes the normal path for
> kernel development... add feature, find corner cases where it doesn't
> work, fix them, encounter new regressions, fix those, repeat forever.
> 
> > If it's just the occasional one-off bug I'm fine to deal with it. But
> > no-one convinced me yet that this is the case.
> 
> You believe there still to be some systemic cases that haven't been
> found yet? And even if so -- isn't it better to work on that
> incrementally?

I want some way to systematically identify potential issues (sparse?).
Since problems are most likely in drivers, I don't have all devices to
check and not all users have the knowledge to track down why something
failed.

I think we can do this incrementally as long the TBI ABI is not the
default. Even better if we made it per process.

> > As for the generic driver code (filesystems or other subsystems),
> > without some clear direction for developers, together with static
> > checking/sparse, on how user pointers are cast to longs (one example),
> > it would become my responsibility to identify and fix them up with any
> > kernel release. This series is not providing such guidance, just adding
> > untagged_addr() in some places that we think matter.
> 
> What about adding a nice bit of .rst documentation that describes the
> situation and shows how to use untagged_addr(). This is the kind of
> kernel-wide change that "everyone" needs to know about, and shouldn't
> be the arch maintainer's sole responsibility to fix.

This works (if people read it) but we also need to be more prescriptive
in how casting is done and how we differentiate between a pointer for
dereference (T __user *) and address space management (usually unsigned
long). On top of that, we'd get sparse to check for such conversions and
maybe even checkpatch for some low-hanging fruit.

> > > - we might need to opt in to TBI with a prctl()
> > 
> > Yes, although still up for discussion.
> 
> I think I've talked myself out of it. I say boot param only! :)

I hope I talked you in again ;). I don't see TBI as improving kernel
security.

> So what do you say to these next steps:
> 
> - change untagged_addr() to use a static branch that is controlled with
>   a boot parameter.

access_ok() as well.

> - add, say, Documentation/core-api/user-addresses.rst to describe
>   proper care and handling of user space pointers with untagged_addr(),
>   with examples based on all the cases seen so far in this series.

We have u64_to_user_ptr(). What about the reverse? And maybe changing
get_user_pages() to take void __user *.

> - continue work to improve static analysis.

Andrew Murray in the ARM kernel team started revisiting the old sparse
threads, let's see how it goes.

Dave Martin May 24, 2019, 2:23 p.m. UTC | #34

On Thu, May 23, 2019 at 05:57:09PM +0100, Catalin Marinas wrote:
> On Thu, May 23, 2019 at 11:42:57AM +0100, Dave P Martin wrote:
> > On Wed, May 22, 2019 at 09:20:52PM -0300, Jason Gunthorpe wrote:
> > > On Wed, May 22, 2019 at 02:49:28PM +0100, Dave Martin wrote:
> > > > If multiple people will care about this, perhaps we should try to
> > > > annotate types more explicitly in SYSCALL_DEFINEx() and ABI data
> > > > structures.
> > > > 
> > > > For example, we could have a couple of mutually exclusive modifiers
> > > > 
> > > > T __object *
> > > > T __vaddr * (or U __vaddr)
> > > > 
> > > > In the first case the pointer points to an object (in the C sense)
> > > > that the call may dereference but not use for any other purpose.
> > > 
> > > How would you use these two differently?
> > > 
> > > So far the kernel has worked that __user should tag any pointer that
> > > is from userspace and then you can't do anything with it until you
> > > transform it into a kernel something
> > 
> > Ultimately it would be good to disallow casting __object pointers execpt
> > to compatible __object pointer types, and to make get_user etc. demand
> > __object.
> > 
> > __vaddr pointers / addresses would be freely castable, but not to
> > __object and so would not be dereferenceable even indirectly.
> 
> I think it gets too complicated and there are ambiguous cases that we
> may not be able to distinguish. For example copy_from_user() may be used
> to copy a user data structure into the kernel, hence __object would
> work, while the same function may be used to copy opaque data to a file,
> so __vaddr may be a better option (unless I misunderstood your
> proposal).

Can you illustrate?  I'm not sure of your point here.

> We currently have T __user * and I think it's a good starting point. The
> prior attempt [1] was shut down because it was just hiding the cast
> using __force. We'd need to work through those cases again and rather
> start changing the function prototypes to avoid unnecessary casting in
> the callers (e.g. get_user_pages(void __user *) or come up with a new
> type) while changing the explicit casting to a macro where it needs to
> be obvious that we are converting a user pointer, potentially typed
> (tagged), to an untyped address range. We may need a user_ptr_to_ulong()
> macro or similar (it seems that we have a u64_to_user_ptr, wasn't aware
> of it).
> 
> It may actually not be far from what you suggested but I'd keep the
> current T __user * to denote possible dereference.

This may not have been clear, but __object and __vaddr would be
orthogonal to __user.  Since __object and __vaddr strictly constrain
what can be done with an lvalue, they could be cast on, but not be
cast off without __force.

Syscall arguments and pointer in ioctl structs etc. would typically
be annotated as __object __user * or __vaddr __user *.  Plain old
__user * would work as before, but would be more permissive and give
static analysers less information to go on.

Conversion or use or __object or __vaddr pointers would require specific
APIs in the kernel, so that we can be clear about the semantics.

Doing things this way would allow migration to annotation of most or all
ABI pointers with __object or __vaddr over time, but we wouldn't have to
do it all in one go.  Problem cases (which won't be the majority) could
continue to be plain __user.

This does not magically solve the challenges of MTE, but might provide
tools that are useful to help avoid bitrot and regressions over time.

I agree though that there might be a fair number of of cases that don't
conveniently fall under __object or __vaddr semantics.  It's hard to
know without trying it.

_Most_ syscall arguments seem to be fairly obviously one or another
though, and this approach has some possibility of scaling to ioctls
and other odd interfaces.

Cheers
---Dave

Khalid Aziz May 24, 2019, 2:25 p.m. UTC | #35

On 5/24/19 4:11 AM, Catalin Marinas wrote:
> On Thu, May 23, 2019 at 03:49:05PM -0600, Khalid Aziz wrote:
>> On 5/23/19 2:11 PM, Catalin Marinas wrote:
>>> On Thu, May 23, 2019 at 11:51:40AM -0600, Khalid Aziz wrote:
>>>> On 5/21/19 6:04 PM, Kees Cook wrote:
>>>>> As an aside: I think Sparc ADI support in Linux actually side-stepped
>>>>> this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must
>>>>> be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI
>>>>> for kernel code.") I think this was a mistake we should not repeat for
>>>>> arm64 (we do seem to be at least in agreement about this, I think).
>>>>>
>>>>> [1] https://lore.kernel.org/patchwork/patch/654481/
>>>>
>>>> That is a very early version of the sparc ADI patch. Support for tagged
>>>> addresses in syscalls was added in later versions and is in the patch
>>>> that is in the kernel.
>>>
>>> I tried to figure out but I'm not familiar with the sparc port. How did
>>> you solve the tagged address going into various syscall implementations
>>> in the kernel (e.g. sys_write)? Is the tag removed on kernel entry or it
>>> ends up deeper in the core code?
>>
>> Another spot I should point out in ADI patch - Tags are not stored in
>> VMAs and IOMMU does not support ADI tags on M7. ADI tags are stripped
>> before userspace addresses are passed to IOMMU in the following snippet
>> from the patch:
>>
>> diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
>> index 5335ba3c850e..357b6047653a 100644
>> --- a/arch/sparc/mm/gup.c
>> +++ b/arch/sparc/mm/gup.c
>> @@ -201,6 +202,24 @@ int __get_user_pages_fast(unsigned long start, int
>> nr_pages
>> , int write,
>>         pgd_t *pgdp;
>>         int nr = 0;
>>
>> +#ifdef CONFIG_SPARC64
>> +       if (adi_capable()) {
>> +               long addr = start;
>> +
>> +               /* If userspace has passed a versioned address, kernel
>> +                * will not find it in the VMAs since it does not store
>> +                * the version tags in the list of VMAs. Storing version
>> +                * tags in list of VMAs is impractical since they can be
>> +                * changed any time from userspace without dropping into
>> +                * kernel. Any address search in VMAs will be done with
>> +                * non-versioned addresses. Ensure the ADI version bits
>> +                * are dropped here by sign extending the last bit before
>> +                * ADI bits. IOMMU does not implement version tags.
>> +                */
>> +               addr = (addr << (long)adi_nbits()) >> (long)adi_nbits();
>> +               start = addr;
>> +       }
>> +#endif
>>         start &= PAGE_MASK;
>>         addr = start;
>>         len = (unsigned long) nr_pages << PAGE_SHIFT;
> 
> Thanks Khalid. I missed that sparc does not enable HAVE_GENERIC_GUP, so
> you fix this case here. If we add the generic untagged_addr() macro in
> the generic code, I think sparc can start making use of it rather than
> open-coding the shifts.

Hi Catalin,

Yes, that will be good. Right now addresses are untagged in sparc code
in only two spots but that will expand as we expand use of tags.
Scalabale solution is definitely better.

> 
> There are a few other other places where tags can leak and the core code
> would get confused (for example, madvise()). I presume your user space
> doesn't exercise them. On arm64 we plan to just allow the C library to
> tag any new memory allocation, so those core code paths would need to be
> covered.
> 
> And similarly, devices, IOMMU, any DMA would ignore tags.
> 

Right. You are doing lot more with tags than sparc code intended to do.
I had looked into implementing just malloc(), mmap() and possibly
shmat() in library that automatically tags pointers. Expanding tags to
any pointers in C library will require covering lot more paths in kernel.

--
Khalid

Andrey Konovalov May 28, 2019, 2:14 p.m. UTC | #36

Thanks for a lot of valuable input! I've read through all the replies
and got somewhat lost. What are the changes I need to do to this
series?

1. Should I move untagging for memory syscalls back to the generic
code so other arches would make use of it as well, or should I keep
the arm64 specific memory syscalls wrappers and address the comments
on that patch?

2. Should I make untagging opt-in and controlled by a command line argument?

3. Should I "add Documentation/core-api/user-addresses.rst to describe
proper care and handling of user space pointers with untagged_addr(),
with examples based on all the cases seen so far in this series"?
Which examples specifically should it cover?

Is there something else?

Catalin Marinas May 28, 2019, 5:02 p.m. UTC | #37

On Thu, May 23, 2019 at 02:31:16PM -0700, Kees Cook wrote:
> syzkaller already attempts to randomly inject non-canonical and
> 0xFFFF....FFFF addresses for user pointers in syscalls in an effort to
> find bugs like CVE-2017-5123 where waitid() via unchecked put_user() was
> able to write directly to kernel memory[1].
> 
> It seems that using TBI by default and not allowing a switch back to
> "normal" ABI without a reboot actually means that userspace cannot inject
> kernel pointers into syscalls any more, since they'll get universally
> stripped now. Is my understanding correct, here? i.e. exploiting
> CVE-2017-5123 would be impossible under TBI?
> 
> If so, then I think we should commit to the TBI ABI and have a boot
> flag to disable it, but NOT have a process flag, as that would allow
> attackers to bypass the masking. The only flag should be "TBI or MTE".
> 
> If so, can I get top byte masking for other architectures too? Like,
> just to strip high bits off userspace addresses? ;)

Just for fun, hack/attempt at your idea which should not interfere with
TBI. Only briefly tested on arm64 (and the s390 __TYPE_IS_PTR macro is
pretty weird ;)):

--------------------------8<---------------------------------
diff --git a/arch/s390/include/asm/compat.h b/arch/s390/include/asm/compat.h
index 63b46e30b2c3..338455a74eff 100644
--- a/arch/s390/include/asm/compat.h
+++ b/arch/s390/include/asm/compat.h
@@ -11,9 +11,6 @@
 
 #include <asm-generic/compat.h>
 
-#define __TYPE_IS_PTR(t) (!__builtin_types_compatible_p( \
-				typeof(0?(__force t)0:0ULL), u64))
-
 #define __SC_DELOUSE(t,v) ({ \
 	BUILD_BUG_ON(sizeof(t) > 4 && !__TYPE_IS_PTR(t)); \
 	(__force t)(__TYPE_IS_PTR(t) ? ((v) & 0x7fffffff) : (v)); \
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..b1b9fe8502da 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -119,8 +119,15 @@ struct io_uring_params;
 #define __TYPE_IS_L(t)	(__TYPE_AS(t, 0L))
 #define __TYPE_IS_UL(t)	(__TYPE_AS(t, 0UL))
 #define __TYPE_IS_LL(t) (__TYPE_AS(t, 0LL) || __TYPE_AS(t, 0ULL))
+#define __TYPE_IS_PTR(t) (!__builtin_types_compatible_p(typeof(0 ? (__force t)0 : 0ULL), u64))
 #define __SC_LONG(t, a) __typeof(__builtin_choose_expr(__TYPE_IS_LL(t), 0LL, 0L)) a
+#ifdef CONFIG_64BIT
+#define __SC_CAST(t, a)	(__TYPE_IS_PTR(t) \
+				? (__force t) ((__u64)a & ~(1UL << 55)) \
+				: (__force t) a)
+#else
 #define __SC_CAST(t, a)	(__force t) a
+#endif
 #define __SC_ARGS(t, a)	a
 #define __SC_TEST(t, a) (void)BUILD_BUG_ON_ZERO(!__TYPE_IS_LL(t) && sizeof(t) > sizeof(long))

Christoph Hellwig May 29, 2019, 6:11 a.m. UTC | #38

On Tue, May 28, 2019 at 04:14:45PM +0200, Andrey Konovalov wrote:
> Thanks for a lot of valuable input! I've read through all the replies
> and got somewhat lost. What are the changes I need to do to this
> series?
> 
> 1. Should I move untagging for memory syscalls back to the generic
> code so other arches would make use of it as well, or should I keep
> the arm64 specific memory syscalls wrappers and address the comments
> on that patch?

It absolutely needs to move to common code.  Having arch code leads
to pointless (often unintentional) semantic difference between
architectures, and lots of boilerplate code.

Btw, can anyone of the arm crowd or Khalid comment on the linux-mm
thread on generic gup where I'm dealing with the pre-existing ADI
case of pointer untagging?

Catalin Marinas May 29, 2019, 12:12 p.m. UTC | #39

On Tue, May 28, 2019 at 11:11:26PM -0700, Christoph Hellwig wrote:
> On Tue, May 28, 2019 at 04:14:45PM +0200, Andrey Konovalov wrote:
> > Thanks for a lot of valuable input! I've read through all the replies
> > and got somewhat lost. What are the changes I need to do to this
> > series?
> > 
> > 1. Should I move untagging for memory syscalls back to the generic
> > code so other arches would make use of it as well, or should I keep
> > the arm64 specific memory syscalls wrappers and address the comments
> > on that patch?
> 
> It absolutely needs to move to common code.  Having arch code leads
> to pointless (often unintentional) semantic difference between
> architectures, and lots of boilerplate code.

That's fine by me as long as we agree on the semantics (which shouldn't
be hard; Khalid already following up). We should probably also move the
proposed ABI document [1] into a common place (or part of since we'll
have arm64-specifics like prctl() calls to explicitly opt in to memory
tagging).

[1] https://lore.kernel.org/lkml/20190318163533.26838-1-vincenzo.frascino@arm.com/T/#u

Catalin Marinas May 30, 2019, 5:15 p.m. UTC | #40

On Tue, May 28, 2019 at 04:14:45PM +0200, Andrey Konovalov wrote:
> Thanks for a lot of valuable input! I've read through all the replies
> and got somewhat lost. What are the changes I need to do to this
> series?
> 
> 1. Should I move untagging for memory syscalls back to the generic
> code so other arches would make use of it as well, or should I keep
> the arm64 specific memory syscalls wrappers and address the comments
> on that patch?

Keep them generic again but make sure we get agreement with Khalid on
the actual ABI implications for sparc.

> 2. Should I make untagging opt-in and controlled by a command line argument?

Opt-in, yes, but per task rather than kernel command line option.
prctl() is a possibility of opting in.

> 3. Should I "add Documentation/core-api/user-addresses.rst to describe
> proper care and handling of user space pointers with untagged_addr(),
> with examples based on all the cases seen so far in this series"?
> Which examples specifically should it cover?

I think we can leave 3 for now as not too urgent. What I'd like is for
Vincenzo's TBI user ABI document to go into a more common place since we
can expand it to cover both sparc and arm64. We'd need an arm64-specific
doc as well for things like prctl() and later MTE that sparc may support
differently.

Andrey Konovalov May 31, 2019, 2:29 p.m. UTC | #41

On Thu, May 30, 2019 at 7:15 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Tue, May 28, 2019 at 04:14:45PM +0200, Andrey Konovalov wrote:
> > Thanks for a lot of valuable input! I've read through all the replies
> > and got somewhat lost. What are the changes I need to do to this
> > series?
> >
> > 1. Should I move untagging for memory syscalls back to the generic
> > code so other arches would make use of it as well, or should I keep
> > the arm64 specific memory syscalls wrappers and address the comments
> > on that patch?
>
> Keep them generic again but make sure we get agreement with Khalid on
> the actual ABI implications for sparc.

OK, will do. I find it hard to understand what the ABI implications
are. I'll post the next version without untagging in brk, mmap,
munmap, mremap (for new_address), mmap_pgoff, remap_file_pages, shmat
and shmdt.

>
> > 2. Should I make untagging opt-in and controlled by a command line argument?
>
> Opt-in, yes, but per task rather than kernel command line option.
> prctl() is a possibility of opting in.

OK. Should I store a flag somewhere in task_struct? Should it be
inheritable on clone?

>
> > 3. Should I "add Documentation/core-api/user-addresses.rst to describe
> > proper care and handling of user space pointers with untagged_addr(),
> > with examples based on all the cases seen so far in this series"?
> > Which examples specifically should it cover?
>
> I think we can leave 3 for now as not too urgent. What I'd like is for
> Vincenzo's TBI user ABI document to go into a more common place since we
> can expand it to cover both sparc and arm64. We'd need an arm64-specific
> doc as well for things like prctl() and later MTE that sparc may support
> differently.

OK.

>
> --
> Catalin

Catalin Marinas May 31, 2019, 4:19 p.m. UTC | #42

On Fri, May 31, 2019 at 04:29:10PM +0200, Andrey Konovalov wrote:
> On Thu, May 30, 2019 at 7:15 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Tue, May 28, 2019 at 04:14:45PM +0200, Andrey Konovalov wrote:
> > > Thanks for a lot of valuable input! I've read through all the replies
> > > and got somewhat lost. What are the changes I need to do to this
> > > series?
> > >
> > > 1. Should I move untagging for memory syscalls back to the generic
> > > code so other arches would make use of it as well, or should I keep
> > > the arm64 specific memory syscalls wrappers and address the comments
> > > on that patch?
> >
> > Keep them generic again but make sure we get agreement with Khalid on
> > the actual ABI implications for sparc.
> 
> OK, will do. I find it hard to understand what the ABI implications
> are. I'll post the next version without untagging in brk, mmap,
> munmap, mremap (for new_address), mmap_pgoff, remap_file_pages, shmat
> and shmdt.

It's more about not relaxing the ABI to accept non-zero top-byte unless
we have a use-case for it. For mmap() etc., I don't think that's needed
but if you think otherwise, please raise it.

> > > 2. Should I make untagging opt-in and controlled by a command line argument?
> >
> > Opt-in, yes, but per task rather than kernel command line option.
> > prctl() is a possibility of opting in.
> 
> OK. Should I store a flag somewhere in task_struct? Should it be
> inheritable on clone?

A TIF flag would do but I'd say leave it out for now (default opted in)
until we figure out the best way to do this (can be a patch on top of
this series).

Thanks.

Andrey Konovalov May 31, 2019, 4:24 p.m. UTC | #43

On Fri, May 31, 2019 at 6:20 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Fri, May 31, 2019 at 04:29:10PM +0200, Andrey Konovalov wrote:
> > On Thu, May 30, 2019 at 7:15 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > On Tue, May 28, 2019 at 04:14:45PM +0200, Andrey Konovalov wrote:
> > > > Thanks for a lot of valuable input! I've read through all the replies
> > > > and got somewhat lost. What are the changes I need to do to this
> > > > series?
> > > >
> > > > 1. Should I move untagging for memory syscalls back to the generic
> > > > code so other arches would make use of it as well, or should I keep
> > > > the arm64 specific memory syscalls wrappers and address the comments
> > > > on that patch?
> > >
> > > Keep them generic again but make sure we get agreement with Khalid on
> > > the actual ABI implications for sparc.
> >
> > OK, will do. I find it hard to understand what the ABI implications
> > are. I'll post the next version without untagging in brk, mmap,
> > munmap, mremap (for new_address), mmap_pgoff, remap_file_pages, shmat
> > and shmdt.
>
> It's more about not relaxing the ABI to accept non-zero top-byte unless
> we have a use-case for it. For mmap() etc., I don't think that's needed
> but if you think otherwise, please raise it.
>
> > > > 2. Should I make untagging opt-in and controlled by a command line argument?
> > >
> > > Opt-in, yes, but per task rather than kernel command line option.
> > > prctl() is a possibility of opting in.
> >
> > OK. Should I store a flag somewhere in task_struct? Should it be
> > inheritable on clone?
>
> A TIF flag would do but I'd say leave it out for now (default opted in)
> until we figure out the best way to do this (can be a patch on top of
> this series).

You mean leave the whole opt-in/prctl part out? So the only change
would be to move untagging for memory syscalls into generic code?

>
> Thanks.
>
> --
> Catalin

Catalin Marinas May 31, 2019, 4:46 p.m. UTC | #44

On Fri, May 31, 2019 at 06:24:06PM +0200, Andrey Konovalov wrote:
> On Fri, May 31, 2019 at 6:20 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Fri, May 31, 2019 at 04:29:10PM +0200, Andrey Konovalov wrote:
> > > On Thu, May 30, 2019 at 7:15 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > > On Tue, May 28, 2019 at 04:14:45PM +0200, Andrey Konovalov wrote:
> > > > > Thanks for a lot of valuable input! I've read through all the replies
> > > > > and got somewhat lost. What are the changes I need to do to this
> > > > > series?
> > > > >
> > > > > 1. Should I move untagging for memory syscalls back to the generic
> > > > > code so other arches would make use of it as well, or should I keep
> > > > > the arm64 specific memory syscalls wrappers and address the comments
> > > > > on that patch?
> > > >
> > > > Keep them generic again but make sure we get agreement with Khalid on
> > > > the actual ABI implications for sparc.
> > >
> > > OK, will do. I find it hard to understand what the ABI implications
> > > are. I'll post the next version without untagging in brk, mmap,
> > > munmap, mremap (for new_address), mmap_pgoff, remap_file_pages, shmat
> > > and shmdt.
> >
> > It's more about not relaxing the ABI to accept non-zero top-byte unless
> > we have a use-case for it. For mmap() etc., I don't think that's needed
> > but if you think otherwise, please raise it.
> >
> > > > > 2. Should I make untagging opt-in and controlled by a command line argument?
> > > >
> > > > Opt-in, yes, but per task rather than kernel command line option.
> > > > prctl() is a possibility of opting in.
> > >
> > > OK. Should I store a flag somewhere in task_struct? Should it be
> > > inheritable on clone?
> >
> > A TIF flag would do but I'd say leave it out for now (default opted in)
> > until we figure out the best way to do this (can be a patch on top of
> > this series).
> 
> You mean leave the whole opt-in/prctl part out? So the only change
> would be to move untagging for memory syscalls into generic code?

Yes (or just wait until next week to see if the discussion settles
down).

Kees Cook June 2, 2019, 5:06 a.m. UTC | #45

On Tue, May 28, 2019 at 06:02:45PM +0100, Catalin Marinas wrote:
> On Thu, May 23, 2019 at 02:31:16PM -0700, Kees Cook wrote:
> > syzkaller already attempts to randomly inject non-canonical and
> > 0xFFFF....FFFF addresses for user pointers in syscalls in an effort to
> > find bugs like CVE-2017-5123 where waitid() via unchecked put_user() was
> > able to write directly to kernel memory[1].
> > 
> > It seems that using TBI by default and not allowing a switch back to
> > "normal" ABI without a reboot actually means that userspace cannot inject
> > kernel pointers into syscalls any more, since they'll get universally
> > stripped now. Is my understanding correct, here? i.e. exploiting
> > CVE-2017-5123 would be impossible under TBI?
> > 
> > If so, then I think we should commit to the TBI ABI and have a boot
> > flag to disable it, but NOT have a process flag, as that would allow
> > attackers to bypass the masking. The only flag should be "TBI or MTE".
> > 
> > If so, can I get top byte masking for other architectures too? Like,
> > just to strip high bits off userspace addresses? ;)
> 
> Just for fun, hack/attempt at your idea which should not interfere with
> TBI. Only briefly tested on arm64 (and the s390 __TYPE_IS_PTR macro is
> pretty weird ;)):

OMG, this is amazing and bonkers. I love it.

> --------------------------8<---------------------------------
> diff --git a/arch/s390/include/asm/compat.h b/arch/s390/include/asm/compat.h
> index 63b46e30b2c3..338455a74eff 100644
> --- a/arch/s390/include/asm/compat.h
> +++ b/arch/s390/include/asm/compat.h
> @@ -11,9 +11,6 @@
>  
>  #include <asm-generic/compat.h>
>  
> -#define __TYPE_IS_PTR(t) (!__builtin_types_compatible_p( \
> -				typeof(0?(__force t)0:0ULL), u64))
> -
>  #define __SC_DELOUSE(t,v) ({ \
>  	BUILD_BUG_ON(sizeof(t) > 4 && !__TYPE_IS_PTR(t)); \
>  	(__force t)(__TYPE_IS_PTR(t) ? ((v) & 0x7fffffff) : (v)); \
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index e2870fe1be5b..b1b9fe8502da 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -119,8 +119,15 @@ struct io_uring_params;
>  #define __TYPE_IS_L(t)	(__TYPE_AS(t, 0L))
>  #define __TYPE_IS_UL(t)	(__TYPE_AS(t, 0UL))
>  #define __TYPE_IS_LL(t) (__TYPE_AS(t, 0LL) || __TYPE_AS(t, 0ULL))
> +#define __TYPE_IS_PTR(t) (!__builtin_types_compatible_p(typeof(0 ? (__force t)0 : 0ULL), u64))
>  #define __SC_LONG(t, a) __typeof(__builtin_choose_expr(__TYPE_IS_LL(t), 0LL, 0L)) a
> +#ifdef CONFIG_64BIT
> +#define __SC_CAST(t, a)	(__TYPE_IS_PTR(t) \
> +				? (__force t) ((__u64)a & ~(1UL << 55)) \
> +				: (__force t) a)
> +#else
>  #define __SC_CAST(t, a)	(__force t) a
> +#endif
>  #define __SC_ARGS(t, a)	a
>  #define __SC_TEST(t, a) (void)BUILD_BUG_ON_ZERO(!__TYPE_IS_LL(t) && sizeof(t) > sizeof(long))

I'm laughing, I'm crying. Now I have to go look at the disassembly to
see how this actually looks. (I mean, it _does_ solve my specific case
of the waitid() flaw, but wouldn't help with pointers deeper in structs,
etc, though TBI does, I think still help with that. I have to catch back
up on the thread...) Anyway, if it's not too expensive it'd block
reachability for those kinds of flaws.

I wonder what my chances are of actually getting this landed? :)
(Though, I guess I need to find a "VA width" macro instead of a raw 55.)

Thanks for thinking of __SC_CAST() and __TYPE_IS_PTR() together. Really
made my day. :)

[v15,00/17] arm64: untag user pointers passed to the kernel

Message

Comments