Message ID | cover.1557160186.git.andreyknvl@google.com (mailing list archive) |
---|---|
Headers | show |
Series | arm64: untag user pointers passed to the kernel | expand |
Hi Andrey, On Mon, May 06, 2019 at 06:30:46PM +0200, Andrey Konovalov wrote: > One of the alternative approaches to untagging that was considered is to > completely strip the pointer tag as the pointer enters the kernel with > some kind of a syscall wrapper, but that won't work with the countless > number of different ioctl calls. With this approach we would need a custom > wrapper for each ioctl variation, which doesn't seem practical. The more I look at this problem, the less convinced I am that we can solve it in a way that results in a stable ABI covering ioctls(). While for the Android kernel codebase it could be simpler as you don't upgrade the kernel version every 2.5 months, for the mainline kernel this doesn't scale. Any run-time checks are relatively limited in terms of drivers covered. Better static checking would be nice as a long term solution but we didn't get anywhere with the discussion last year. IMO (RFC for now), I see two ways forward: 1. Make this a user space problem and do not allow tagged pointers into the syscall ABI. A libc wrapper would have to convert structures, parameters before passing them into the kernel. Note that we can still support the hardware MTE in the kernel by enabling tagged memory ranges, saving/restoring tags etc. but not allowing tagged addresses at the syscall boundary. 2. Similar shim to the above libc wrapper but inside the kernel (arch/arm64 only; most pointer arguments could be covered with an __SC_CAST similar to the s390 one). There are two differences from what we've discussed in the past: a) this is an opt-in by the user which would have to explicitly call prctl(). If it returns -ENOTSUPP etc., the user won't be allowed to pass tagged pointers to the kernel. This would probably be the responsibility of the C lib to make sure it doesn't tag heap allocations. If the user did not opt-in, the syscalls are routed through the normal path (no untagging address shim). b) ioctl() and other blacklisted syscalls (prctl) will not accept tagged pointers (to be documented in Vicenzo's ABI patches). It doesn't solve the problems we are trying to address but 2.a saves us from blindly relaxing the ABI without knowing how to easily assess new code being merged (over 500K lines between kernel versions). Existing applications (who don't opt-in) won't inadvertently start using the new ABI which could risk becoming de-facto ABI that we need to support on the long run. Option 1 wouldn't solve the ioctl() problem either and while it makes things simpler for the kernel, I am aware that it's slightly more complicated in user space (but I really don't mind if you prefer option 1 ;)). The tagged pointers (whether hwasan or MTE) should ideally be a transparent feature for the application writer but I don't think we can solve it entirely and make it seamless for the multitude of ioctls(). I'd say you only opt in to such feature if you know what you are doing and the user code takes care of specific cases like ioctl(), hence the prctl() proposal even for the hwasan. Comments welcomed.
On Fri, May 17, 2019 at 7:49 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > Hi Andrey, > > On Mon, May 06, 2019 at 06:30:46PM +0200, Andrey Konovalov wrote: > > One of the alternative approaches to untagging that was considered is to > > completely strip the pointer tag as the pointer enters the kernel with > > some kind of a syscall wrapper, but that won't work with the countless > > number of different ioctl calls. With this approach we would need a custom > > wrapper for each ioctl variation, which doesn't seem practical. > > The more I look at this problem, the less convinced I am that we can > solve it in a way that results in a stable ABI covering ioctls(). While > for the Android kernel codebase it could be simpler as you don't upgrade > the kernel version every 2.5 months, for the mainline kernel this > doesn't scale. Any run-time checks are relatively limited in terms of > drivers covered. Better static checking would be nice as a long term > solution but we didn't get anywhere with the discussion last year. > > IMO (RFC for now), I see two ways forward: > > 1. Make this a user space problem and do not allow tagged pointers into > the syscall ABI. A libc wrapper would have to convert structures, > parameters before passing them into the kernel. Note that we can > still support the hardware MTE in the kernel by enabling tagged > memory ranges, saving/restoring tags etc. but not allowing tagged > addresses at the syscall boundary. > > 2. Similar shim to the above libc wrapper but inside the kernel > (arch/arm64 only; most pointer arguments could be covered with an > __SC_CAST similar to the s390 one). There are two differences from > what we've discussed in the past: > > a) this is an opt-in by the user which would have to explicitly call > prctl(). If it returns -ENOTSUPP etc., the user won't be allowed > to pass tagged pointers to the kernel. This would probably be the > responsibility of the C lib to make sure it doesn't tag heap > allocations. If the user did not opt-in, the syscalls are routed > through the normal path (no untagging address shim). > > b) ioctl() and other blacklisted syscalls (prctl) will not accept > tagged pointers (to be documented in Vicenzo's ABI patches). > > It doesn't solve the problems we are trying to address but 2.a saves us > from blindly relaxing the ABI without knowing how to easily assess new > code being merged (over 500K lines between kernel versions). Existing > applications (who don't opt-in) won't inadvertently start using the new > ABI which could risk becoming de-facto ABI that we need to support on > the long run. > > Option 1 wouldn't solve the ioctl() problem either and while it makes > things simpler for the kernel, I am aware that it's slightly more > complicated in user space (but I really don't mind if you prefer option > 1 ;)). > > The tagged pointers (whether hwasan or MTE) should ideally be a > transparent feature for the application writer but I don't think we can > solve it entirely and make it seamless for the multitude of ioctls(). > I'd say you only opt in to such feature if you know what you are doing > and the user code takes care of specific cases like ioctl(), hence the > prctl() proposal even for the hwasan. > > Comments welcomed. Any userspace shim approach is problematic for Android because of the apps that use raw system calls. AFAIK, all apps written in Go are in that camp - I'm not sure how common they are, but getting them all recompiled is probably not realistic. The way I see it, a patch that breaks handling of tagged pointers is not that different from, say, a patch that adds a wild pointer dereference. Both are bugs; the difference is that (a) the former breaks a relatively uncommon target and (b) it's arguably an easier mistake to make. If MTE adoption goes well, (a) will not be the case for long. This is a bit of a chicken-and-egg problem. In a world where memory allocators on one or several popular platforms generate pointers with non-zero tags, any such breakage will be caught in testing. Unfortunately to reach that state we need the kernel to start accepting tagged pointers first, and then hold on for a couple of years until userspace catches up. Perhaps we can start by whitelisting ioctls by driver?
On Mon, May 20, 2019 at 04:53:07PM -0700, Evgenii Stepanov wrote: > On Fri, May 17, 2019 at 7:49 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > IMO (RFC for now), I see two ways forward: > > > > 1. Make this a user space problem and do not allow tagged pointers into > > the syscall ABI. A libc wrapper would have to convert structures, > > parameters before passing them into the kernel. Note that we can > > still support the hardware MTE in the kernel by enabling tagged > > memory ranges, saving/restoring tags etc. but not allowing tagged > > addresses at the syscall boundary. > > > > 2. Similar shim to the above libc wrapper but inside the kernel > > (arch/arm64 only; most pointer arguments could be covered with an > > __SC_CAST similar to the s390 one). There are two differences from > > what we've discussed in the past: > > > > a) this is an opt-in by the user which would have to explicitly call > > prctl(). If it returns -ENOTSUPP etc., the user won't be allowed > > to pass tagged pointers to the kernel. This would probably be the > > responsibility of the C lib to make sure it doesn't tag heap > > allocations. If the user did not opt-in, the syscalls are routed > > through the normal path (no untagging address shim). > > > > b) ioctl() and other blacklisted syscalls (prctl) will not accept > > tagged pointers (to be documented in Vicenzo's ABI patches). [...] > Any userspace shim approach is problematic for Android because of the > apps that use raw system calls. AFAIK, all apps written in Go are in > that camp - I'm not sure how common they are, but getting them all > recompiled is probably not realistic. That's a fair point (I wasn't expecting it would get much traction anyway ;)). OTOH, it allows upstreaming of the MTE patches while we continue the discussions around TBI. > The way I see it, a patch that breaks handling of tagged pointers is > not that different from, say, a patch that adds a wild pointer > dereference. Both are bugs; the difference is that (a) the former > breaks a relatively uncommon target and (b) it's arguably an easier > mistake to make. If MTE adoption goes well, (a) will not be the case > for long. It's also the fact such patch would go unnoticed for a long time until someone exercises that code path. And when they do, the user would be pretty much in the dark trying to figure what what went wrong, why a SIGSEGV or -EFAULT happened. What's worse, we can't even say we fixed all the places where it matters in the current kernel codebase (ignoring future patches). I think we should revisit the static checking discussions we had last year. Run-time checking (even with compiler instrumentation and syzkaller fuzzing) would only cover the code paths specific to a Linux or Android installation. > This is a bit of a chicken-and-egg problem. In a world where memory > allocators on one or several popular platforms generate pointers with > non-zero tags, any such breakage will be caught in testing. > Unfortunately to reach that state we need the kernel to start > accepting tagged pointers first, and then hold on for a couple of > years until userspace catches up. Would the kernel also catch up with providing a stable ABI? Because we have two moving targets. On one hand, you have Android or some Linux distro that stick to a stable kernel version for some time, so they have better chance of clearing most of the problems. On the other hand, we have mainline kernel that gets over 500K lines every release. As maintainer, I can't rely on my testing alone as this is on a limited number of platforms. So my concern is that every kernel release has a significant chance of breaking the ABI, unless we have a better way of identifying potential issues. > Perhaps we can start by whitelisting ioctls by driver? This was also raised by Ruben in private but without a (static) tool to to check, manually going through all the drivers doesn't scale. It's very likely that most drivers don't care, just a get_user/put_user is already handled by these patches. Searching for find_vma() was identifying one such use-case but is this sufficient? Are there other cases we need to explicitly untag a pointer? The other point I'd like feedback on is 2.a above. I see _some_ value into having the user opt-in to this relaxed ABI rather than blinding exposing it to all applications. Dave suggested (in private) a new personality (e.g. PER_LINUX_TBI) inherited by children. It would be the responsibility of the C library to check the current personality bits and only tag pointers on allocation *if* the kernel allowed it. The kernel could provide the AT_FLAGS bit as in Vincenzo's patches if the personality was set but can't set it retrospectively if the user called sys_personality. By default, /sbin/init would not have this personality and libc would not tag pointers, so we can guarantee that your distro boots normally with a new kernel version. We could have an envp that gets caught by /sbin/init so you can pass it on the kernel command line (or a dynamic loader at run-time). But the default should be the current ABI behaviour. We can enforce the current behaviour by having access_ok() check the personality or a TIF flag but we may relax this enforcement at some point in the future as we learn more about the implications of TBI. Thanks.
On Fri, May 17, 2019 at 03:49:31PM +0100, Catalin Marinas wrote: > The tagged pointers (whether hwasan or MTE) should ideally be a > transparent feature for the application writer but I don't think we can > solve it entirely and make it seamless for the multitude of ioctls(). > I'd say you only opt in to such feature if you know what you are doing > and the user code takes care of specific cases like ioctl(), hence the > prctl() proposal even for the hwasan. I'm not sure such a dire view is warrented.. The ioctl situation is not so bad, other than a few special cases, most drivers just take a 'void __user *' and pass it as an argument to some function that accepts a 'void __user *'. sparse et al verify this. As long as the core functions do the right thing the drivers will be OK. The only place things get dicy is if someone casts to unsigned long (ie for vma work) but I think that reflects that our driver facing APIs for VMAs are compatible with static analysis (ie I have no earthly idea why get_user_pages() accepts an unsigned long), not that this is too hard. Jason
On Tue, May 21, 2019 at 07:29:33PM +0100, Catalin Marinas wrote: > On Mon, May 20, 2019 at 04:53:07PM -0700, Evgenii Stepanov wrote: > > On Fri, May 17, 2019 at 7:49 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > IMO (RFC for now), I see two ways forward: > > > [...] > > > 2. Similar shim to the above libc wrapper but inside the kernel > > > (arch/arm64 only; most pointer arguments could be covered with an > > > __SC_CAST similar to the s390 one). There are two differences from > > > what we've discussed in the past: > > > > > > a) this is an opt-in by the user which would have to explicitly call > > > prctl(). If it returns -ENOTSUPP etc., the user won't be allowed > > > to pass tagged pointers to the kernel. This would probably be the > > > responsibility of the C lib to make sure it doesn't tag heap > > > allocations. If the user did not opt-in, the syscalls are routed > > > through the normal path (no untagging address shim). > > > > > > b) ioctl() and other blacklisted syscalls (prctl) will not accept > > > tagged pointers (to be documented in Vicenzo's ABI patches). > > > > The way I see it, a patch that breaks handling of tagged pointers is > > not that different from, say, a patch that adds a wild pointer > > dereference. Both are bugs; the difference is that (a) the former > > breaks a relatively uncommon target and (b) it's arguably an easier > > mistake to make. If MTE adoption goes well, (a) will not be the case > > for long. > > It's also the fact such patch would go unnoticed for a long time until > someone exercises that code path. And when they do, the user would be > pretty much in the dark trying to figure what what went wrong, why a > SIGSEGV or -EFAULT happened. What's worse, we can't even say we fixed > all the places where it matters in the current kernel codebase (ignoring > future patches). So, looking forward a bit, this isn't going to be an ARM-specific issue for long. In fact, I think we shouldn't have arm-specific syscall wrappers in this series: I think untagged_addr() should likely be added at the top-level and have it be a no-op for other architectures. So given this becoming a kernel-wide multi-architecture issue (under the assumption that x86, RISC-V, and others will gain similar TBI or MTE things), we should solve it in a way that we can re-use. We need something that is going to work everywhere. And it needs to be supported by the kernel for the simple reason that the kernel needs to do MTE checks during copy_from_user(): having that information stripped means we lose any userspace-assigned MTE protections if they get handled by the kernel, which is a total non-starter, IMO. As an aside: I think Sparc ADI support in Linux actually side-stepped this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI for kernel code.") I think this was a mistake we should not repeat for arm64 (we do seem to be at least in agreement about this, I think). [1] https://lore.kernel.org/patchwork/patch/654481/ > > This is a bit of a chicken-and-egg problem. In a world where memory > > allocators on one or several popular platforms generate pointers with > > non-zero tags, any such breakage will be caught in testing. > > Unfortunately to reach that state we need the kernel to start > > accepting tagged pointers first, and then hold on for a couple of > > years until userspace catches up. > > Would the kernel also catch up with providing a stable ABI? Because we > have two moving targets. > > On one hand, you have Android or some Linux distro that stick to a > stable kernel version for some time, so they have better chance of > clearing most of the problems. On the other hand, we have mainline > kernel that gets over 500K lines every release. As maintainer, I can't > rely on my testing alone as this is on a limited number of platforms. So > my concern is that every kernel release has a significant chance of > breaking the ABI, unless we have a better way of identifying potential > issues. I just want to make sure I fully understand your concern about this being an ABI break, and I work best with examples. The closest situation I can see would be: - some program has no idea about MTE - malloc() starts returning MTE-tagged addresses - program doesn't break from that change - program uses some syscall that is missing untagged_addr() and fails - kernel has now broken userspace that used to work The trouble I see with this is that it is largely theoretical and requires part of userspace to collude to start using a new CPU feature that tickles a bug in the kernel. As I understand the golden rule, this is a bug in the kernel (a missed ioctl() or such) to be fixed, not a global breaking of some userspace behavior. I feel like I'm missing something about this being seen as an ABI break. The kernel already fails on userspace addresses that have high bits set -- are there things that _depend_ on this failure to operate?
Hi Kees, Thanks for joining the thread ;). On Tue, May 21, 2019 at 05:04:39PM -0700, Kees Cook wrote: > On Tue, May 21, 2019 at 07:29:33PM +0100, Catalin Marinas wrote: > > On Mon, May 20, 2019 at 04:53:07PM -0700, Evgenii Stepanov wrote: > > > On Fri, May 17, 2019 at 7:49 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > > IMO (RFC for now), I see two ways forward: > > > > [...] > > > > 2. Similar shim to the above libc wrapper but inside the kernel > > > > (arch/arm64 only; most pointer arguments could be covered with an > > > > __SC_CAST similar to the s390 one). There are two differences from > > > > what we've discussed in the past: > > > > > > > > a) this is an opt-in by the user which would have to explicitly call > > > > prctl(). If it returns -ENOTSUPP etc., the user won't be allowed > > > > to pass tagged pointers to the kernel. This would probably be the > > > > responsibility of the C lib to make sure it doesn't tag heap > > > > allocations. If the user did not opt-in, the syscalls are routed > > > > through the normal path (no untagging address shim). > > > > > > > > b) ioctl() and other blacklisted syscalls (prctl) will not accept > > > > tagged pointers (to be documented in Vicenzo's ABI patches). > > > > > > The way I see it, a patch that breaks handling of tagged pointers is > > > not that different from, say, a patch that adds a wild pointer > > > dereference. Both are bugs; the difference is that (a) the former > > > breaks a relatively uncommon target and (b) it's arguably an easier > > > mistake to make. If MTE adoption goes well, (a) will not be the case > > > for long. > > > > It's also the fact such patch would go unnoticed for a long time until > > someone exercises that code path. And when they do, the user would be > > pretty much in the dark trying to figure what what went wrong, why a > > SIGSEGV or -EFAULT happened. What's worse, we can't even say we fixed > > all the places where it matters in the current kernel codebase (ignoring > > future patches). > > So, looking forward a bit, this isn't going to be an ARM-specific issue > for long. I do hope so. > In fact, I think we shouldn't have arm-specific syscall wrappers > in this series: I think untagged_addr() should likely be added at the > top-level and have it be a no-op for other architectures. That's what the current patchset does, so we have this as a starting point. Kostya raised another potential issue with the syscall wrappers: with MTE the kernel will be forced to enable the match-all (wildcard) pointers for user space accesses since copy_from_user() would only get a 0 tag. So it has wider implications than just uaccess routines not checking the colour. > So given this becoming a kernel-wide multi-architecture issue (under > the assumption that x86, RISC-V, and others will gain similar TBI or > MTE things), we should solve it in a way that we can re-use. Can we do any better to aid the untagged_addr() placement (e.g. better type annotations, better static analysis)? We have to distinguish between user pointers that may be dereferenced by the kernel (I think almost fully covered with this patchset) and user addresses represented as ulong that may: a) be converted to a user pointer and dereferenced; I think that's the case for many overloaded ulong/u64 arguments b) used for address space management, rbtree look-ups etc. where the tag is no longer relevant and it even gets in the way We tried last year to identify void __user * casts to unsigned long using sparse on the assumption that pointers can be tagged while ulong is about address space management and needs to lose such tag. I think we could have pushed this further. For example, get_user_pages() takes an unsigned long but it is perfectly capable of untagging the address itself. Shall we change its first argument to void __user * (together with all its callers)? find_vma(), OTOH, could untag the address but it doesn't help since vm_start/end don't have such information (that's more about the content or type that the user decided) and the callers check against it. Are there any other places where this matters? These patches tracked down find_vma() as some heuristics but we may need better static analysis to identify other cases. > We need something that is going to work everywhere. And it needs to be > supported by the kernel for the simple reason that the kernel needs to > do MTE checks during copy_from_user(): having that information stripped > means we lose any userspace-assigned MTE protections if they get handled > by the kernel, which is a total non-starter, IMO. Such feedback is welcomed ;). > As an aside: I think Sparc ADI support in Linux actually side-stepped > this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must > be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI > for kernel code.") I think this was a mistake we should not repeat for > arm64 (we do seem to be at least in agreement about this, I think). > > [1] https://lore.kernel.org/patchwork/patch/654481/ I tried to drag the SPARC guys into this discussion but without much success. > > > This is a bit of a chicken-and-egg problem. In a world where memory > > > allocators on one or several popular platforms generate pointers with > > > non-zero tags, any such breakage will be caught in testing. > > > Unfortunately to reach that state we need the kernel to start > > > accepting tagged pointers first, and then hold on for a couple of > > > years until userspace catches up. > > > > Would the kernel also catch up with providing a stable ABI? Because we > > have two moving targets. > > > > On one hand, you have Android or some Linux distro that stick to a > > stable kernel version for some time, so they have better chance of > > clearing most of the problems. On the other hand, we have mainline > > kernel that gets over 500K lines every release. As maintainer, I can't > > rely on my testing alone as this is on a limited number of platforms. So > > my concern is that every kernel release has a significant chance of > > breaking the ABI, unless we have a better way of identifying potential > > issues. > > I just want to make sure I fully understand your concern about this > being an ABI break, and I work best with examples. The closest situation > I can see would be: > > - some program has no idea about MTE Apart from some libraries like libc (and maybe those that handle specific device ioctls), I think most programs should have no idea about MTE. I wouldn't expect programmers to have to change their app just because we have a new feature that colours heap allocations. > - malloc() starts returning MTE-tagged addresses > - program doesn't break from that change > - program uses some syscall that is missing untagged_addr() and fails > - kernel has now broken userspace that used to work That's one aspect though probably more of a case of plugging in a new device (graphics card, network etc.) and the ioctl to the new device doesn't work. The other is that, assuming we reach a point where the kernel entirely supports this relaxed ABI, can we guarantee that it won't break in the future. Let's say some subsequent kernel change (some refactoring) misses out an untagged_addr(). This renders a previously TBI/MTE-capable syscall unusable. Can we rely only on testing? > The trouble I see with this is that it is largely theoretical and > requires part of userspace to collude to start using a new CPU feature > that tickles a bug in the kernel. As I understand the golden rule, > this is a bug in the kernel (a missed ioctl() or such) to be fixed, > not a global breaking of some userspace behavior. Yes, we should follow the rule that it's a kernel bug but it doesn't help the user that a newly installed kernel causes user space to no longer reach a prompt. Hence the proposal of an opt-in via personality (for MTE we would need an explicit opt-in by the user anyway since the top byte is no longer ignored but checked against the allocation tag). > I feel like I'm missing something about this being seen as an ABI > break. The kernel already fails on userspace addresses that have high > bits set -- are there things that _depend_ on this failure to operate? It's about providing a relaxed ABI which allows non-zero top byte and breaking it later inadvertently without having something better in place to analyse the kernel changes. Thanks.
On Tue, May 21, 2019 at 03:48:56PM -0300, Jason Gunthorpe wrote: > On Fri, May 17, 2019 at 03:49:31PM +0100, Catalin Marinas wrote: > > > The tagged pointers (whether hwasan or MTE) should ideally be a > > transparent feature for the application writer but I don't think we can > > solve it entirely and make it seamless for the multitude of ioctls(). > > I'd say you only opt in to such feature if you know what you are doing > > and the user code takes care of specific cases like ioctl(), hence the > > prctl() proposal even for the hwasan. > > I'm not sure such a dire view is warrented.. > > The ioctl situation is not so bad, other than a few special cases, > most drivers just take a 'void __user *' and pass it as an argument to > some function that accepts a 'void __user *'. sparse et al verify > this. > > As long as the core functions do the right thing the drivers will be > OK. > > The only place things get dicy is if someone casts to unsigned long > (ie for vma work) but I think that reflects that our driver facing > APIs for VMAs are compatible with static analysis (ie I have no > earthly idea why get_user_pages() accepts an unsigned long), not that > this is too hard. If multiple people will care about this, perhaps we should try to annotate types more explicitly in SYSCALL_DEFINEx() and ABI data structures. For example, we could have a couple of mutually exclusive modifiers T __object * T __vaddr * (or U __vaddr) In the first case the pointer points to an object (in the C sense) that the call may dereference but not use for any other purpose. In the latter case the pointer (or other type) is a virtual address that the call does not dereference but my do other things with. Also U __really(T) to tell static analysers the real type of pointers smuggled through UAPI disguised as other types (*cough* KVM, etc.) We could gradually make sparse more strict about the presence of annotations and allowed conversions, add get/put_user() variants that demand explicit annotation, etc. find_vma() wouldn't work with a __object pointer, for example. A get_user_pages_for_dereference() might be needed for __object pointers (embodying a promise from the caller that only the object will be dereferenced within the mapped pages). Thoughts? This kind of thing would need widespread buy-in in order to be viable. Cheers ---Dave
On Wed, May 22, 2019 at 3:11 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > Hi Kees, > > Thanks for joining the thread ;). > > On Tue, May 21, 2019 at 05:04:39PM -0700, Kees Cook wrote: > > On Tue, May 21, 2019 at 07:29:33PM +0100, Catalin Marinas wrote: > > > On Mon, May 20, 2019 at 04:53:07PM -0700, Evgenii Stepanov wrote: > > > > On Fri, May 17, 2019 at 7:49 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > > > IMO (RFC for now), I see two ways forward: > > > > > [...] > > > > > 2. Similar shim to the above libc wrapper but inside the kernel > > > > > (arch/arm64 only; most pointer arguments could be covered with an > > > > > __SC_CAST similar to the s390 one). There are two differences from > > > > > what we've discussed in the past: > > > > > > > > > > a) this is an opt-in by the user which would have to explicitly call > > > > > prctl(). If it returns -ENOTSUPP etc., the user won't be allowed > > > > > to pass tagged pointers to the kernel. This would probably be the > > > > > responsibility of the C lib to make sure it doesn't tag heap > > > > > allocations. If the user did not opt-in, the syscalls are routed > > > > > through the normal path (no untagging address shim). > > > > > > > > > > b) ioctl() and other blacklisted syscalls (prctl) will not accept > > > > > tagged pointers (to be documented in Vicenzo's ABI patches). > > > > > > > > The way I see it, a patch that breaks handling of tagged pointers is > > > > not that different from, say, a patch that adds a wild pointer > > > > dereference. Both are bugs; the difference is that (a) the former > > > > breaks a relatively uncommon target and (b) it's arguably an easier > > > > mistake to make. If MTE adoption goes well, (a) will not be the case > > > > for long. > > > > > > It's also the fact such patch would go unnoticed for a long time until > > > someone exercises that code path. And when they do, the user would be > > > pretty much in the dark trying to figure what what went wrong, why a > > > SIGSEGV or -EFAULT happened. What's worse, we can't even say we fixed > > > all the places where it matters in the current kernel codebase (ignoring > > > future patches). > > > > So, looking forward a bit, this isn't going to be an ARM-specific issue > > for long. > > I do hope so. > > > In fact, I think we shouldn't have arm-specific syscall wrappers > > in this series: I think untagged_addr() should likely be added at the > > top-level and have it be a no-op for other architectures. > > That's what the current patchset does, so we have this as a starting > point. Kostya raised another potential issue with the syscall wrappers: > with MTE the kernel will be forced to enable the match-all (wildcard) > pointers for user space accesses since copy_from_user() would only get a > 0 tag. So it has wider implications than just uaccess routines not > checking the colour. > > > So given this becoming a kernel-wide multi-architecture issue (under > > the assumption that x86, RISC-V, and others will gain similar TBI or > > MTE things), we should solve it in a way that we can re-use. > > Can we do any better to aid the untagged_addr() placement (e.g. better > type annotations, better static analysis)? We have to distinguish > between user pointers that may be dereferenced by the kernel (I think > almost fully covered with this patchset) and user addresses represented > as ulong that may: > > a) be converted to a user pointer and dereferenced; I think that's the > case for many overloaded ulong/u64 arguments > > b) used for address space management, rbtree look-ups etc. where the tag > is no longer relevant and it even gets in the way > > We tried last year to identify void __user * casts to unsigned long > using sparse on the assumption that pointers can be tagged while ulong > is about address space management and needs to lose such tag. I think we > could have pushed this further. For example, get_user_pages() takes an > unsigned long but it is perfectly capable of untagging the address > itself. Shall we change its first argument to void __user * (together > with all its callers)? > > find_vma(), OTOH, could untag the address but it doesn't help since > vm_start/end don't have such information (that's more about the content > or type that the user decided) and the callers check against it. > > Are there any other places where this matters? These patches tracked > down find_vma() as some heuristics but we may need better static > analysis to identify other cases. > > > We need something that is going to work everywhere. And it needs to be > > supported by the kernel for the simple reason that the kernel needs to > > do MTE checks during copy_from_user(): having that information stripped > > means we lose any userspace-assigned MTE protections if they get handled > > by the kernel, which is a total non-starter, IMO. > > Such feedback is welcomed ;). > > > As an aside: I think Sparc ADI support in Linux actually side-stepped > > this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must > > be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI > > for kernel code.") I think this was a mistake we should not repeat for > > arm64 (we do seem to be at least in agreement about this, I think). > > > > [1] https://lore.kernel.org/patchwork/patch/654481/ > > I tried to drag the SPARC guys into this discussion but without much > success. > > > > > This is a bit of a chicken-and-egg problem. In a world where memory > > > > allocators on one or several popular platforms generate pointers with > > > > non-zero tags, any such breakage will be caught in testing. > > > > Unfortunately to reach that state we need the kernel to start > > > > accepting tagged pointers first, and then hold on for a couple of > > > > years until userspace catches up. > > > > > > Would the kernel also catch up with providing a stable ABI? Because we > > > have two moving targets. > > > > > > On one hand, you have Android or some Linux distro that stick to a > > > stable kernel version for some time, so they have better chance of > > > clearing most of the problems. On the other hand, we have mainline > > > kernel that gets over 500K lines every release. As maintainer, I can't > > > rely on my testing alone as this is on a limited number of platforms. So > > > my concern is that every kernel release has a significant chance of > > > breaking the ABI, unless we have a better way of identifying potential > > > issues. > > > > I just want to make sure I fully understand your concern about this > > being an ABI break, and I work best with examples. The closest situation > > I can see would be: > > > > - some program has no idea about MTE > > Apart from some libraries like libc (and maybe those that handle > specific device ioctls), I think most programs should have no idea about > MTE. I wouldn't expect programmers to have to change their app just > because we have a new feature that colours heap allocations. obviously i'm biased as a libc maintainer, but... i don't think it helps to move this to libc --- now you just have an extra dependency where to have a guaranteed working system you need to update your kernel and libc together. (or at least update your libc to understand new ioctls etc _before_ you can update your kernel.) > > - malloc() starts returning MTE-tagged addresses > > - program doesn't break from that change > > - program uses some syscall that is missing untagged_addr() and fails > > - kernel has now broken userspace that used to work > > That's one aspect though probably more of a case of plugging in a new > device (graphics card, network etc.) and the ioctl to the new device > doesn't work. > > The other is that, assuming we reach a point where the kernel entirely > supports this relaxed ABI, can we guarantee that it won't break in the > future. Let's say some subsequent kernel change (some refactoring) > misses out an untagged_addr(). This renders a previously TBI/MTE-capable > syscall unusable. Can we rely only on testing? > > > The trouble I see with this is that it is largely theoretical and > > requires part of userspace to collude to start using a new CPU feature > > that tickles a bug in the kernel. As I understand the golden rule, > > this is a bug in the kernel (a missed ioctl() or such) to be fixed, > > not a global breaking of some userspace behavior. > > Yes, we should follow the rule that it's a kernel bug but it doesn't > help the user that a newly installed kernel causes user space to no > longer reach a prompt. Hence the proposal of an opt-in via personality > (for MTE we would need an explicit opt-in by the user anyway since the > top byte is no longer ignored but checked against the allocation tag). but realistically would this actually get used in this way? or would any given system either be MTE or non-MTE. in which case a kernel configuration option would seem to make more sense. (because either way, the hypothetical user basically needs to recompile the kernel to get back on their feet. or all of userspace.) i'm not sure i see this new way for a kernel update to break my system and need to be fixed forward/rolled back as any different from any of the existing ways in which this can happen :-) as an end-user i have to rely on whoever's sending me software updates to test adequately enough that they find the problems. as an end user, there isn't any difference between "my phone rebooted when i tried to take a photo because of a kernel/driver leak", say, and "my phone rebooted when i tried to take a photo because of missing untagging of a pointer passed via ioctl". i suspect you and i have very different people in mind when we say "user" :-) > > I feel like I'm missing something about this being seen as an ABI > > break. The kernel already fails on userspace addresses that have high > > bits set -- are there things that _depend_ on this failure to operate? > > It's about providing a relaxed ABI which allows non-zero top byte and > breaking it later inadvertently without having something better in place > to analyse the kernel changes. > > Thanks. > > -- > Catalin
On Wed, May 22, 2019 at 08:30:21AM -0700, enh wrote: > On Wed, May 22, 2019 at 3:11 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Tue, May 21, 2019 at 05:04:39PM -0700, Kees Cook wrote: > > > I just want to make sure I fully understand your concern about this > > > being an ABI break, and I work best with examples. The closest situation > > > I can see would be: > > > > > > - some program has no idea about MTE > > > > Apart from some libraries like libc (and maybe those that handle > > specific device ioctls), I think most programs should have no idea about > > MTE. I wouldn't expect programmers to have to change their app just > > because we have a new feature that colours heap allocations. > > obviously i'm biased as a libc maintainer, but... > > i don't think it helps to move this to libc --- now you just have an > extra dependency where to have a guaranteed working system you need to > update your kernel and libc together. (or at least update your libc to > understand new ioctls etc _before_ you can update your kernel.) That's not what I meant (or I misunderstood you). If we have a relaxed ABI in the kernel and a libc that returns tagged pointers on malloc() I wouldn't expect the programmer to do anything different in the application code like explicit untagging. Basically the program would continue to run unmodified irrespective of whether you use an old libc without tagged pointers or a new one which tags heap allocations. What I do expect is that the libc checks for the presence of the relaxed ABI, currently proposed as an AT_FLAGS bit (for MTE we'd have a HWCAP_MTE), and only tag the malloc() pointers if the kernel supports the relaxed ABI. As you said, you shouldn't expect that the C library and kernel are upgraded together, so they should be able to work in any new/old version combination. > > > The trouble I see with this is that it is largely theoretical and > > > requires part of userspace to collude to start using a new CPU feature > > > that tickles a bug in the kernel. As I understand the golden rule, > > > this is a bug in the kernel (a missed ioctl() or such) to be fixed, > > > not a global breaking of some userspace behavior. > > > > Yes, we should follow the rule that it's a kernel bug but it doesn't > > help the user that a newly installed kernel causes user space to no > > longer reach a prompt. Hence the proposal of an opt-in via personality > > (for MTE we would need an explicit opt-in by the user anyway since the > > top byte is no longer ignored but checked against the allocation tag). > > but realistically would this actually get used in this way? or would > any given system either be MTE or non-MTE. in which case a kernel > configuration option would seem to make more sense. (because either > way, the hypothetical user basically needs to recompile the kernel to > get back on their feet. or all of userspace.) The two hard requirements I have for supporting any new hardware feature in Linux are (1) a single kernel image binary continues to run on old hardware while making use of the new feature if available and (2) old user space continues to run on new hardware while new user space can take advantage of the new feature. The distro user space usually has a hard requirement that it continues to run on (certain) old hardware. We can't enforce this in the kernel but we offer the option to user space developers of checking feature availability through HWCAP bits. The Android story may be different as you have more control about which kernel configurations are deployed on specific SoCs. I'm looking more from a Linux distro angle where you just get an off-the-shelf OS image and install it on your hardware, either taking advantage of new features or just not using them if the software was not updated. Or, if updated software is installed on old hardware, it would just run. For MTE, we just can't enable it by default since there are applications who use the top byte of a pointer and expect it to be ignored rather than failing with a mismatched tag. Just think of a hwasan compiled binary where TBI is expected to work and you try to run it with MTE turned on. I would also expect the C library or dynamic loader to check for the presence of a HWCAP_MTE bit before starting to tag memory allocations, otherwise it would get SIGILL on the first MTE instruction it tries to execute. > i'm not sure i see this new way for a kernel update to break my system > and need to be fixed forward/rolled back as any different from any of > the existing ways in which this can happen :-) as an end-user i have > to rely on whoever's sending me software updates to test adequately > enough that they find the problems. as an end user, there isn't any > difference between "my phone rebooted when i tried to take a photo > because of a kernel/driver leak", say, and "my phone rebooted when i > tried to take a photo because of missing untagging of a pointer passed > via ioctl". > > i suspect you and i have very different people in mind when we say "user" :-) Indeed, I think we have different users in mind. I didn't mean the end user who doesn't really care which C library version it's running on their phone but rather advanced users (not necessarily kernel developers) that prefer to build their own kernels with every release. We could extend this to kernel developers who don't have time to track down why a new kernel triggers lots of SIGSEGVs during boot.
On Wed, May 22, 2019 at 9:35 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Wed, May 22, 2019 at 08:30:21AM -0700, enh wrote: > > On Wed, May 22, 2019 at 3:11 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > On Tue, May 21, 2019 at 05:04:39PM -0700, Kees Cook wrote: > > > > I just want to make sure I fully understand your concern about this > > > > being an ABI break, and I work best with examples. The closest situation > > > > I can see would be: > > > > > > > > - some program has no idea about MTE > > > > > > Apart from some libraries like libc (and maybe those that handle > > > specific device ioctls), I think most programs should have no idea about > > > MTE. I wouldn't expect programmers to have to change their app just > > > because we have a new feature that colours heap allocations. > > > > obviously i'm biased as a libc maintainer, but... > > > > i don't think it helps to move this to libc --- now you just have an > > extra dependency where to have a guaranteed working system you need to > > update your kernel and libc together. (or at least update your libc to > > understand new ioctls etc _before_ you can update your kernel.) > > That's not what I meant (or I misunderstood you). If we have a relaxed > ABI in the kernel and a libc that returns tagged pointers on malloc() I > wouldn't expect the programmer to do anything different in the > application code like explicit untagging. Basically the program would > continue to run unmodified irrespective of whether you use an old libc > without tagged pointers or a new one which tags heap allocations. > > What I do expect is that the libc checks for the presence of the relaxed > ABI, currently proposed as an AT_FLAGS bit (for MTE we'd have a > HWCAP_MTE), and only tag the malloc() pointers if the kernel supports > the relaxed ABI. As you said, you shouldn't expect that the C library > and kernel are upgraded together, so they should be able to work in any > new/old version combination. yes, that part makes sense. i do think we'd use the AT_FLAGS bit, for exactly this. i was questioning the argument about the ioctl issues, and saying that from my perspective, untagging bugs are not really any different than any other kind of kernel bug. > > > > The trouble I see with this is that it is largely theoretical and > > > > requires part of userspace to collude to start using a new CPU feature > > > > that tickles a bug in the kernel. As I understand the golden rule, > > > > this is a bug in the kernel (a missed ioctl() or such) to be fixed, > > > > not a global breaking of some userspace behavior. > > > > > > Yes, we should follow the rule that it's a kernel bug but it doesn't > > > help the user that a newly installed kernel causes user space to no > > > longer reach a prompt. Hence the proposal of an opt-in via personality > > > (for MTE we would need an explicit opt-in by the user anyway since the > > > top byte is no longer ignored but checked against the allocation tag). > > > > but realistically would this actually get used in this way? or would > > any given system either be MTE or non-MTE. in which case a kernel > > configuration option would seem to make more sense. (because either > > way, the hypothetical user basically needs to recompile the kernel to > > get back on their feet. or all of userspace.) > > The two hard requirements I have for supporting any new hardware feature > in Linux are (1) a single kernel image binary continues to run on old > hardware while making use of the new feature if available and (2) old > user space continues to run on new hardware while new user space can > take advantage of the new feature. > > The distro user space usually has a hard requirement that it continues > to run on (certain) old hardware. We can't enforce this in the kernel > but we offer the option to user space developers of checking feature > availability through HWCAP bits. > > The Android story may be different as you have more control about which > kernel configurations are deployed on specific SoCs. I'm looking more > from a Linux distro angle where you just get an off-the-shelf OS image > and install it on your hardware, either taking advantage of new features > or just not using them if the software was not updated. Or, if updated > software is installed on old hardware, it would just run. > > For MTE, we just can't enable it by default since there are applications > who use the top byte of a pointer and expect it to be ignored rather > than failing with a mismatched tag. Just think of a hwasan compiled > binary where TBI is expected to work and you try to run it with MTE > turned on. > > I would also expect the C library or dynamic loader to check for the > presence of a HWCAP_MTE bit before starting to tag memory allocations, > otherwise it would get SIGILL on the first MTE instruction it tries to > execute. (a bit off-topic, but i thought the MTE instructions were encoded in the no-op space, to avoid this?) > > i'm not sure i see this new way for a kernel update to break my system > > and need to be fixed forward/rolled back as any different from any of > > the existing ways in which this can happen :-) as an end-user i have > > to rely on whoever's sending me software updates to test adequately > > enough that they find the problems. as an end user, there isn't any > > difference between "my phone rebooted when i tried to take a photo > > because of a kernel/driver leak", say, and "my phone rebooted when i > > tried to take a photo because of missing untagging of a pointer passed > > via ioctl". > > > > i suspect you and i have very different people in mind when we say "user" :-) > > Indeed, I think we have different users in mind. I didn't mean the end > user who doesn't really care which C library version it's running on > their phone but rather advanced users (not necessarily kernel > developers) that prefer to build their own kernels with every release. > We could extend this to kernel developers who don't have time to track > down why a new kernel triggers lots of SIGSEGVs during boot. i still don't see how this isn't just a regular testing/CI issue, the same as any other kind of kernel bug. it's already the case that i can get a bad kernel... > -- > Catalin
On Wed, May 22, 2019 at 08:30:21AM -0700, enh wrote: > On Wed, May 22, 2019 at 3:11 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Tue, May 21, 2019 at 05:04:39PM -0700, Kees Cook wrote: > > > I just want to make sure I fully understand your concern about this > > > being an ABI break, and I work best with examples. The closest situation > > > I can see would be: > > > > > > - some program has no idea about MTE > > > > Apart from some libraries like libc (and maybe those that handle > > specific device ioctls), I think most programs should have no idea about > > MTE. I wouldn't expect programmers to have to change their app just > > because we have a new feature that colours heap allocations. Right -- things should Just Work from the application perspective. > obviously i'm biased as a libc maintainer, but... > > i don't think it helps to move this to libc --- now you just have an > extra dependency where to have a guaranteed working system you need to > update your kernel and libc together. (or at least update your libc to > understand new ioctls etc _before_ you can update your kernel.) I think (hope?) we've all agreed that we shouldn't pass this off to userspace. At the very least, it reduces the utility of MTE, and at worst it complicates userspace when this is clearly a kernel/architecture issue. > > > > - malloc() starts returning MTE-tagged addresses > > > - program doesn't break from that change > > > - program uses some syscall that is missing untagged_addr() and fails > > > - kernel has now broken userspace that used to work > > > > That's one aspect though probably more of a case of plugging in a new > > device (graphics card, network etc.) and the ioctl to the new device > > doesn't work. I think MTE will likely be rather like NX/PXN and SMAP/PAN: there will be glitches, and we can disable stuff either via CONFIG or (as is more common now) via a kernel commandline with untagged_addr() containing a static branch, etc. But I actually don't think we need to go this route (see below...) > > The other is that, assuming we reach a point where the kernel entirely > > supports this relaxed ABI, can we guarantee that it won't break in the > > future. Let's say some subsequent kernel change (some refactoring) > > misses out an untagged_addr(). This renders a previously TBI/MTE-capable > > syscall unusable. Can we rely only on testing? > > > > > The trouble I see with this is that it is largely theoretical and > > > requires part of userspace to collude to start using a new CPU feature > > > that tickles a bug in the kernel. As I understand the golden rule, > > > this is a bug in the kernel (a missed ioctl() or such) to be fixed, > > > not a global breaking of some userspace behavior. > > > > Yes, we should follow the rule that it's a kernel bug but it doesn't > > help the user that a newly installed kernel causes user space to no > > longer reach a prompt. Hence the proposal of an opt-in via personality > > (for MTE we would need an explicit opt-in by the user anyway since the > > top byte is no longer ignored but checked against the allocation tag). > > but realistically would this actually get used in this way? or would > any given system either be MTE or non-MTE. in which case a kernel > configuration option would seem to make more sense. (because either > way, the hypothetical user basically needs to recompile the kernel to > get back on their feet. or all of userspace.) Right: the point is to design things so that we do our best to not break userspace that is using the new feature (which I think this series has done well). But supporting MTE/TBI is just like supporting PAN: if someone refactors a driver and swaps a copy_from_user() to a memcpy(), it's going to break under PAN. There will be the same long tail of these bugs like any other, but my sense is that they are small and rare. But I agree: they're going to be pretty weird bugs to track down. The final result, however, will be excellent annotation in the kernel for where userspace addresses get used and people make assumptions about them. The sooner we get the series landed and gain QEMU support (or real hardware), the faster we can hammer out these missed corner-cases. What's the timeline for either of those things, BTW? > > > I feel like I'm missing something about this being seen as an ABI > > > break. The kernel already fails on userspace addresses that have high > > > bits set -- are there things that _depend_ on this failure to operate? > > > > It's about providing a relaxed ABI which allows non-zero top byte and > > breaking it later inadvertently without having something better in place > > to analyse the kernel changes. It sounds like the question is how to switch a process in or out of this ABI (but I don't think that's the real issue: I think it's just a matter of whether or not a process uses tags at all). Doing it at the prctl() level doesn't make sense to me, except maybe to detect MTE support or something. ("Should I tag allocations?") And that state is controlled by the kernel: the kernel does it or it doesn't. If a process wants to not tag, that's also up to the allocator where it can decide not to ask the kernel, and just not tag. Nothing breaks in userspace if a process is NOT tagging and untagged_addr() exists or is missing. This, I think, is the core way this doesn't trip over the golden rule: an old system image will run fine (because it's not tagging). A *new* system may encounter bugs with tagging because it's a new feature: this is The Way Of Things. But we don't break old userspace because old userspace isn't using tags. So the agreement appears to be between the kernel and the allocator. Kernel says "I support this" or not. Telling the allocator to not tag if something breaks sounds like an entirely userspace decision, yes?
On Wed, May 22, 2019 at 12:21 PM Kees Cook <keescook@chromium.org> wrote: > > On Wed, May 22, 2019 at 08:30:21AM -0700, enh wrote: > > On Wed, May 22, 2019 at 3:11 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > On Tue, May 21, 2019 at 05:04:39PM -0700, Kees Cook wrote: > > > > I just want to make sure I fully understand your concern about this > > > > being an ABI break, and I work best with examples. The closest situation > > > > I can see would be: > > > > > > > > - some program has no idea about MTE > > > > > > Apart from some libraries like libc (and maybe those that handle > > > specific device ioctls), I think most programs should have no idea about > > > MTE. I wouldn't expect programmers to have to change their app just > > > because we have a new feature that colours heap allocations. > > Right -- things should Just Work from the application perspective. > > > obviously i'm biased as a libc maintainer, but... > > > > i don't think it helps to move this to libc --- now you just have an > > extra dependency where to have a guaranteed working system you need to > > update your kernel and libc together. (or at least update your libc to > > understand new ioctls etc _before_ you can update your kernel.) > > I think (hope?) we've all agreed that we shouldn't pass this off to > userspace. At the very least, it reduces the utility of MTE, and at worst > it complicates userspace when this is clearly a kernel/architecture issue. > > > > > > > - malloc() starts returning MTE-tagged addresses > > > > - program doesn't break from that change > > > > - program uses some syscall that is missing untagged_addr() and fails > > > > - kernel has now broken userspace that used to work > > > > > > That's one aspect though probably more of a case of plugging in a new > > > device (graphics card, network etc.) and the ioctl to the new device > > > doesn't work. > > I think MTE will likely be rather like NX/PXN and SMAP/PAN: there will > be glitches, and we can disable stuff either via CONFIG or (as is more > common now) via a kernel commandline with untagged_addr() containing a > static branch, etc. But I actually don't think we need to go this route > (see below...) > > > > The other is that, assuming we reach a point where the kernel entirely > > > supports this relaxed ABI, can we guarantee that it won't break in the > > > future. Let's say some subsequent kernel change (some refactoring) > > > misses out an untagged_addr(). This renders a previously TBI/MTE-capable > > > syscall unusable. Can we rely only on testing? > > > > > > > The trouble I see with this is that it is largely theoretical and > > > > requires part of userspace to collude to start using a new CPU feature > > > > that tickles a bug in the kernel. As I understand the golden rule, > > > > this is a bug in the kernel (a missed ioctl() or such) to be fixed, > > > > not a global breaking of some userspace behavior. > > > > > > Yes, we should follow the rule that it's a kernel bug but it doesn't > > > help the user that a newly installed kernel causes user space to no > > > longer reach a prompt. Hence the proposal of an opt-in via personality > > > (for MTE we would need an explicit opt-in by the user anyway since the > > > top byte is no longer ignored but checked against the allocation tag). > > > > but realistically would this actually get used in this way? or would > > any given system either be MTE or non-MTE. in which case a kernel > > configuration option would seem to make more sense. (because either > > way, the hypothetical user basically needs to recompile the kernel to > > get back on their feet. or all of userspace.) > > Right: the point is to design things so that we do our best to not break > userspace that is using the new feature (which I think this series has > done well). But supporting MTE/TBI is just like supporting PAN: if someone > refactors a driver and swaps a copy_from_user() to a memcpy(), it's going > to break under PAN. There will be the same long tail of these bugs like > any other, but my sense is that they are small and rare. But I agree: > they're going to be pretty weird bugs to track down. The final result, > however, will be excellent annotation in the kernel for where userspace > addresses get used and people make assumptions about them. > > The sooner we get the series landed and gain QEMU support (or real > hardware), the faster we can hammer out these missed corner-cases. > What's the timeline for either of those things, BTW? > > > > > I feel like I'm missing something about this being seen as an ABI > > > > break. The kernel already fails on userspace addresses that have high > > > > bits set -- are there things that _depend_ on this failure to operate? > > > > > > It's about providing a relaxed ABI which allows non-zero top byte and > > > breaking it later inadvertently without having something better in place > > > to analyse the kernel changes. > > It sounds like the question is how to switch a process in or out of this > ABI (but I don't think that's the real issue: I think it's just a matter > of whether or not a process uses tags at all). Doing it at the prctl() > level doesn't make sense to me, except maybe to detect MTE support or > something. ("Should I tag allocations?") And that state is controlled > by the kernel: the kernel does it or it doesn't. > > If a process wants to not tag, that's also up to the allocator where > it can decide not to ask the kernel, and just not tag. Nothing breaks in > userspace if a process is NOT tagging and untagged_addr() exists or is > missing. This, I think, is the core way this doesn't trip over the > golden rule: an old system image will run fine (because it's not > tagging). A *new* system may encounter bugs with tagging because it's a > new feature: this is The Way Of Things. But we don't break old userspace > because old userspace isn't using tags. > > So the agreement appears to be between the kernel and the allocator. > Kernel says "I support this" or not. Telling the allocator to not tag if > something breaks sounds like an entirely userspace decision, yes? sgtm, and the AT_FLAGS suggestion sounds fine for our needs in that regard. > -- > Kees Cook
On Wed, May 22, 2019 at 05:35:27PM +0100, Catalin Marinas wrote: > The two hard requirements I have for supporting any new hardware feature > in Linux are (1) a single kernel image binary continues to run on old > hardware while making use of the new feature if available and (2) old > user space continues to run on new hardware while new user space can > take advantage of the new feature. Agreed! And I think the series meets these requirements, yes? > For MTE, we just can't enable it by default since there are applications > who use the top byte of a pointer and expect it to be ignored rather > than failing with a mismatched tag. Just think of a hwasan compiled > binary where TBI is expected to work and you try to run it with MTE > turned on. Ah! Okay, here's the use-case I wasn't thinking of: the concern is TBI conflicting with MTE. And anything that starts using TBI suddenly can't run in the future because it's being interpreted as MTE bits? (Is that the ABI concern? I feel like we got into the weeds about ioctl()s and one-off bugs...) So there needs to be some way to let the kernel know which of three things it should be doing: 1- leaving userspace addresses as-is (present) 2- wiping the top bits before using (this series) 3- wiping the top bits for most things, but retaining them for MTE as needed (the future) I expect MTE to be the "default" in the future. Once a system's libc has grown support for it, everything will be trying to use MTE. TBI will be the special case (but TBI is effectively a prerequisite). AFAICT, the only difference I see between 2 and 3 will be the tag handling in usercopy (all other places will continue to ignore the top bits). Is that accurate? Is "1" a per-process state we want to keep? (I assume not, but rather it is available via no TBI/MTE CONFIG or a boot-time option, if at all?) To choose between "2" and "3", it seems we need a per-process flag to opt into TBI (and out of MTE). For userspace, how would a future binary choose TBI over MTE? If it's a library issue, we can't use an ELF bit, since the choice may be "late" after ELF load (this implies the need for a prctl().) If it's binary-only ("built with HWKASan") then an ELF bit seems sufficient. And without the marking, I'd expect the kernel to enforce MTE when there are high bits. > I would also expect the C library or dynamic loader to check for the > presence of a HWCAP_MTE bit before starting to tag memory allocations, > otherwise it would get SIGILL on the first MTE instruction it tries to > execute. I've got the same question as Elliot: aren't MTE instructions just NOP to older CPUs? I.e. if the CPU (or kernel) don't support it, it just gets entirely ignored: checking is only needed to satisfy curiosity or behavioral expectations. To me, the conflict seems to be using TBI in the face of expecting MTE to be the default state of the future. (But the internal changes needed for TBI -- this series -- is a prereq for MTE.)
On Wed, May 22, 2019 at 1:47 PM Kees Cook <keescook@chromium.org> wrote: > > On Wed, May 22, 2019 at 05:35:27PM +0100, Catalin Marinas wrote: > > The two hard requirements I have for supporting any new hardware feature > > in Linux are (1) a single kernel image binary continues to run on old > > hardware while making use of the new feature if available and (2) old > > user space continues to run on new hardware while new user space can > > take advantage of the new feature. > > Agreed! And I think the series meets these requirements, yes? > > > For MTE, we just can't enable it by default since there are applications > > who use the top byte of a pointer and expect it to be ignored rather > > than failing with a mismatched tag. Just think of a hwasan compiled > > binary where TBI is expected to work and you try to run it with MTE > > turned on. > > Ah! Okay, here's the use-case I wasn't thinking of: the concern is TBI > conflicting with MTE. And anything that starts using TBI suddenly can't > run in the future because it's being interpreted as MTE bits? (Is that > the ABI concern? I feel like we got into the weeds about ioctl()s and > one-off bugs...) > > So there needs to be some way to let the kernel know which of three > things it should be doing: > 1- leaving userspace addresses as-is (present) > 2- wiping the top bits before using (this series) > 3- wiping the top bits for most things, but retaining them for MTE as > needed (the future) > > I expect MTE to be the "default" in the future. Once a system's libc has > grown support for it, everything will be trying to use MTE. TBI will be > the special case (but TBI is effectively a prerequisite). > > AFAICT, the only difference I see between 2 and 3 will be the tag handling > in usercopy (all other places will continue to ignore the top bits). Is > that accurate? > > Is "1" a per-process state we want to keep? (I assume not, but rather it > is available via no TBI/MTE CONFIG or a boot-time option, if at all?) > > To choose between "2" and "3", it seems we need a per-process flag to > opt into TBI (and out of MTE). For userspace, how would a future binary > choose TBI over MTE? If it's a library issue, we can't use an ELF bit, > since the choice may be "late" after ELF load (this implies the need > for a prctl().) If it's binary-only ("built with HWKASan") then an ELF > bit seems sufficient. And without the marking, I'd expect the kernel to > enforce MTE when there are high bits. > > > I would also expect the C library or dynamic loader to check for the > > presence of a HWCAP_MTE bit before starting to tag memory allocations, > > otherwise it would get SIGILL on the first MTE instruction it tries to > > execute. > > I've got the same question as Elliot: aren't MTE instructions just NOP > to older CPUs? I.e. if the CPU (or kernel) don't support it, it just > gets entirely ignored: checking is only needed to satisfy curiosity > or behavioral expectations. MTE instructions are not NOP. Most of them have side effects (changing register values, zeroing memory). This only matters for stack tagging, though. Heap tagging is a runtime decision in the allocator. If an image needs to run on old hardware, it will have to do heap tagging only. > To me, the conflict seems to be using TBI in the face of expecting MTE to > be the default state of the future. (But the internal changes needed > for TBI -- this series -- is a prereq for MTE.) > > -- > Kees Cook
On Wed, May 22, 2019 at 4:03 PM Evgenii Stepanov <eugenis@google.com> wrote: > > On Wed, May 22, 2019 at 1:47 PM Kees Cook <keescook@chromium.org> wrote: > > > > On Wed, May 22, 2019 at 05:35:27PM +0100, Catalin Marinas wrote: > > > The two hard requirements I have for supporting any new hardware feature > > > in Linux are (1) a single kernel image binary continues to run on old > > > hardware while making use of the new feature if available and (2) old > > > user space continues to run on new hardware while new user space can > > > take advantage of the new feature. > > > > Agreed! And I think the series meets these requirements, yes? > > > > > For MTE, we just can't enable it by default since there are applications > > > who use the top byte of a pointer and expect it to be ignored rather > > > than failing with a mismatched tag. Just think of a hwasan compiled > > > binary where TBI is expected to work and you try to run it with MTE > > > turned on. > > > > Ah! Okay, here's the use-case I wasn't thinking of: the concern is TBI > > conflicting with MTE. And anything that starts using TBI suddenly can't > > run in the future because it's being interpreted as MTE bits? (Is that > > the ABI concern? I feel like we got into the weeds about ioctl()s and > > one-off bugs...) > > > > So there needs to be some way to let the kernel know which of three > > things it should be doing: > > 1- leaving userspace addresses as-is (present) > > 2- wiping the top bits before using (this series) > > 3- wiping the top bits for most things, but retaining them for MTE as > > needed (the future) > > > > I expect MTE to be the "default" in the future. Once a system's libc has > > grown support for it, everything will be trying to use MTE. TBI will be > > the special case (but TBI is effectively a prerequisite). > > > > AFAICT, the only difference I see between 2 and 3 will be the tag handling > > in usercopy (all other places will continue to ignore the top bits). Is > > that accurate? > > > > Is "1" a per-process state we want to keep? (I assume not, but rather it > > is available via no TBI/MTE CONFIG or a boot-time option, if at all?) > > > > To choose between "2" and "3", it seems we need a per-process flag to > > opt into TBI (and out of MTE). For userspace, how would a future binary > > choose TBI over MTE? If it's a library issue, we can't use an ELF bit, > > since the choice may be "late" after ELF load (this implies the need > > for a prctl().) If it's binary-only ("built with HWKASan") then an ELF > > bit seems sufficient. And without the marking, I'd expect the kernel to > > enforce MTE when there are high bits. > > > > > I would also expect the C library or dynamic loader to check for the > > > presence of a HWCAP_MTE bit before starting to tag memory allocations, > > > otherwise it would get SIGILL on the first MTE instruction it tries to > > > execute. > > > > I've got the same question as Elliot: aren't MTE instructions just NOP > > to older CPUs? I.e. if the CPU (or kernel) don't support it, it just > > gets entirely ignored: checking is only needed to satisfy curiosity > > or behavioral expectations. > > MTE instructions are not NOP. Most of them have side effects (changing > register values, zeroing memory). no, i meant "they're encoded in a space that was previously no-ops, so running on MTE code on old hardware doesn't cause SIGILL". > This only matters for stack tagging, though. Heap tagging is a runtime > decision in the allocator. > > If an image needs to run on old hardware, it will have to do heap tagging only. > > > To me, the conflict seems to be using TBI in the face of expecting MTE to > > be the default state of the future. (But the internal changes needed > > for TBI -- this series -- is a prereq for MTE.) > > > > -- > > Kees Cook
On Wed, May 22, 2019 at 02:49:28PM +0100, Dave Martin wrote: > On Tue, May 21, 2019 at 03:48:56PM -0300, Jason Gunthorpe wrote: > > On Fri, May 17, 2019 at 03:49:31PM +0100, Catalin Marinas wrote: > > > > > The tagged pointers (whether hwasan or MTE) should ideally be a > > > transparent feature for the application writer but I don't think we can > > > solve it entirely and make it seamless for the multitude of ioctls(). > > > I'd say you only opt in to such feature if you know what you are doing > > > and the user code takes care of specific cases like ioctl(), hence the > > > prctl() proposal even for the hwasan. > > > > I'm not sure such a dire view is warrented.. > > > > The ioctl situation is not so bad, other than a few special cases, > > most drivers just take a 'void __user *' and pass it as an argument to > > some function that accepts a 'void __user *'. sparse et al verify > > this. > > > > As long as the core functions do the right thing the drivers will be > > OK. > > > > The only place things get dicy is if someone casts to unsigned long > > (ie for vma work) but I think that reflects that our driver facing > > APIs for VMAs are compatible with static analysis (ie I have no > > earthly idea why get_user_pages() accepts an unsigned long), not that > > this is too hard. > > If multiple people will care about this, perhaps we should try to > annotate types more explicitly in SYSCALL_DEFINEx() and ABI data > structures. > > For example, we could have a couple of mutually exclusive modifiers > > T __object * > T __vaddr * (or U __vaddr) > > In the first case the pointer points to an object (in the C sense) > that the call may dereference but not use for any other purpose. How would you use these two differently? So far the kernel has worked that __user should tag any pointer that is from userspace and then you can't do anything with it until you transform it into a kernel something > to tell static analysers the real type of pointers smuggled through > UAPI disguised as other types (*cough* KVM, etc.) Yes, that would help alot, we often have to pass pointers through a u64 in the uAPI, and there is no static checker support to make sure they are run through the u64_to_user_ptr() helper. Jason
On Wed, May 22, 2019 at 04:09:31PM -0700, enh wrote: > On Wed, May 22, 2019 at 4:03 PM Evgenii Stepanov <eugenis@google.com> wrote: > > On Wed, May 22, 2019 at 1:47 PM Kees Cook <keescook@chromium.org> wrote: > > > On Wed, May 22, 2019 at 05:35:27PM +0100, Catalin Marinas wrote: > > > > I would also expect the C library or dynamic loader to check for the > > > > presence of a HWCAP_MTE bit before starting to tag memory allocations, > > > > otherwise it would get SIGILL on the first MTE instruction it tries to > > > > execute. > > > > > > I've got the same question as Elliot: aren't MTE instructions just NOP > > > to older CPUs? I.e. if the CPU (or kernel) don't support it, it just > > > gets entirely ignored: checking is only needed to satisfy curiosity > > > or behavioral expectations. > > > > MTE instructions are not NOP. Most of them have side effects (changing > > register values, zeroing memory). > > no, i meant "they're encoded in a space that was previously no-ops, so > running on MTE code on old hardware doesn't cause SIGILL". It does result in SIGILL, there wasn't enough encoding left in the NOP space for old/current CPU implementations (in hindsight, we should have reserved a bigger NOP space). As Evgenii said, the libc needs to be careful when tagging the heap as it would cause a SIGILL if the HWCAP_MTE is not set. The standard application doesn't need to be recompiled as it would not issue MTE colouring instructions, just standard LDR/STR. Stack tagging is problematic if you want to colour each frame individually, the function prologue would need the non-NOP MTE instructions. The best we can do here is just having the (thread) stacks of different colours.
On Wed, May 22, 2019 at 09:20:52PM -0300, Jason Gunthorpe wrote: > On Wed, May 22, 2019 at 02:49:28PM +0100, Dave Martin wrote: > > On Tue, May 21, 2019 at 03:48:56PM -0300, Jason Gunthorpe wrote: > > > On Fri, May 17, 2019 at 03:49:31PM +0100, Catalin Marinas wrote: > > > > > > > The tagged pointers (whether hwasan or MTE) should ideally be a > > > > transparent feature for the application writer but I don't think we can > > > > solve it entirely and make it seamless for the multitude of ioctls(). > > > > I'd say you only opt in to such feature if you know what you are doing > > > > and the user code takes care of specific cases like ioctl(), hence the > > > > prctl() proposal even for the hwasan. > > > > > > I'm not sure such a dire view is warrented.. > > > > > > The ioctl situation is not so bad, other than a few special cases, > > > most drivers just take a 'void __user *' and pass it as an argument to > > > some function that accepts a 'void __user *'. sparse et al verify > > > this. > > > > > > As long as the core functions do the right thing the drivers will be > > > OK. > > > > > > The only place things get dicy is if someone casts to unsigned long > > > (ie for vma work) but I think that reflects that our driver facing > > > APIs for VMAs are compatible with static analysis (ie I have no > > > earthly idea why get_user_pages() accepts an unsigned long), not that > > > this is too hard. > > > > If multiple people will care about this, perhaps we should try to > > annotate types more explicitly in SYSCALL_DEFINEx() and ABI data > > structures. > > > > For example, we could have a couple of mutually exclusive modifiers > > > > T __object * > > T __vaddr * (or U __vaddr) > > > > In the first case the pointer points to an object (in the C sense) > > that the call may dereference but not use for any other purpose. > > How would you use these two differently? > > So far the kernel has worked that __user should tag any pointer that > is from userspace and then you can't do anything with it until you > transform it into a kernel something Ultimately it would be good to disallow casting __object pointers execpt to compatible __object pointer types, and to make get_user etc. demand __object. __vaddr pointers / addresses would be freely castable, but not to __object and so would not be dereferenceable even indirectly. Or that's the general idea. Figuring out a sane set of rules that we could actually check / enforce would require a bit of work. (Whether the __vaddr base type is a pointer or an integer type is probably moot, due to the restrictions we would place on the use of these anyway.) > > to tell static analysers the real type of pointers smuggled through > > UAPI disguised as other types (*cough* KVM, etc.) > > Yes, that would help alot, we often have to pass pointers through a > u64 in the uAPI, and there is no static checker support to make sure > they are run through the u64_to_user_ptr() helper. Agreed. Cheers ---Dave
On Wed, May 22, 2019 at 01:47:36PM -0700, Kees Cook wrote: > On Wed, May 22, 2019 at 05:35:27PM +0100, Catalin Marinas wrote: > > The two hard requirements I have for supporting any new hardware feature > > in Linux are (1) a single kernel image binary continues to run on old > > hardware while making use of the new feature if available and (2) old > > user space continues to run on new hardware while new user space can > > take advantage of the new feature. > > Agreed! And I think the series meets these requirements, yes? Yes. I mentioned this just to make sure people don't expect different kernel builds for different hardware features. There is also the obvious requirement which I didn't mention: new user space continues to run on new/subsequent kernel versions. That's one of the points of contention for this series (ignoring MTE) with the maintainers having to guarantee this without much effort. IOW, do the 500K+ new lines in a subsequent kernel version break any user space out there? I'm only talking about the relaxed TBI ABI. Are the usual LTP, syskaller sufficient? Better static analysis would definitely help. > > For MTE, we just can't enable it by default since there are applications > > who use the top byte of a pointer and expect it to be ignored rather > > than failing with a mismatched tag. Just think of a hwasan compiled > > binary where TBI is expected to work and you try to run it with MTE > > turned on. > > Ah! Okay, here's the use-case I wasn't thinking of: the concern is TBI > conflicting with MTE. And anything that starts using TBI suddenly can't > run in the future because it's being interpreted as MTE bits? (Is that > the ABI concern? That's another aspect to figure out when we add the MTE support. I don't think we'd be able to do this without an explicit opt-in by the user. Or, if we ever want MTE to be turned on by default (i.e. tag checking), even if everything is tagged with 0, we have to disallow TBI for user and this includes hwasan. There were a small number of programs using the TBI (I think some JavaScript compilers tried this). But now we are bringing in the hwasan support and this can be a large user base. Shall we add an ELF note for such binaries that use TBI/hwasan? This series is still required for MTE but we may decide not to relax the ABI blindly, therefore the opt-in (prctl) or personality idea. > I feel like we got into the weeds about ioctl()s and one-off bugs...) This needs solving as well. Most driver developers won't know why untagged_addr() is needed unless we have more rigorous types or type annotations and a tool to check them (we should probably revive the old sparse thread). > So there needs to be some way to let the kernel know which of three > things it should be doing: > 1- leaving userspace addresses as-is (present) > 2- wiping the top bits before using (this series) (I'd say tolerating rather than wiping since get_user still uses the tag in the current series) The current series does not allow any choice between 1 and 2, the default ABI basically becomes option 2. > 3- wiping the top bits for most things, but retaining them for MTE as > needed (the future) 2 and 3 are not entirely compatible as a tagged pointer may be checked against the memory colour by the hardware. So you can't have hwasan binary with MTE enabled. > I expect MTE to be the "default" in the future. Once a system's libc has > grown support for it, everything will be trying to use MTE. TBI will be > the special case (but TBI is effectively a prerequisite). The kernel handling of tagged pointers is indeed a prerequisite. The ABI distinction between the above 2 and 3 needs to be solved. > AFAICT, the only difference I see between 2 and 3 will be the tag handling > in usercopy (all other places will continue to ignore the top bits). Is > that accurate? Yes, mostly (for the kernel). If MTE is enabled by default for a hwasan binary, it will SEGFAULT (either in user space or in kernel uaccess). How does the kernel choose between 2 and 3? > Is "1" a per-process state we want to keep? (I assume not, but rather it > is available via no TBI/MTE CONFIG or a boot-time option, if at all?) Possibly, though not necessarily per process. For testing or if something goes wrong during boot, a command line option with a static label would do. The AT_FLAGS bit needs to be checked by user space. My preference would be per-process. > To choose between "2" and "3", it seems we need a per-process flag to > opt into TBI (and out of MTE). Or leave option 2 the default and get it to opt in to MTE. > For userspace, how would a future binary choose TBI over MTE? If it's > a library issue, we can't use an ELF bit, since the choice may be > "late" after ELF load (this implies the need for a prctl().) If it's > binary-only ("built with HWKASan") then an ELF bit seems sufficient. > And without the marking, I'd expect the kernel to enforce MTE when > there are high bits. The current plan is that a future binary issues a prctl(), after checking the HWCAP_MTE bit (as I replied to Elliot, the MTE instructions are not in the current NOP space). I'd expect this to be done by the libc or dynamic loader under the assumption that the binaries it loads do _not_ use the top pointer byte for anything else. With hwasan compiled objects this gets more confusing (any ELF note to identify them?). (there is also the risk of existing applications using TBI already but I'm not aware of any still using this feature other than hwasan)
On Wed, May 22, 2019 at 12:21:27PM -0700, Kees Cook wrote: > If a process wants to not tag, that's also up to the allocator where > it can decide not to ask the kernel, and just not tag. Nothing breaks in > userspace if a process is NOT tagging and untagged_addr() exists or is > missing. This, I think, is the core way this doesn't trip over the > golden rule: an old system image will run fine (because it's not > tagging). A *new* system may encounter bugs with tagging because it's a > new feature: this is The Way Of Things. But we don't break old userspace > because old userspace isn't using tags. With this series and hwasan binaries, at some point in the future they will be considered "old userspace" and they do use pointer tags which expect to be ignored by both the hardware and the kernel. MTE breaks this assumption.
On Wed, May 22, 2019 at 09:58:22AM -0700, enh wrote: > i was questioning the argument about the ioctl issues, and saying that > from my perspective, untagging bugs are not really any different than > any other kind of kernel bug. Once this series gets in, they are indeed just kernel bugs. What I want is an easier way to identify them, ideally before they trigger in the field. > i still don't see how this isn't just a regular testing/CI issue, the > same as any other kind of kernel bug. it's already the case that i can > get a bad kernel... The testing would have a smaller code coverage in terms of drivers, filesystems than something like a static checker (though one does not exclude the other).
On Thu, May 23, 2019 at 7:45 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Wed, May 22, 2019 at 01:47:36PM -0700, Kees Cook wrote: > > On Wed, May 22, 2019 at 05:35:27PM +0100, Catalin Marinas wrote: > > > The two hard requirements I have for supporting any new hardware feature > > > in Linux are (1) a single kernel image binary continues to run on old > > > hardware while making use of the new feature if available and (2) old > > > user space continues to run on new hardware while new user space can > > > take advantage of the new feature. > > > > Agreed! And I think the series meets these requirements, yes? > > Yes. I mentioned this just to make sure people don't expect different > kernel builds for different hardware features. > > There is also the obvious requirement which I didn't mention: new user > space continues to run on new/subsequent kernel versions. That's one of > the points of contention for this series (ignoring MTE) with the > maintainers having to guarantee this without much effort. IOW, do the > 500K+ new lines in a subsequent kernel version break any user space out > there? I'm only talking about the relaxed TBI ABI. Are the usual LTP, > syskaller sufficient? Better static analysis would definitely help. > > > > For MTE, we just can't enable it by default since there are applications > > > who use the top byte of a pointer and expect it to be ignored rather > > > than failing with a mismatched tag. Just think of a hwasan compiled > > > binary where TBI is expected to work and you try to run it with MTE > > > turned on. > > > > Ah! Okay, here's the use-case I wasn't thinking of: the concern is TBI > > conflicting with MTE. And anything that starts using TBI suddenly can't > > run in the future because it's being interpreted as MTE bits? (Is that > > the ABI concern? > > That's another aspect to figure out when we add the MTE support. I don't > think we'd be able to do this without an explicit opt-in by the user. > > Or, if we ever want MTE to be turned on by default (i.e. tag checking), > even if everything is tagged with 0, we have to disallow TBI for user > and this includes hwasan. There were a small number of programs using > the TBI (I think some JavaScript compilers tried this). But now we are > bringing in the hwasan support and this can be a large user base. Shall > we add an ELF note for such binaries that use TBI/hwasan? > > This series is still required for MTE but we may decide not to relax the > ABI blindly, therefore the opt-in (prctl) or personality idea. > > > I feel like we got into the weeds about ioctl()s and one-off bugs...) > > This needs solving as well. Most driver developers won't know why > untagged_addr() is needed unless we have more rigorous types or type > annotations and a tool to check them (we should probably revive the old > sparse thread). > > > So there needs to be some way to let the kernel know which of three > > things it should be doing: > > 1- leaving userspace addresses as-is (present) > > 2- wiping the top bits before using (this series) > > (I'd say tolerating rather than wiping since get_user still uses the tag > in the current series) > > The current series does not allow any choice between 1 and 2, the > default ABI basically becomes option 2. > > > 3- wiping the top bits for most things, but retaining them for MTE as > > needed (the future) > > 2 and 3 are not entirely compatible as a tagged pointer may be checked > against the memory colour by the hardware. So you can't have hwasan > binary with MTE enabled. > > > I expect MTE to be the "default" in the future. Once a system's libc has > > grown support for it, everything will be trying to use MTE. TBI will be > > the special case (but TBI is effectively a prerequisite). > > The kernel handling of tagged pointers is indeed a prerequisite. The ABI > distinction between the above 2 and 3 needs to be solved. > > > AFAICT, the only difference I see between 2 and 3 will be the tag handling > > in usercopy (all other places will continue to ignore the top bits). Is > > that accurate? > > Yes, mostly (for the kernel). If MTE is enabled by default for a hwasan > binary, it will SEGFAULT (either in user space or in kernel uaccess). > How does the kernel choose between 2 and 3? > > > Is "1" a per-process state we want to keep? (I assume not, but rather it > > is available via no TBI/MTE CONFIG or a boot-time option, if at all?) > > Possibly, though not necessarily per process. For testing or if > something goes wrong during boot, a command line option with a static > label would do. The AT_FLAGS bit needs to be checked by user space. My > preference would be per-process. > > > To choose between "2" and "3", it seems we need a per-process flag to > > opt into TBI (and out of MTE). > > Or leave option 2 the default and get it to opt in to MTE. > > > For userspace, how would a future binary choose TBI over MTE? If it's > > a library issue, we can't use an ELF bit, since the choice may be > > "late" after ELF load (this implies the need for a prctl().) If it's > > binary-only ("built with HWKASan") then an ELF bit seems sufficient. > > And without the marking, I'd expect the kernel to enforce MTE when > > there are high bits. > > The current plan is that a future binary issues a prctl(), after > checking the HWCAP_MTE bit (as I replied to Elliot, the MTE instructions > are not in the current NOP space). I'd expect this to be done by the > libc or dynamic loader under the assumption that the binaries it loads > do _not_ use the top pointer byte for anything else. yeah, it sounds like to support hwasan and MTE, the dynamic linker will need to not use either itself. > With hwasan > compiled objects this gets more confusing (any ELF note to identify > them?). no, at the moment code that wants to know checks for the presence of __hwasan_init. (and bionic doesn't actually look at any ELF notes right now.) but we can always add something if we need to. > (there is also the risk of existing applications using TBI already but > I'm not aware of any still using this feature other than hwasan) > > -- > Catalin
On Thu, May 23, 2019 at 03:44:49PM +0100, Catalin Marinas wrote: > There is also the obvious requirement which I didn't mention: new user > space continues to run on new/subsequent kernel versions. That's one of > the points of contention for this series (ignoring MTE) with the > maintainers having to guarantee this without much effort. IOW, do the > 500K+ new lines in a subsequent kernel version break any user space out > there? I'm only talking about the relaxed TBI ABI. Are the usual LTP, > syskaller sufficient? Better static analysis would definitely help. We can't have perfect coverage of people actively (or accidentally) working to trick static analyzers (and the compiler) into "forgetting" about a __user annotation. We can certainly improve analysis (I see the other sub-thread on that), but I'd like that work not to block this series. What on this front would you be comfortable with? Given it's a new feature isn't it sufficient to have a CONFIG (and/or boot option)? > Or, if we ever want MTE to be turned on by default (i.e. tag checking), > even if everything is tagged with 0, we have to disallow TBI for user > and this includes hwasan. There were a small number of programs using > the TBI (I think some JavaScript compilers tried this). But now we are > bringing in the hwasan support and this can be a large user base. Shall > we add an ELF note for such binaries that use TBI/hwasan? Just to be clear, you say "disallow TBI for user" -- you mean a particular process, yes? i.e. there is no architectural limitation that says once we're using MTE nothing can switch to TBI. i.e. a process is either doing MTE or TBI (or nothing, but that's the same as TBI). > This needs solving as well. Most driver developers won't know why > untagged_addr() is needed unless we have more rigorous types or type > annotations and a tool to check them (we should probably revive the old > sparse thread). This seems like a parallel concern: we can do that separately from this series. Without landing it, is it much harder for people to test it, look for bugs, help with types/annotations, etc. > > So there needs to be some way to let the kernel know which of three > > things it should be doing: > > 1- leaving userspace addresses as-is (present) > > 2- wiping the top bits before using (this series) > > (I'd say tolerating rather than wiping since get_user still uses the tag > in the current series) > > The current series does not allow any choice between 1 and 2, the > default ABI basically becomes option 2. What about testing tools that intentionally insert high bits for syscalls and are _expecting_ them to fail? It seems the TBI series will break them? In that case, do we need to opt into TBI as well? > > 3- wiping the top bits for most things, but retaining them for MTE as > > needed (the future) > > 2 and 3 are not entirely compatible as a tagged pointer may be checked > against the memory colour by the hardware. So you can't have hwasan > binary with MTE enabled. Right: a process must be either MTE or TBI, not both. > > I expect MTE to be the "default" in the future. Once a system's libc has > > grown support for it, everything will be trying to use MTE. TBI will be > > the special case (but TBI is effectively a prerequisite). > > The kernel handling of tagged pointers is indeed a prerequisite. The ABI > distinction between the above 2 and 3 needs to be solved. Does that need solving now or when the MTE series appears? As there is no reason to distinguish between "normal" and "TBI", that doesn't seem to need solving at this point? > > AFAICT, the only difference I see between 2 and 3 will be the tag handling > > in usercopy (all other places will continue to ignore the top bits). Is > > that accurate? > > Yes, mostly (for the kernel). If MTE is enabled by default for a hwasan > binary, it will SEGFAULT (either in user space or in kernel uaccess). > How does the kernel choose between 2 and 3? Right -- that was my question as well. > > Is "1" a per-process state we want to keep? (I assume not, but rather it > > is available via no TBI/MTE CONFIG or a boot-time option, if at all?) > > Possibly, though not necessarily per process. For testing or if > something goes wrong during boot, a command line option with a static > label would do. The AT_FLAGS bit needs to be checked by user space. My > preference would be per-process. I would agree. > > To choose between "2" and "3", it seems we need a per-process flag to > > opt into TBI (and out of MTE). > > Or leave option 2 the default and get it to opt in to MTE. Given that MTE has to "start" at some point in the binary lifetime, I'm fine with opting into MTE. I do expect, though, this will feel redundant in a couple years as everything will immediately opt-in. But, okay, this is therefore an issue for the MTE series. > The current plan is that a future binary issues a prctl(), after > checking the HWCAP_MTE bit (as I replied to Elliot, the MTE instructions > are not in the current NOP space). I'd expect this to be done by the > libc or dynamic loader under the assumption that the binaries it loads > do _not_ use the top pointer byte for anything else. With hwasan > compiled objects this gets more confusing (any ELF note to identify > them?). Okay, sounds fine. > (there is also the risk of existing applications using TBI already but > I'm not aware of any still using this feature other than hwasan) Correct. Alright, the tl;dr appears to be: - you want more assurances that we can find __user stripping in the kernel more easily. (But this seems like a parallel problem.) - we might need to opt in to TBI with a prctl() - all other concerns are for the future MTE series (though it sounds like HWCAP_MTE and a prctl() solve those issues too). Is this accurate? What do you see as the blockers for this series at this point?
On Thu, May 23, 2019 at 11:42:57AM +0100, Dave P Martin wrote: > On Wed, May 22, 2019 at 09:20:52PM -0300, Jason Gunthorpe wrote: > > On Wed, May 22, 2019 at 02:49:28PM +0100, Dave Martin wrote: > > > If multiple people will care about this, perhaps we should try to > > > annotate types more explicitly in SYSCALL_DEFINEx() and ABI data > > > structures. > > > > > > For example, we could have a couple of mutually exclusive modifiers > > > > > > T __object * > > > T __vaddr * (or U __vaddr) > > > > > > In the first case the pointer points to an object (in the C sense) > > > that the call may dereference but not use for any other purpose. > > > > How would you use these two differently? > > > > So far the kernel has worked that __user should tag any pointer that > > is from userspace and then you can't do anything with it until you > > transform it into a kernel something > > Ultimately it would be good to disallow casting __object pointers execpt > to compatible __object pointer types, and to make get_user etc. demand > __object. > > __vaddr pointers / addresses would be freely castable, but not to > __object and so would not be dereferenceable even indirectly. I think it gets too complicated and there are ambiguous cases that we may not be able to distinguish. For example copy_from_user() may be used to copy a user data structure into the kernel, hence __object would work, while the same function may be used to copy opaque data to a file, so __vaddr may be a better option (unless I misunderstood your proposal). We currently have T __user * and I think it's a good starting point. The prior attempt [1] was shut down because it was just hiding the cast using __force. We'd need to work through those cases again and rather start changing the function prototypes to avoid unnecessary casting in the callers (e.g. get_user_pages(void __user *) or come up with a new type) while changing the explicit casting to a macro where it needs to be obvious that we are converting a user pointer, potentially typed (tagged), to an untyped address range. We may need a user_ptr_to_ulong() macro or similar (it seems that we have a u64_to_user_ptr, wasn't aware of it). It may actually not be far from what you suggested but I'd keep the current T __user * to denote possible dereference. [1] https://lore.kernel.org/lkml/5d54526e5ff2e5ad63d0dfdd9ab17cf359afa4f2.1535629099.git.andreyknvl@google.com/
On Thu, May 23, 2019 at 08:44:12AM -0700, enh wrote: > On Thu, May 23, 2019 at 7:45 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Wed, May 22, 2019 at 01:47:36PM -0700, Kees Cook wrote: > > > For userspace, how would a future binary choose TBI over MTE? If it's > > > a library issue, we can't use an ELF bit, since the choice may be > > > "late" after ELF load (this implies the need for a prctl().) If it's > > > binary-only ("built with HWKASan") then an ELF bit seems sufficient. > > > And without the marking, I'd expect the kernel to enforce MTE when > > > there are high bits. > > > > The current plan is that a future binary issues a prctl(), after > > checking the HWCAP_MTE bit (as I replied to Elliot, the MTE instructions > > are not in the current NOP space). I'd expect this to be done by the > > libc or dynamic loader under the assumption that the binaries it loads > > do _not_ use the top pointer byte for anything else. > > yeah, it sounds like to support hwasan and MTE, the dynamic linker > will need to not use either itself. > > > With hwasan compiled objects this gets more confusing (any ELF note > > to identify them?). > > no, at the moment code that wants to know checks for the presence of > __hwasan_init. (and bionic doesn't actually look at any ELF notes > right now.) but we can always add something if we need to. It's a userspace decision to make. In the kernel, we are proposing that bionic calls a prctl() to enable MTE explicitly. It could first check for the presence of __hwasan_init.
On Thu, May 23, 2019 at 09:38:19AM -0700, Kees Cook wrote: > On Thu, May 23, 2019 at 03:44:49PM +0100, Catalin Marinas wrote: > > There is also the obvious requirement which I didn't mention: new user > > space continues to run on new/subsequent kernel versions. That's one of > > the points of contention for this series (ignoring MTE) with the > > maintainers having to guarantee this without much effort. IOW, do the > > 500K+ new lines in a subsequent kernel version break any user space out > > there? I'm only talking about the relaxed TBI ABI. Are the usual LTP, > > syskaller sufficient? Better static analysis would definitely help. > > We can't have perfect coverage of people actively (or accidentally) > working to trick static analyzers (and the compiler) into "forgetting" > about a __user annotation. We can certainly improve analysis (I see > the other sub-thread on that), but I'd like that work not to block > this series. I don't want to block this series either as it's a prerequisite for MTE but at the moment I'm not confident we tracked down all the places where things can break and what the future effort will be to analyse new kernel changes. We've made lots of progress since March last year (I think when the first version was posted) by both identifying new places in the kernel that need untagging and limiting the ranges that user space can tag (e.g. an mmap() on a device should not allow tagged pointers). Can we say we are done or more investigation is needed? > What on this front would you be comfortable with? Given it's a new > feature isn't it sufficient to have a CONFIG (and/or boot option)? I'd rather avoid re-building kernels. A boot option would do, unless we see value in a per-process (inherited) personality or prctl. The per-process option has the slight advantage that I can boot my machine normally while being able to run LTP with a relaxed ABI (and a new libc which tags pointers). I admit I don't have a very strong argument on a per-process opt-in. Another option would be a sysctl control together with a cmdline option. > > Or, if we ever want MTE to be turned on by default (i.e. tag checking), > > even if everything is tagged with 0, we have to disallow TBI for user > > and this includes hwasan. There were a small number of programs using > > the TBI (I think some JavaScript compilers tried this). But now we are > > bringing in the hwasan support and this can be a large user base. Shall > > we add an ELF note for such binaries that use TBI/hwasan? > > Just to be clear, you say "disallow TBI for user" -- you mean a > particular process, yes? i.e. there is no architectural limitation that > says once we're using MTE nothing can switch to TBI. i.e. a process is > either doing MTE or TBI (or nothing, but that's the same as TBI). I may have answered this below. The architecture does not allow TBI (i.e. faults) if MTE is enabled for a process and address range. > > > So there needs to be some way to let the kernel know which of three > > > things it should be doing: > > > 1- leaving userspace addresses as-is (present) > > > 2- wiping the top bits before using (this series) > > > > (I'd say tolerating rather than wiping since get_user still uses the tag > > in the current series) > > > > The current series does not allow any choice between 1 and 2, the > > default ABI basically becomes option 2. > > What about testing tools that intentionally insert high bits for syscalls > and are _expecting_ them to fail? It seems the TBI series will break them? > In that case, do we need to opt into TBI as well? If there are such tools, then we may need a per-process control. It's basically an ABI change. > > > 3- wiping the top bits for most things, but retaining them for MTE as > > > needed (the future) > > > > 2 and 3 are not entirely compatible as a tagged pointer may be checked > > against the memory colour by the hardware. So you can't have hwasan > > binary with MTE enabled. > > Right: a process must be either MTE or TBI, not both. Indeed. > > > I expect MTE to be the "default" in the future. Once a system's libc has > > > grown support for it, everything will be trying to use MTE. TBI will be > > > the special case (but TBI is effectively a prerequisite). > > > > The kernel handling of tagged pointers is indeed a prerequisite. The ABI > > distinction between the above 2 and 3 needs to be solved. > > Does that need solving now or when the MTE series appears? As there is > no reason to distinguish between "normal" and "TBI", that doesn't seem > to need solving at this point? We don't need to solve 2 vs 3 as long as option 3 is an explicit opt-in (prctl) by the user. Otherwise the kernel cannot guess whether the user is using TBI or not (and if it does and still opts in to MTE, we can say it's a user problem ;)). > > > To choose between "2" and "3", it seems we need a per-process flag to > > > opt into TBI (and out of MTE). > > > > Or leave option 2 the default and get it to opt in to MTE. > > Given that MTE has to "start" at some point in the binary lifetime, I'm > fine with opting into MTE. I do expect, though, this will feel redundant > in a couple years as everything will immediately opt-in. But, okay, this > is therefore an issue for the MTE series. Unless Google phases out hwasan soon ;), they may live in parallel for a while, so a choice of TBI vs MTE. > Alright, the tl;dr appears to be: > - you want more assurances that we can find __user stripping in the > kernel more easily. (But this seems like a parallel problem.) Yes, and that we found all (most) cases now. The reason I don't see it as a parallel problem is that, as maintainer, I promise an ABI to user and I'd rather stick to it. I don't want, for example, ncurses to stop working because of some ioctl() rejecting tagged pointers. If it's just the occasional one-off bug I'm fine to deal with it. But no-one convinced me yet that this is the case. As for the generic driver code (filesystems or other subsystems), without some clear direction for developers, together with static checking/sparse, on how user pointers are cast to longs (one example), it would become my responsibility to identify and fix them up with any kernel release. This series is not providing such guidance, just adding untagged_addr() in some places that we think matter. > - we might need to opt in to TBI with a prctl() Yes, although still up for discussion. > - all other concerns are for the future MTE series (though it sounds > like HWCAP_MTE and a prctl() solve those issues too). Indeed. > Is this accurate? What do you see as the blockers for this series at > this point? Well, see my comment on your first bullet point above ;). And thanks for summarising it.
On 5/21/19 6:04 PM, Kees Cook wrote: > As an aside: I think Sparc ADI support in Linux actually side-stepped > this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must > be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI > for kernel code.") I think this was a mistake we should not repeat for > arm64 (we do seem to be at least in agreement about this, I think). > > [1] https://lore.kernel.org/patchwork/patch/654481/ That is a very early version of the sparc ADI patch. Support for tagged addresses in syscalls was added in later versions and is in the patch that is in the kernel. That part "Kernel does not enable ADI for kernel code." is correct. It is a possible enhancement for future. -- Khalid
Hi Khalid, On Thu, May 23, 2019 at 11:51:40AM -0600, Khalid Aziz wrote: > On 5/21/19 6:04 PM, Kees Cook wrote: > > As an aside: I think Sparc ADI support in Linux actually side-stepped > > this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must > > be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI > > for kernel code.") I think this was a mistake we should not repeat for > > arm64 (we do seem to be at least in agreement about this, I think). > > > > [1] https://lore.kernel.org/patchwork/patch/654481/ > > That is a very early version of the sparc ADI patch. Support for tagged > addresses in syscalls was added in later versions and is in the patch > that is in the kernel. I tried to figure out but I'm not familiar with the sparc port. How did you solve the tagged address going into various syscall implementations in the kernel (e.g. sys_write)? Is the tag removed on kernel entry or it ends up deeper in the core code? Thanks.
On Thu, May 23, 2019 at 06:43:46PM +0100, Catalin Marinas wrote: > On Thu, May 23, 2019 at 09:38:19AM -0700, Kees Cook wrote: > > What on this front would you be comfortable with? Given it's a new > > feature isn't it sufficient to have a CONFIG (and/or boot option)? > > I'd rather avoid re-building kernels. A boot option would do, unless we > see value in a per-process (inherited) personality or prctl. The I think I've convinced myself that the normal<->TBI ABI control should be a boot parameter. More below... > > What about testing tools that intentionally insert high bits for syscalls > > and are _expecting_ them to fail? It seems the TBI series will break them? > > In that case, do we need to opt into TBI as well? > > If there are such tools, then we may need a per-process control. It's > basically an ABI change. syzkaller already attempts to randomly inject non-canonical and 0xFFFF....FFFF addresses for user pointers in syscalls in an effort to find bugs like CVE-2017-5123 where waitid() via unchecked put_user() was able to write directly to kernel memory[1]. It seems that using TBI by default and not allowing a switch back to "normal" ABI without a reboot actually means that userspace cannot inject kernel pointers into syscalls any more, since they'll get universally stripped now. Is my understanding correct, here? i.e. exploiting CVE-2017-5123 would be impossible under TBI? If so, then I think we should commit to the TBI ABI and have a boot flag to disable it, but NOT have a process flag, as that would allow attackers to bypass the masking. The only flag should be "TBI or MTE". If so, can I get top byte masking for other architectures too? Like, just to strip high bits off userspace addresses? ;) (Oh, in looking I see this is implemented with sign-extension... why not just a mask? So it'll either be valid userspace address or forced into the non-canonical range?) [1] https://salls.github.io/Linux-Kernel-CVE-2017-5123/ > > Alright, the tl;dr appears to be: > > - you want more assurances that we can find __user stripping in the > > kernel more easily. (But this seems like a parallel problem.) > > Yes, and that we found all (most) cases now. The reason I don't see it > as a parallel problem is that, as maintainer, I promise an ABI to user > and I'd rather stick to it. I don't want, for example, ncurses to stop > working because of some ioctl() rejecting tagged pointers. But this is what I don't understand: it would need to be ncurses _using TBI_, that would stop working (having started to work before, but then regress due to a newly added one-off bug). Regular ncurses will be fine because it's not using TBI. So The Golden Rule isn't violated, and by definition, it's a specific regression caused by some bug (since TBI would have had to have worked _before_ in the situation to be considered a regression now). Which describes the normal path for kernel development... add feature, find corner cases where it doesn't work, fix them, encounter new regressions, fix those, repeat forever. > If it's just the occasional one-off bug I'm fine to deal with it. But > no-one convinced me yet that this is the case. You believe there still to be some systemic cases that haven't been found yet? And even if so -- isn't it better to work on that incrementally? > As for the generic driver code (filesystems or other subsystems), > without some clear direction for developers, together with static > checking/sparse, on how user pointers are cast to longs (one example), > it would become my responsibility to identify and fix them up with any > kernel release. This series is not providing such guidance, just adding > untagged_addr() in some places that we think matter. What about adding a nice bit of .rst documentation that describes the situation and shows how to use untagged_addr(). This is the kind of kernel-wide change that "everyone" needs to know about, and shouldn't be the arch maintainer's sole responsibility to fix. > > - we might need to opt in to TBI with a prctl() > > Yes, although still up for discussion. I think I've talked myself out of it. I say boot param only! :) So what do you say to these next steps: - change untagged_addr() to use a static branch that is controlled with a boot parameter. - add, say, Documentation/core-api/user-addresses.rst to describe proper care and handling of user space pointers with untagged_addr(), with examples based on all the cases seen so far in this series. - continue work to improve static analysis. Thanks for wading through this with me! :)
On 5/23/19 2:11 PM, Catalin Marinas wrote: > Hi Khalid, > > On Thu, May 23, 2019 at 11:51:40AM -0600, Khalid Aziz wrote: >> On 5/21/19 6:04 PM, Kees Cook wrote: >>> As an aside: I think Sparc ADI support in Linux actually side-stepped >>> this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must >>> be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI >>> for kernel code.") I think this was a mistake we should not repeat for >>> arm64 (we do seem to be at least in agreement about this, I think). >>> >>> [1] https://lore.kernel.org/patchwork/patch/654481/ >> >> That is a very early version of the sparc ADI patch. Support for tagged >> addresses in syscalls was added in later versions and is in the patch >> that is in the kernel. > > I tried to figure out but I'm not familiar with the sparc port. How did > you solve the tagged address going into various syscall implementations > in the kernel (e.g. sys_write)? Is the tag removed on kernel entry or it > ends up deeper in the core code? Tag is not removed from the user addresses. Kernel passes tagged addresses to copy_from_user and copy_to_user. MMU checks the tag embedded in the address when kernel accesses userspace addresses. This maintains the ADI integrity even when userspace attempts to access any userspace addresses through system calls. On sparc, access_ok() is defined as: #define access_ok(addr, size) __access_ok((unsigned long)(addr), size) #define __access_ok(addr, size) (__user_ok((addr) & get_fs().seg, (size))) #define __user_ok(addr, size) ({ (void)(size); (addr) < STACK_TOP; }) STACK_TOP for M7 processor (which is the first sparc processor to support ADI) is 0xfff8000000000000UL. Tagged addresses pass the access_ok() check fine. Any tag mismatches that happen during kernel access to userspace addresses are handled by do_mcd_err(). -- Khalid
On 5/23/19 2:11 PM, Catalin Marinas wrote: > Hi Khalid, > > On Thu, May 23, 2019 at 11:51:40AM -0600, Khalid Aziz wrote: >> On 5/21/19 6:04 PM, Kees Cook wrote: >>> As an aside: I think Sparc ADI support in Linux actually side-stepped >>> this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must >>> be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI >>> for kernel code.") I think this was a mistake we should not repeat for >>> arm64 (we do seem to be at least in agreement about this, I think). >>> >>> [1] https://lore.kernel.org/patchwork/patch/654481/ >> >> That is a very early version of the sparc ADI patch. Support for tagged >> addresses in syscalls was added in later versions and is in the patch >> that is in the kernel. > > I tried to figure out but I'm not familiar with the sparc port. How did > you solve the tagged address going into various syscall implementations > in the kernel (e.g. sys_write)? Is the tag removed on kernel entry or it > ends up deeper in the core code? > Another spot I should point out in ADI patch - Tags are not stored in VMAs and IOMMU does not support ADI tags on M7. ADI tags are stripped before userspace addresses are passed to IOMMU in the following snippet from the patch: diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c index 5335ba3c850e..357b6047653a 100644 --- a/arch/sparc/mm/gup.c +++ b/arch/sparc/mm/gup.c @@ -201,6 +202,24 @@ int __get_user_pages_fast(unsigned long start, int nr_pages , int write, pgd_t *pgdp; int nr = 0; +#ifdef CONFIG_SPARC64 + if (adi_capable()) { + long addr = start; + + /* If userspace has passed a versioned address, kernel + * will not find it in the VMAs since it does not store + * the version tags in the list of VMAs. Storing version + * tags in list of VMAs is impractical since they can be + * changed any time from userspace without dropping into + * kernel. Any address search in VMAs will be done with + * non-versioned addresses. Ensure the ADI version bits + * are dropped here by sign extending the last bit before + * ADI bits. IOMMU does not implement version tags. + */ + addr = (addr << (long)adi_nbits()) >> (long)adi_nbits(); + start = addr; + } +#endif start &= PAGE_MASK; addr = start; len = (unsigned long) nr_pages << PAGE_SHIFT; -- Khalid
On Thu, May 23, 2019 at 03:49:05PM -0600, Khalid Aziz wrote: > On 5/23/19 2:11 PM, Catalin Marinas wrote: > > On Thu, May 23, 2019 at 11:51:40AM -0600, Khalid Aziz wrote: > >> On 5/21/19 6:04 PM, Kees Cook wrote: > >>> As an aside: I think Sparc ADI support in Linux actually side-stepped > >>> this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must > >>> be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI > >>> for kernel code.") I think this was a mistake we should not repeat for > >>> arm64 (we do seem to be at least in agreement about this, I think). > >>> > >>> [1] https://lore.kernel.org/patchwork/patch/654481/ > >> > >> That is a very early version of the sparc ADI patch. Support for tagged > >> addresses in syscalls was added in later versions and is in the patch > >> that is in the kernel. > > > > I tried to figure out but I'm not familiar with the sparc port. How did > > you solve the tagged address going into various syscall implementations > > in the kernel (e.g. sys_write)? Is the tag removed on kernel entry or it > > ends up deeper in the core code? > > Another spot I should point out in ADI patch - Tags are not stored in > VMAs and IOMMU does not support ADI tags on M7. ADI tags are stripped > before userspace addresses are passed to IOMMU in the following snippet > from the patch: > > diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c > index 5335ba3c850e..357b6047653a 100644 > --- a/arch/sparc/mm/gup.c > +++ b/arch/sparc/mm/gup.c > @@ -201,6 +202,24 @@ int __get_user_pages_fast(unsigned long start, int > nr_pages > , int write, > pgd_t *pgdp; > int nr = 0; > > +#ifdef CONFIG_SPARC64 > + if (adi_capable()) { > + long addr = start; > + > + /* If userspace has passed a versioned address, kernel > + * will not find it in the VMAs since it does not store > + * the version tags in the list of VMAs. Storing version > + * tags in list of VMAs is impractical since they can be > + * changed any time from userspace without dropping into > + * kernel. Any address search in VMAs will be done with > + * non-versioned addresses. Ensure the ADI version bits > + * are dropped here by sign extending the last bit before > + * ADI bits. IOMMU does not implement version tags. > + */ > + addr = (addr << (long)adi_nbits()) >> (long)adi_nbits(); > + start = addr; > + } > +#endif > start &= PAGE_MASK; > addr = start; > len = (unsigned long) nr_pages << PAGE_SHIFT; Thanks Khalid. I missed that sparc does not enable HAVE_GENERIC_GUP, so you fix this case here. If we add the generic untagged_addr() macro in the generic code, I think sparc can start making use of it rather than open-coding the shifts. There are a few other other places where tags can leak and the core code would get confused (for example, madvise()). I presume your user space doesn't exercise them. On arm64 we plan to just allow the C library to tag any new memory allocation, so those core code paths would need to be covered. And similarly, devices, IOMMU, any DMA would ignore tags.
On Thu, May 23, 2019 at 02:31:16PM -0700, Kees Cook wrote: > On Thu, May 23, 2019 at 06:43:46PM +0100, Catalin Marinas wrote: > > On Thu, May 23, 2019 at 09:38:19AM -0700, Kees Cook wrote: > > > What about testing tools that intentionally insert high bits for syscalls > > > and are _expecting_ them to fail? It seems the TBI series will break them? > > > In that case, do we need to opt into TBI as well? > > > > If there are such tools, then we may need a per-process control. It's > > basically an ABI change. > > syzkaller already attempts to randomly inject non-canonical and > 0xFFFF....FFFF addresses for user pointers in syscalls in an effort to > find bugs like CVE-2017-5123 where waitid() via unchecked put_user() was > able to write directly to kernel memory[1]. > > It seems that using TBI by default and not allowing a switch back to > "normal" ABI without a reboot actually means that userspace cannot inject > kernel pointers into syscalls any more, since they'll get universally > stripped now. Is my understanding correct, here? i.e. exploiting > CVE-2017-5123 would be impossible under TBI? Unless the kernel is also using TBI (khwasan), in which case masking out the top byte wouldn't help. Anyway, as per this discussion, we want the tagged pointer to remain intact all the way to put_user(), so nothing gets masked out. I don't think this would have helped with the waitid() bug. > If so, then I think we should commit to the TBI ABI and have a boot > flag to disable it, but NOT have a process flag, as that would allow > attackers to bypass the masking. The only flag should be "TBI or MTE". > > If so, can I get top byte masking for other architectures too? Like, > just to strip high bits off userspace addresses? ;) But you didn't like my option 2 shim proposal which strips the tag on kernel entry because it lowers the value of MTE ;). > (Oh, in looking I see this is implemented with sign-extension... why > not just a mask? So it'll either be valid userspace address or forced > into the non-canonical range?) The TTBR0/1 selection on memory accesses is done based on bit 63 if TBI is disabled and bit 55 when enabled. Sign-extension allows us to use the same macro for both user and kernel tagged pointers. With MTE tag 0 would be match-all for TTBR0 and 0xff for TTBR1 (so that we don't modify the virtual address space of the kernel; I need to check the latest spec to be sure). Note that the VA space for both user and kernel is limited to 52-bit architecturally so, on access, bits 55..52 must be the same, 0 or 1, otherwise you get a fault. Since the syzkaller tests would also need to set bits 55-52 (actually 48 for kernel addresses, we haven't merged the 52-bit kernel VA patches yet) to hit a valid kernel address, I don't think ignoring the top byte makes much difference to the expected failure scenario. > > > Alright, the tl;dr appears to be: > > > - you want more assurances that we can find __user stripping in the > > > kernel more easily. (But this seems like a parallel problem.) > > > > Yes, and that we found all (most) cases now. The reason I don't see it > > as a parallel problem is that, as maintainer, I promise an ABI to user > > and I'd rather stick to it. I don't want, for example, ncurses to stop > > working because of some ioctl() rejecting tagged pointers. > > But this is what I don't understand: it would need to be ncurses _using > TBI_, that would stop working (having started to work before, but then > regress due to a newly added one-off bug). Regular ncurses will be fine > because it's not using TBI. So The Golden Rule isn't violated, Once we introduced TBI and the libc starts tagging heap allocations, this becomes the new "regular" user space behaviour (i.e. using TBI). So a new bug would break the golden rule. It could also be an old bug that went unnoticed (i.e. you changed the graphics card and its driver gets confused by tagged pointers coming from user-space). > and by definition, it's a specific regression caused by some bug > (since TBI would have had to have worked _before_ in the situation to > be considered a regression now). Which describes the normal path for > kernel development... add feature, find corner cases where it doesn't > work, fix them, encounter new regressions, fix those, repeat forever. > > > If it's just the occasional one-off bug I'm fine to deal with it. But > > no-one convinced me yet that this is the case. > > You believe there still to be some systemic cases that haven't been > found yet? And even if so -- isn't it better to work on that > incrementally? I want some way to systematically identify potential issues (sparse?). Since problems are most likely in drivers, I don't have all devices to check and not all users have the knowledge to track down why something failed. I think we can do this incrementally as long the TBI ABI is not the default. Even better if we made it per process. > > As for the generic driver code (filesystems or other subsystems), > > without some clear direction for developers, together with static > > checking/sparse, on how user pointers are cast to longs (one example), > > it would become my responsibility to identify and fix them up with any > > kernel release. This series is not providing such guidance, just adding > > untagged_addr() in some places that we think matter. > > What about adding a nice bit of .rst documentation that describes the > situation and shows how to use untagged_addr(). This is the kind of > kernel-wide change that "everyone" needs to know about, and shouldn't > be the arch maintainer's sole responsibility to fix. This works (if people read it) but we also need to be more prescriptive in how casting is done and how we differentiate between a pointer for dereference (T __user *) and address space management (usually unsigned long). On top of that, we'd get sparse to check for such conversions and maybe even checkpatch for some low-hanging fruit. > > > - we might need to opt in to TBI with a prctl() > > > > Yes, although still up for discussion. > > I think I've talked myself out of it. I say boot param only! :) I hope I talked you in again ;). I don't see TBI as improving kernel security. > So what do you say to these next steps: > > - change untagged_addr() to use a static branch that is controlled with > a boot parameter. access_ok() as well. > - add, say, Documentation/core-api/user-addresses.rst to describe > proper care and handling of user space pointers with untagged_addr(), > with examples based on all the cases seen so far in this series. We have u64_to_user_ptr(). What about the reverse? And maybe changing get_user_pages() to take void __user *. > - continue work to improve static analysis. Andrew Murray in the ARM kernel team started revisiting the old sparse threads, let's see how it goes.
On Thu, May 23, 2019 at 05:57:09PM +0100, Catalin Marinas wrote: > On Thu, May 23, 2019 at 11:42:57AM +0100, Dave P Martin wrote: > > On Wed, May 22, 2019 at 09:20:52PM -0300, Jason Gunthorpe wrote: > > > On Wed, May 22, 2019 at 02:49:28PM +0100, Dave Martin wrote: > > > > If multiple people will care about this, perhaps we should try to > > > > annotate types more explicitly in SYSCALL_DEFINEx() and ABI data > > > > structures. > > > > > > > > For example, we could have a couple of mutually exclusive modifiers > > > > > > > > T __object * > > > > T __vaddr * (or U __vaddr) > > > > > > > > In the first case the pointer points to an object (in the C sense) > > > > that the call may dereference but not use for any other purpose. > > > > > > How would you use these two differently? > > > > > > So far the kernel has worked that __user should tag any pointer that > > > is from userspace and then you can't do anything with it until you > > > transform it into a kernel something > > > > Ultimately it would be good to disallow casting __object pointers execpt > > to compatible __object pointer types, and to make get_user etc. demand > > __object. > > > > __vaddr pointers / addresses would be freely castable, but not to > > __object and so would not be dereferenceable even indirectly. > > I think it gets too complicated and there are ambiguous cases that we > may not be able to distinguish. For example copy_from_user() may be used > to copy a user data structure into the kernel, hence __object would > work, while the same function may be used to copy opaque data to a file, > so __vaddr may be a better option (unless I misunderstood your > proposal). Can you illustrate? I'm not sure of your point here. > We currently have T __user * and I think it's a good starting point. The > prior attempt [1] was shut down because it was just hiding the cast > using __force. We'd need to work through those cases again and rather > start changing the function prototypes to avoid unnecessary casting in > the callers (e.g. get_user_pages(void __user *) or come up with a new > type) while changing the explicit casting to a macro where it needs to > be obvious that we are converting a user pointer, potentially typed > (tagged), to an untyped address range. We may need a user_ptr_to_ulong() > macro or similar (it seems that we have a u64_to_user_ptr, wasn't aware > of it). > > It may actually not be far from what you suggested but I'd keep the > current T __user * to denote possible dereference. This may not have been clear, but __object and __vaddr would be orthogonal to __user. Since __object and __vaddr strictly constrain what can be done with an lvalue, they could be cast on, but not be cast off without __force. Syscall arguments and pointer in ioctl structs etc. would typically be annotated as __object __user * or __vaddr __user *. Plain old __user * would work as before, but would be more permissive and give static analysers less information to go on. Conversion or use or __object or __vaddr pointers would require specific APIs in the kernel, so that we can be clear about the semantics. Doing things this way would allow migration to annotation of most or all ABI pointers with __object or __vaddr over time, but we wouldn't have to do it all in one go. Problem cases (which won't be the majority) could continue to be plain __user. This does not magically solve the challenges of MTE, but might provide tools that are useful to help avoid bitrot and regressions over time. I agree though that there might be a fair number of of cases that don't conveniently fall under __object or __vaddr semantics. It's hard to know without trying it. _Most_ syscall arguments seem to be fairly obviously one or another though, and this approach has some possibility of scaling to ioctls and other odd interfaces. Cheers ---Dave
On 5/24/19 4:11 AM, Catalin Marinas wrote: > On Thu, May 23, 2019 at 03:49:05PM -0600, Khalid Aziz wrote: >> On 5/23/19 2:11 PM, Catalin Marinas wrote: >>> On Thu, May 23, 2019 at 11:51:40AM -0600, Khalid Aziz wrote: >>>> On 5/21/19 6:04 PM, Kees Cook wrote: >>>>> As an aside: I think Sparc ADI support in Linux actually side-stepped >>>>> this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must >>>>> be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI >>>>> for kernel code.") I think this was a mistake we should not repeat for >>>>> arm64 (we do seem to be at least in agreement about this, I think). >>>>> >>>>> [1] https://lore.kernel.org/patchwork/patch/654481/ >>>> >>>> That is a very early version of the sparc ADI patch. Support for tagged >>>> addresses in syscalls was added in later versions and is in the patch >>>> that is in the kernel. >>> >>> I tried to figure out but I'm not familiar with the sparc port. How did >>> you solve the tagged address going into various syscall implementations >>> in the kernel (e.g. sys_write)? Is the tag removed on kernel entry or it >>> ends up deeper in the core code? >> >> Another spot I should point out in ADI patch - Tags are not stored in >> VMAs and IOMMU does not support ADI tags on M7. ADI tags are stripped >> before userspace addresses are passed to IOMMU in the following snippet >> from the patch: >> >> diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c >> index 5335ba3c850e..357b6047653a 100644 >> --- a/arch/sparc/mm/gup.c >> +++ b/arch/sparc/mm/gup.c >> @@ -201,6 +202,24 @@ int __get_user_pages_fast(unsigned long start, int >> nr_pages >> , int write, >> pgd_t *pgdp; >> int nr = 0; >> >> +#ifdef CONFIG_SPARC64 >> + if (adi_capable()) { >> + long addr = start; >> + >> + /* If userspace has passed a versioned address, kernel >> + * will not find it in the VMAs since it does not store >> + * the version tags in the list of VMAs. Storing version >> + * tags in list of VMAs is impractical since they can be >> + * changed any time from userspace without dropping into >> + * kernel. Any address search in VMAs will be done with >> + * non-versioned addresses. Ensure the ADI version bits >> + * are dropped here by sign extending the last bit before >> + * ADI bits. IOMMU does not implement version tags. >> + */ >> + addr = (addr << (long)adi_nbits()) >> (long)adi_nbits(); >> + start = addr; >> + } >> +#endif >> start &= PAGE_MASK; >> addr = start; >> len = (unsigned long) nr_pages << PAGE_SHIFT; > > Thanks Khalid. I missed that sparc does not enable HAVE_GENERIC_GUP, so > you fix this case here. If we add the generic untagged_addr() macro in > the generic code, I think sparc can start making use of it rather than > open-coding the shifts. Hi Catalin, Yes, that will be good. Right now addresses are untagged in sparc code in only two spots but that will expand as we expand use of tags. Scalabale solution is definitely better. > > There are a few other other places where tags can leak and the core code > would get confused (for example, madvise()). I presume your user space > doesn't exercise them. On arm64 we plan to just allow the C library to > tag any new memory allocation, so those core code paths would need to be > covered. > > And similarly, devices, IOMMU, any DMA would ignore tags. > Right. You are doing lot more with tags than sparc code intended to do. I had looked into implementing just malloc(), mmap() and possibly shmat() in library that automatically tags pointers. Expanding tags to any pointers in C library will require covering lot more paths in kernel. -- Khalid
Thanks for a lot of valuable input! I've read through all the replies and got somewhat lost. What are the changes I need to do to this series? 1. Should I move untagging for memory syscalls back to the generic code so other arches would make use of it as well, or should I keep the arm64 specific memory syscalls wrappers and address the comments on that patch? 2. Should I make untagging opt-in and controlled by a command line argument? 3. Should I "add Documentation/core-api/user-addresses.rst to describe proper care and handling of user space pointers with untagged_addr(), with examples based on all the cases seen so far in this series"? Which examples specifically should it cover? Is there something else?
On Thu, May 23, 2019 at 02:31:16PM -0700, Kees Cook wrote: > syzkaller already attempts to randomly inject non-canonical and > 0xFFFF....FFFF addresses for user pointers in syscalls in an effort to > find bugs like CVE-2017-5123 where waitid() via unchecked put_user() was > able to write directly to kernel memory[1]. > > It seems that using TBI by default and not allowing a switch back to > "normal" ABI without a reboot actually means that userspace cannot inject > kernel pointers into syscalls any more, since they'll get universally > stripped now. Is my understanding correct, here? i.e. exploiting > CVE-2017-5123 would be impossible under TBI? > > If so, then I think we should commit to the TBI ABI and have a boot > flag to disable it, but NOT have a process flag, as that would allow > attackers to bypass the masking. The only flag should be "TBI or MTE". > > If so, can I get top byte masking for other architectures too? Like, > just to strip high bits off userspace addresses? ;) Just for fun, hack/attempt at your idea which should not interfere with TBI. Only briefly tested on arm64 (and the s390 __TYPE_IS_PTR macro is pretty weird ;)): --------------------------8<--------------------------------- diff --git a/arch/s390/include/asm/compat.h b/arch/s390/include/asm/compat.h index 63b46e30b2c3..338455a74eff 100644 --- a/arch/s390/include/asm/compat.h +++ b/arch/s390/include/asm/compat.h @@ -11,9 +11,6 @@ #include <asm-generic/compat.h> -#define __TYPE_IS_PTR(t) (!__builtin_types_compatible_p( \ - typeof(0?(__force t)0:0ULL), u64)) - #define __SC_DELOUSE(t,v) ({ \ BUILD_BUG_ON(sizeof(t) > 4 && !__TYPE_IS_PTR(t)); \ (__force t)(__TYPE_IS_PTR(t) ? ((v) & 0x7fffffff) : (v)); \ diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index e2870fe1be5b..b1b9fe8502da 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -119,8 +119,15 @@ struct io_uring_params; #define __TYPE_IS_L(t) (__TYPE_AS(t, 0L)) #define __TYPE_IS_UL(t) (__TYPE_AS(t, 0UL)) #define __TYPE_IS_LL(t) (__TYPE_AS(t, 0LL) || __TYPE_AS(t, 0ULL)) +#define __TYPE_IS_PTR(t) (!__builtin_types_compatible_p(typeof(0 ? (__force t)0 : 0ULL), u64)) #define __SC_LONG(t, a) __typeof(__builtin_choose_expr(__TYPE_IS_LL(t), 0LL, 0L)) a +#ifdef CONFIG_64BIT +#define __SC_CAST(t, a) (__TYPE_IS_PTR(t) \ + ? (__force t) ((__u64)a & ~(1UL << 55)) \ + : (__force t) a) +#else #define __SC_CAST(t, a) (__force t) a +#endif #define __SC_ARGS(t, a) a #define __SC_TEST(t, a) (void)BUILD_BUG_ON_ZERO(!__TYPE_IS_LL(t) && sizeof(t) > sizeof(long))
On Tue, May 28, 2019 at 04:14:45PM +0200, Andrey Konovalov wrote: > Thanks for a lot of valuable input! I've read through all the replies > and got somewhat lost. What are the changes I need to do to this > series? > > 1. Should I move untagging for memory syscalls back to the generic > code so other arches would make use of it as well, or should I keep > the arm64 specific memory syscalls wrappers and address the comments > on that patch? It absolutely needs to move to common code. Having arch code leads to pointless (often unintentional) semantic difference between architectures, and lots of boilerplate code. Btw, can anyone of the arm crowd or Khalid comment on the linux-mm thread on generic gup where I'm dealing with the pre-existing ADI case of pointer untagging?
On Tue, May 28, 2019 at 11:11:26PM -0700, Christoph Hellwig wrote: > On Tue, May 28, 2019 at 04:14:45PM +0200, Andrey Konovalov wrote: > > Thanks for a lot of valuable input! I've read through all the replies > > and got somewhat lost. What are the changes I need to do to this > > series? > > > > 1. Should I move untagging for memory syscalls back to the generic > > code so other arches would make use of it as well, or should I keep > > the arm64 specific memory syscalls wrappers and address the comments > > on that patch? > > It absolutely needs to move to common code. Having arch code leads > to pointless (often unintentional) semantic difference between > architectures, and lots of boilerplate code. That's fine by me as long as we agree on the semantics (which shouldn't be hard; Khalid already following up). We should probably also move the proposed ABI document [1] into a common place (or part of since we'll have arm64-specifics like prctl() calls to explicitly opt in to memory tagging). [1] https://lore.kernel.org/lkml/20190318163533.26838-1-vincenzo.frascino@arm.com/T/#u
On Tue, May 28, 2019 at 04:14:45PM +0200, Andrey Konovalov wrote: > Thanks for a lot of valuable input! I've read through all the replies > and got somewhat lost. What are the changes I need to do to this > series? > > 1. Should I move untagging for memory syscalls back to the generic > code so other arches would make use of it as well, or should I keep > the arm64 specific memory syscalls wrappers and address the comments > on that patch? Keep them generic again but make sure we get agreement with Khalid on the actual ABI implications for sparc. > 2. Should I make untagging opt-in and controlled by a command line argument? Opt-in, yes, but per task rather than kernel command line option. prctl() is a possibility of opting in. > 3. Should I "add Documentation/core-api/user-addresses.rst to describe > proper care and handling of user space pointers with untagged_addr(), > with examples based on all the cases seen so far in this series"? > Which examples specifically should it cover? I think we can leave 3 for now as not too urgent. What I'd like is for Vincenzo's TBI user ABI document to go into a more common place since we can expand it to cover both sparc and arm64. We'd need an arm64-specific doc as well for things like prctl() and later MTE that sparc may support differently.
On Thu, May 30, 2019 at 7:15 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Tue, May 28, 2019 at 04:14:45PM +0200, Andrey Konovalov wrote: > > Thanks for a lot of valuable input! I've read through all the replies > > and got somewhat lost. What are the changes I need to do to this > > series? > > > > 1. Should I move untagging for memory syscalls back to the generic > > code so other arches would make use of it as well, or should I keep > > the arm64 specific memory syscalls wrappers and address the comments > > on that patch? > > Keep them generic again but make sure we get agreement with Khalid on > the actual ABI implications for sparc. OK, will do. I find it hard to understand what the ABI implications are. I'll post the next version without untagging in brk, mmap, munmap, mremap (for new_address), mmap_pgoff, remap_file_pages, shmat and shmdt. > > > 2. Should I make untagging opt-in and controlled by a command line argument? > > Opt-in, yes, but per task rather than kernel command line option. > prctl() is a possibility of opting in. OK. Should I store a flag somewhere in task_struct? Should it be inheritable on clone? > > > 3. Should I "add Documentation/core-api/user-addresses.rst to describe > > proper care and handling of user space pointers with untagged_addr(), > > with examples based on all the cases seen so far in this series"? > > Which examples specifically should it cover? > > I think we can leave 3 for now as not too urgent. What I'd like is for > Vincenzo's TBI user ABI document to go into a more common place since we > can expand it to cover both sparc and arm64. We'd need an arm64-specific > doc as well for things like prctl() and later MTE that sparc may support > differently. OK. > > -- > Catalin
On Fri, May 31, 2019 at 04:29:10PM +0200, Andrey Konovalov wrote: > On Thu, May 30, 2019 at 7:15 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Tue, May 28, 2019 at 04:14:45PM +0200, Andrey Konovalov wrote: > > > Thanks for a lot of valuable input! I've read through all the replies > > > and got somewhat lost. What are the changes I need to do to this > > > series? > > > > > > 1. Should I move untagging for memory syscalls back to the generic > > > code so other arches would make use of it as well, or should I keep > > > the arm64 specific memory syscalls wrappers and address the comments > > > on that patch? > > > > Keep them generic again but make sure we get agreement with Khalid on > > the actual ABI implications for sparc. > > OK, will do. I find it hard to understand what the ABI implications > are. I'll post the next version without untagging in brk, mmap, > munmap, mremap (for new_address), mmap_pgoff, remap_file_pages, shmat > and shmdt. It's more about not relaxing the ABI to accept non-zero top-byte unless we have a use-case for it. For mmap() etc., I don't think that's needed but if you think otherwise, please raise it. > > > 2. Should I make untagging opt-in and controlled by a command line argument? > > > > Opt-in, yes, but per task rather than kernel command line option. > > prctl() is a possibility of opting in. > > OK. Should I store a flag somewhere in task_struct? Should it be > inheritable on clone? A TIF flag would do but I'd say leave it out for now (default opted in) until we figure out the best way to do this (can be a patch on top of this series). Thanks.
On Fri, May 31, 2019 at 6:20 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Fri, May 31, 2019 at 04:29:10PM +0200, Andrey Konovalov wrote: > > On Thu, May 30, 2019 at 7:15 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > On Tue, May 28, 2019 at 04:14:45PM +0200, Andrey Konovalov wrote: > > > > Thanks for a lot of valuable input! I've read through all the replies > > > > and got somewhat lost. What are the changes I need to do to this > > > > series? > > > > > > > > 1. Should I move untagging for memory syscalls back to the generic > > > > code so other arches would make use of it as well, or should I keep > > > > the arm64 specific memory syscalls wrappers and address the comments > > > > on that patch? > > > > > > Keep them generic again but make sure we get agreement with Khalid on > > > the actual ABI implications for sparc. > > > > OK, will do. I find it hard to understand what the ABI implications > > are. I'll post the next version without untagging in brk, mmap, > > munmap, mremap (for new_address), mmap_pgoff, remap_file_pages, shmat > > and shmdt. > > It's more about not relaxing the ABI to accept non-zero top-byte unless > we have a use-case for it. For mmap() etc., I don't think that's needed > but if you think otherwise, please raise it. > > > > > 2. Should I make untagging opt-in and controlled by a command line argument? > > > > > > Opt-in, yes, but per task rather than kernel command line option. > > > prctl() is a possibility of opting in. > > > > OK. Should I store a flag somewhere in task_struct? Should it be > > inheritable on clone? > > A TIF flag would do but I'd say leave it out for now (default opted in) > until we figure out the best way to do this (can be a patch on top of > this series). You mean leave the whole opt-in/prctl part out? So the only change would be to move untagging for memory syscalls into generic code? > > Thanks. > > -- > Catalin
On Fri, May 31, 2019 at 06:24:06PM +0200, Andrey Konovalov wrote: > On Fri, May 31, 2019 at 6:20 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Fri, May 31, 2019 at 04:29:10PM +0200, Andrey Konovalov wrote: > > > On Thu, May 30, 2019 at 7:15 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > > On Tue, May 28, 2019 at 04:14:45PM +0200, Andrey Konovalov wrote: > > > > > Thanks for a lot of valuable input! I've read through all the replies > > > > > and got somewhat lost. What are the changes I need to do to this > > > > > series? > > > > > > > > > > 1. Should I move untagging for memory syscalls back to the generic > > > > > code so other arches would make use of it as well, or should I keep > > > > > the arm64 specific memory syscalls wrappers and address the comments > > > > > on that patch? > > > > > > > > Keep them generic again but make sure we get agreement with Khalid on > > > > the actual ABI implications for sparc. > > > > > > OK, will do. I find it hard to understand what the ABI implications > > > are. I'll post the next version without untagging in brk, mmap, > > > munmap, mremap (for new_address), mmap_pgoff, remap_file_pages, shmat > > > and shmdt. > > > > It's more about not relaxing the ABI to accept non-zero top-byte unless > > we have a use-case for it. For mmap() etc., I don't think that's needed > > but if you think otherwise, please raise it. > > > > > > > 2. Should I make untagging opt-in and controlled by a command line argument? > > > > > > > > Opt-in, yes, but per task rather than kernel command line option. > > > > prctl() is a possibility of opting in. > > > > > > OK. Should I store a flag somewhere in task_struct? Should it be > > > inheritable on clone? > > > > A TIF flag would do but I'd say leave it out for now (default opted in) > > until we figure out the best way to do this (can be a patch on top of > > this series). > > You mean leave the whole opt-in/prctl part out? So the only change > would be to move untagging for memory syscalls into generic code? Yes (or just wait until next week to see if the discussion settles down).
On Tue, May 28, 2019 at 06:02:45PM +0100, Catalin Marinas wrote: > On Thu, May 23, 2019 at 02:31:16PM -0700, Kees Cook wrote: > > syzkaller already attempts to randomly inject non-canonical and > > 0xFFFF....FFFF addresses for user pointers in syscalls in an effort to > > find bugs like CVE-2017-5123 where waitid() via unchecked put_user() was > > able to write directly to kernel memory[1]. > > > > It seems that using TBI by default and not allowing a switch back to > > "normal" ABI without a reboot actually means that userspace cannot inject > > kernel pointers into syscalls any more, since they'll get universally > > stripped now. Is my understanding correct, here? i.e. exploiting > > CVE-2017-5123 would be impossible under TBI? > > > > If so, then I think we should commit to the TBI ABI and have a boot > > flag to disable it, but NOT have a process flag, as that would allow > > attackers to bypass the masking. The only flag should be "TBI or MTE". > > > > If so, can I get top byte masking for other architectures too? Like, > > just to strip high bits off userspace addresses? ;) > > Just for fun, hack/attempt at your idea which should not interfere with > TBI. Only briefly tested on arm64 (and the s390 __TYPE_IS_PTR macro is > pretty weird ;)): OMG, this is amazing and bonkers. I love it. > --------------------------8<--------------------------------- > diff --git a/arch/s390/include/asm/compat.h b/arch/s390/include/asm/compat.h > index 63b46e30b2c3..338455a74eff 100644 > --- a/arch/s390/include/asm/compat.h > +++ b/arch/s390/include/asm/compat.h > @@ -11,9 +11,6 @@ > > #include <asm-generic/compat.h> > > -#define __TYPE_IS_PTR(t) (!__builtin_types_compatible_p( \ > - typeof(0?(__force t)0:0ULL), u64)) > - > #define __SC_DELOUSE(t,v) ({ \ > BUILD_BUG_ON(sizeof(t) > 4 && !__TYPE_IS_PTR(t)); \ > (__force t)(__TYPE_IS_PTR(t) ? ((v) & 0x7fffffff) : (v)); \ > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h > index e2870fe1be5b..b1b9fe8502da 100644 > --- a/include/linux/syscalls.h > +++ b/include/linux/syscalls.h > @@ -119,8 +119,15 @@ struct io_uring_params; > #define __TYPE_IS_L(t) (__TYPE_AS(t, 0L)) > #define __TYPE_IS_UL(t) (__TYPE_AS(t, 0UL)) > #define __TYPE_IS_LL(t) (__TYPE_AS(t, 0LL) || __TYPE_AS(t, 0ULL)) > +#define __TYPE_IS_PTR(t) (!__builtin_types_compatible_p(typeof(0 ? (__force t)0 : 0ULL), u64)) > #define __SC_LONG(t, a) __typeof(__builtin_choose_expr(__TYPE_IS_LL(t), 0LL, 0L)) a > +#ifdef CONFIG_64BIT > +#define __SC_CAST(t, a) (__TYPE_IS_PTR(t) \ > + ? (__force t) ((__u64)a & ~(1UL << 55)) \ > + : (__force t) a) > +#else > #define __SC_CAST(t, a) (__force t) a > +#endif > #define __SC_ARGS(t, a) a > #define __SC_TEST(t, a) (void)BUILD_BUG_ON_ZERO(!__TYPE_IS_LL(t) && sizeof(t) > sizeof(long)) I'm laughing, I'm crying. Now I have to go look at the disassembly to see how this actually looks. (I mean, it _does_ solve my specific case of the waitid() flaw, but wouldn't help with pointers deeper in structs, etc, though TBI does, I think still help with that. I have to catch back up on the thread...) Anyway, if it's not too expensive it'd block reachability for those kinds of flaws. I wonder what my chances are of actually getting this landed? :) (Though, I guess I need to find a "VA width" macro instead of a raw 55.) Thanks for thinking of __SC_CAST() and __TYPE_IS_PTR() together. Really made my day. :)
=== Overview arm64 has a feature called Top Byte Ignore, which allows to embed pointer tags into the top byte of each pointer. Userspace programs (such as HWASan, a memory debugging tool [1]) might use this feature and pass tagged user pointers to the kernel through syscalls or other interfaces. Right now the kernel is already able to handle user faults with tagged pointers, due to these patches: 1. 81cddd65 ("arm64: traps: fix userspace cache maintenance emulation on a tagged pointer") 2. 7dcd9dd8 ("arm64: hw_breakpoint: fix watchpoint matching for tagged pointers") 3. 276e9327 ("arm64: entry: improve data abort handling of tagged pointers") This patchset extends tagged pointer support to syscall arguments. As per the proposed ABI change [3], tagged pointers are only allowed to be passed to syscalls when they point to memory ranges obtained by anonymous mmap() or sbrk() (see the patchset [3] for more details). For non-memory syscalls this is done by untaging user pointers when the kernel performs pointer checking to find out whether the pointer comes from userspace (most notably in access_ok). The untagging is done only when the pointer is being checked, the tag is preserved as the pointer makes its way through the kernel and stays tagged when the kernel dereferences the pointer when perfoming user memory accesses. Memory syscalls (mmap, mprotect, etc.) don't do user memory accesses but rather deal with memory ranges, and untagged pointers are better suited to describe memory ranges internally. Thus for memory syscalls we untag pointers completely when they enter the kernel. === Other approaches One of the alternative approaches to untagging that was considered is to completely strip the pointer tag as the pointer enters the kernel with some kind of a syscall wrapper, but that won't work with the countless number of different ioctl calls. With this approach we would need a custom wrapper for each ioctl variation, which doesn't seem practical. An alternative approach to untagging pointers in memory syscalls prologues is to inspead allow tagged pointers to be passed to find_vma() (and other vma related functions) and untag them there. Unfortunately, a lot of find_vma() callers then compare or subtract the returned vma start and end fields against the pointer that was being searched. Thus this approach would still require changing all find_vma() callers. === Testing The following testing approaches has been taken to find potential issues with user pointer untagging: 1. Static testing (with sparse [2] and separately with a custom static analyzer based on Clang) to track casts of __user pointers to integer types to find places where untagging needs to be done. 2. Static testing with grep to find parts of the kernel that call find_vma() (and other similar functions) or directly compare against vm_start/vm_end fields of vma. 3. Static testing with grep to find parts of the kernel that compare user pointers with TASK_SIZE or other similar consts and macros. 4. Dynamic testing: adding BUG_ON(has_tag(addr)) to find_vma() and running a modified syzkaller version that passes tagged pointers to the kernel. Based on the results of the testing the requried patches have been added to the patchset. === Notes This patchset is meant to be merged together with "arm64 relaxed ABI" [3]. This patchset is a prerequisite for ARM's memory tagging hardware feature support [4]. This patchset has been merged into the Pixel 2 & 3 kernel trees and is now being used to enable testing of Pixel phones with HWASan. Thanks! [1] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html [2] https://github.com/lucvoo/sparse-dev/commit/5f960cb10f56ec2017c128ef9d16060e0145f292 [3] https://lkml.org/lkml/2019/3/18/819 [4] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a Changes in v15: - Removed unnecessary untagging from radeon_ttm_tt_set_userptr(). - Removed unnecessary untagging from amdgpu_ttm_tt_set_userptr(). - Moved untagging to validate_range() in userfaultfd code. - Moved untagging to ib_uverbs_(re)reg_mr() from mlx4_get_umem_mr(). - Rebased onto 5.1. Changes in v14: - Moved untagging for most memory syscalls to an arm64 specific implementation, instead of doing that in the common code. - Dropped "net, arm64: untag user pointers in tcp_zerocopy_receive", since the provided user pointers don't come from an anonymous map and thus are not covered by this ABI relaxation. - Dropped "kernel, arm64: untag user pointers in prctl_set_mm*". - Moved untagging from __check_mem_type() to tee_shm_register(). - Updated untagging for the amdgpu and radeon drivers to cover the MMU notifier, as suggested by Felix. - Since this ABI relaxation doesn't actually allow tagged instruction pointers, dropped the following patches: - Dropped "tracing, arm64: untag user pointers in seq_print_user_ip". - Dropped "uprobes, arm64: untag user pointers in find_active_uprobe". - Dropped "bpf, arm64: untag user pointers in stack_map_get_build_id_offset". - Rebased onto 5.1-rc7 (37624b58). Changes in v13: - Simplified untagging in tcp_zerocopy_receive(). - Looked at find_vma() callers in drivers/, which allowed to identify a few other places where untagging is needed. - Added patch "mm, arm64: untag user pointers in get_vaddr_frames". - Added patch "drm/amdgpu, arm64: untag user pointers in amdgpu_ttm_tt_get_user_pages". - Added patch "drm/radeon, arm64: untag user pointers in radeon_ttm_tt_pin_userptr". - Added patch "IB/mlx4, arm64: untag user pointers in mlx4_get_umem_mr". - Added patch "media/v4l2-core, arm64: untag user pointers in videobuf_dma_contig_user_get". - Added patch "tee/optee, arm64: untag user pointers in check_mem_type". - Added patch "vfio/type1, arm64: untag user pointers". Changes in v12: - Changed untagging in tcp_zerocopy_receive() to also untag zc->address. - Fixed untagging in prctl_set_mm* to only untag pointers for vma lookups and validity checks, but leave them as is for actual user space accesses. - Updated the link to the v2 of the "arm64 relaxed ABI" patchset [3]. - Dropped the documentation patch, as the "arm64 relaxed ABI" patchset [3] handles that. Changes in v11: - Added "uprobes, arm64: untag user pointers in find_active_uprobe" patch. - Added "bpf, arm64: untag user pointers in stack_map_get_build_id_offset" patch. - Fixed "tracing, arm64: untag user pointers in seq_print_user_ip" to correctly perform subtration with a tagged addr. - Moved untagged_addr() from SYSCALL_DEFINE3(mprotect) and SYSCALL_DEFINE4(pkey_mprotect) to do_mprotect_pkey(). - Moved untagged_addr() definition for other arches from include/linux/memory.h to include/linux/mm.h. - Changed untagging in strn*_user() to perform userspace accesses through tagged pointers. - Updated the documentation to mention that passing tagged pointers to memory syscalls is allowed. - Updated the test to use malloc'ed memory instead of stack memory. Changes in v10: - Added "mm, arm64: untag user pointers passed to memory syscalls" back. - New patch "fs, arm64: untag user pointers in fs/userfaultfd.c". - New patch "net, arm64: untag user pointers in tcp_zerocopy_receive". - New patch "kernel, arm64: untag user pointers in prctl_set_mm*". - New patch "tracing, arm64: untag user pointers in seq_print_user_ip". Changes in v9: - Rebased onto 4.20-rc6. - Used u64 instead of __u64 in type casts in the untagged_addr macro for arm64. - Added braces around (addr) in the untagged_addr macro for other arches. Changes in v8: - Rebased onto 65102238 (4.20-rc1). - Added a note to the cover letter on why syscall wrappers/shims that untag user pointers won't work. - Added a note to the cover letter that this patchset has been merged into the Pixel 2 kernel tree. - Documentation fixes, in particular added a list of syscalls that don't support tagged user pointers. Changes in v7: - Rebased onto 17b57b18 (4.19-rc6). - Dropped the "arm64: untag user address in __do_user_fault" patch, since the existing patches already handle user faults properly. - Dropped the "usb, arm64: untag user addresses in devio" patch, since the passed pointer must come from a vma and therefore be untagged. - Dropped the "arm64: annotate user pointers casts detected by sparse" patch (see the discussion to the replies of the v6 of this patchset). - Added more context to the cover letter. - Updated Documentation/arm64/tagged-pointers.txt. Changes in v6: - Added annotations for user pointer casts found by sparse. - Rebased onto 050cdc6c (4.19-rc1+). Changes in v5: - Added 3 new patches that add untagging to places found with static analysis. - Rebased onto 44c929e1 (4.18-rc8). Changes in v4: - Added a selftest for checking that passing tagged pointers to the kernel succeeds. - Rebased onto 81e97f013 (4.18-rc1+). Changes in v3: - Rebased onto e5c51f30 (4.17-rc6+). - Added linux-arch@ to the list of recipients. Changes in v2: - Rebased onto 2d618bdf (4.17-rc3+). - Removed excessive untagging in gup.c. - Removed untagging pointers returned from __uaccess_mask_ptr. Changes in v1: - Rebased onto 4.17-rc1. Changes in RFC v2: - Added "#ifndef untagged_addr..." fallback in linux/uaccess.h instead of defining it for each arch individually. - Updated Documentation/arm64/tagged-pointers.txt. - Dropped "mm, arm64: untag user addresses in memory syscalls". - Rebased onto 3eb2ce82 (4.16-rc7). Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Andrey Konovalov (17): uaccess: add untagged_addr definition for other arches arm64: untag user pointers in access_ok and __uaccess_mask_ptr lib, arm64: untag user pointers in strn*_user mm: add ksys_ wrappers to memory syscalls arms64: untag user pointers passed to memory syscalls mm: untag user pointers in do_pages_move mm, arm64: untag user pointers in mm/gup.c mm, arm64: untag user pointers in get_vaddr_frames fs, arm64: untag user pointers in copy_mount_options fs, arm64: untag user pointers in fs/userfaultfd.c drm/amdgpu, arm64: untag user pointers drm/radeon, arm64: untag user pointers in radeon_gem_userptr_ioctl IB, arm64: untag user pointers in ib_uverbs_(re)reg_mr() media/v4l2-core, arm64: untag user pointers in videobuf_dma_contig_user_get tee, arm64: untag user pointers in tee_shm_register vfio/type1, arm64: untag user pointers in vaddr_get_pfn selftests, arm64: add a selftest for passing tagged pointers to kernel arch/arm64/include/asm/uaccess.h | 10 +- arch/arm64/kernel/sys.c | 128 ++++++++++++++++- .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 2 + drivers/gpu/drm/radeon/radeon_gem.c | 2 + drivers/infiniband/core/uverbs_cmd.c | 4 + drivers/media/v4l2-core/videobuf-dma-contig.c | 9 +- drivers/tee/tee_shm.c | 1 + drivers/vfio/vfio_iommu_type1.c | 2 + fs/namespace.c | 2 +- fs/userfaultfd.c | 22 +-- include/linux/mm.h | 4 + include/linux/syscalls.h | 22 +++ ipc/shm.c | 7 +- lib/strncpy_from_user.c | 3 +- lib/strnlen_user.c | 3 +- mm/frame_vector.c | 2 + mm/gup.c | 4 + mm/madvise.c | 129 +++++++++--------- mm/mempolicy.c | 21 ++- mm/migrate.c | 1 + mm/mincore.c | 57 ++++---- mm/mlock.c | 20 ++- mm/mmap.c | 30 +++- mm/mprotect.c | 6 +- mm/mremap.c | 27 ++-- mm/msync.c | 35 +++-- tools/testing/selftests/arm64/.gitignore | 1 + tools/testing/selftests/arm64/Makefile | 11 ++ .../testing/selftests/arm64/run_tags_test.sh | 12 ++ tools/testing/selftests/arm64/tags_test.c | 21 +++ 31 files changed, 436 insertions(+), 164 deletions(-) create mode 100644 tools/testing/selftests/arm64/.gitignore create mode 100644 tools/testing/selftests/arm64/Makefile create mode 100755 tools/testing/selftests/arm64/run_tags_test.sh create mode 100644 tools/testing/selftests/arm64/tags_test.c