Message ID | cover.1729440856.git.lorenzo.stoakes@oracle.com (mailing list archive) |
---|---|
Headers | show |
Series | implement lightweight guard pages | expand |
* Lorenzo Stoakes: > Early testing of the prototype version of this code suggests a 5 times > speed up in memory mapping invocations (in conjunction with use of > process_madvise()) and a 13% reduction in VMAs on an entirely idle android > system and unoptimised code. > > We expect with optimisation and a loaded system with a larger number of > guard pages this could significantly increase, but in any case these > numbers are encouraging. > > This way, rather than having separate VMAs specifying which parts of a > range are guard pages, instead we have a VMA spanning the entire range of > memory a user is permitted to access and including ranges which are to be > 'guarded'. > > After mapping this, a user can specify which parts of the range should > result in a fatal signal when accessed. > > By restricting the ability to specify guard pages to memory mapped by > existing VMAs, we can rely on the mappings being torn down when the > mappings are ultimately unmapped and everything works simply as if the > memory were not faulted in, from the point of view of the containing VMAs. We have a glibc (so not Android) dynamic linker bug that asks us to remove PROT_NONE mappings in mapped shared objects: Extra struct vm_area_struct with ---p created when PAGE_SIZE < max-page-size <https://sourceware.org/bugzilla/show_bug.cgi?id=31076> It's slightly different from a guard page because our main goal is to avoid other mappings to end up in those gaps, which has been shown to cause odd application behavior in cases where it happens. If I understand the series correctly, the kernel would not automatically attribute those PROT_NONE gaps to the previous or subsequent mapping. We would have to extend one of the surrounding mapps and apply MADV_POISON to that over-mapped part. That doesn't seem too onerous. Could the ELF loader in the kernel do the same thing for the main executable and the program loader?
On Sun, Oct 20, 2024 at 07:37:54PM +0200, Florian Weimer wrote: > * Lorenzo Stoakes: > > > Early testing of the prototype version of this code suggests a 5 times > > speed up in memory mapping invocations (in conjunction with use of > > process_madvise()) and a 13% reduction in VMAs on an entirely idle android > > system and unoptimised code. > > > > We expect with optimisation and a loaded system with a larger number of > > guard pages this could significantly increase, but in any case these > > numbers are encouraging. > > > > This way, rather than having separate VMAs specifying which parts of a > > range are guard pages, instead we have a VMA spanning the entire range of > > memory a user is permitted to access and including ranges which are to be > > 'guarded'. > > > > After mapping this, a user can specify which parts of the range should > > result in a fatal signal when accessed. > > > > By restricting the ability to specify guard pages to memory mapped by > > existing VMAs, we can rely on the mappings being torn down when the > > mappings are ultimately unmapped and everything works simply as if the > > memory were not faulted in, from the point of view of the containing VMAs. > > We have a glibc (so not Android) dynamic linker bug that asks us to > remove PROT_NONE mappings in mapped shared objects: > > Extra struct vm_area_struct with ---p created when PAGE_SIZE < max-page-size > <https://sourceware.org/bugzilla/show_bug.cgi?id=31076> > > It's slightly different from a guard page because our main goal is to > avoid other mappings to end up in those gaps, which has been shown to > cause odd application behavior in cases where it happens. If I > understand the series correctly, the kernel would not automatically > attribute those PROT_NONE gaps to the previous or subsequent mapping. > We would have to extend one of the surrounding mapps and apply > MADV_POISON to that over-mapped part. That doesn't seem too onerous. > > Could the ELF loader in the kernel do the same thing for the main > executable and the program loader? Currently this implementation is only available for private anonymous memory. We may look at extending it to shmem and file-backed memory in the future but there are a whole host of things to overcome to make that work so it's one step at a time! :)
Hi Florian, Lorenzo, This looks great! What I am VERY interested in is if poisoned pages cause SIGSEGV even when the access happens in the kernel. Namely, the syscall still returns EFAULT, but also SIGSEGV is queued on return to user-space. Catching bad accesses in system calls is currently the weak spot for all user-space bug detection tools (GWP-ASan, libefence, libefency, etc). It's almost possible with userfaultfd, but catching faults in the kernel requires admin capability, so not really an option for generic bug detection tools (+inconvinience of userfaultfd setup/handler). Intercepting all EFAULT from syscalls is not generally possible (w/o ptrace, usually not an option as well), and EFAULT does not always mean a bug. Triggering SIGSEGV even in syscalls would be not just a performance optimization, but a new useful capability that would allow it to catch more bugs. Thanks
On 23.10.24 08:24, Dmitry Vyukov wrote: > Hi Florian, Lorenzo, > > This looks great! > > What I am VERY interested in is if poisoned pages cause SIGSEGV even when > the access happens in the kernel. Namely, the syscall still returns EFAULT, > but also SIGSEGV is queued on return to user-space. > > Catching bad accesses in system calls is currently the weak spot for > all user-space bug detection tools (GWP-ASan, libefence, libefency, etc). > It's almost possible with userfaultfd, but catching faults in the kernel > requires admin capability, so not really an option for generic bug > detection tools (+inconvinience of userfaultfd setup/handler). > Intercepting all EFAULT from syscalls is not generally possible > (w/o ptrace, usually not an option as well), and EFAULT does not always > mean a bug. > > Triggering SIGSEGV even in syscalls would be not just a performance > optimization, but a new useful capability that would allow it to catch > more bugs. Right, we discussed that offline also as a possible extension to the userfaultfd SIGBUS mode. I did not look into that yet, but I was wonder if there could be cases where a different process could trigger that SIGSEGV, and how to (and if to) handle that. For example, ptrace (access_remote_vm()) -> GUP likely can trigger that. I think with userfaultfd() we will currently return -EFAULT, because we call get_user_page_vma_remote() that is not prepared for dropping the mmap lock. Possibly that is the right thing to do, but not sure :) These "remote" faults set FOLL_REMOTE -> FAULT_FLAG_REMOTE, so we might be able to distinguish them and perform different handling.
+cc Linus as reference a commit of his below... On Wed, Oct 23, 2024 at 09:19:03AM +0200, David Hildenbrand wrote: > On 23.10.24 08:24, Dmitry Vyukov wrote: > > Hi Florian, Lorenzo, > > > > This looks great! Thanks! > > > > What I am VERY interested in is if poisoned pages cause SIGSEGV even when > > the access happens in the kernel. Namely, the syscall still returns EFAULT, > > but also SIGSEGV is queued on return to user-space. Yeah we don't in any way. I think adding something like this would be a bit of its own project. The fault andler for this is in handle_pte_marker() in mm/memory.c, where we do the following: /* Hitting a guard page is always a fatal condition. */ if (marker & PTE_MARKER_GUARD) return VM_FAULT_SIGSEGV; So basically we pass this back to whoever invoked the fault. For uaccess we end up in arch-specific code that eventually checks exception tables etc. and for x86-64 that's kernelmode_fixup_or_oops(). There used to be a sig_on_uaccess_err in the x86-specific thread_struct that let you propagate it but Linus pulled it out in commit 02b670c1f88e ("x86/mm: Remove broken vsyscall emulation code from the page fault code") where it was presumably used for vsyscall. Of course we could just get something much higher up the stack to send the signal, but we'd need to be careful we weren't breaking anything doing it... I address GUP below. > > > > Catching bad accesses in system calls is currently the weak spot for > > all user-space bug detection tools (GWP-ASan, libefence, libefency, etc). > > It's almost possible with userfaultfd, but catching faults in the kernel > > requires admin capability, so not really an option for generic bug > > detection tools (+inconvinience of userfaultfd setup/handler). > > Intercepting all EFAULT from syscalls is not generally possible > > (w/o ptrace, usually not an option as well), and EFAULT does not always > > mean a bug. > > > > Triggering SIGSEGV even in syscalls would be not just a performance > > optimization, but a new useful capability that would allow it to catch > > more bugs. > > Right, we discussed that offline also as a possible extension to the > userfaultfd SIGBUS mode. > > I did not look into that yet, but I was wonder if there could be cases where > a different process could trigger that SIGSEGV, and how to (and if to) > handle that. > > For example, ptrace (access_remote_vm()) -> GUP likely can trigger that. I > think with userfaultfd() we will currently return -EFAULT, because we call > get_user_page_vma_remote() that is not prepared for dropping the mmap lock. > Possibly that is the right thing to do, but not sure :) > > These "remote" faults set FOLL_REMOTE -> FAULT_FLAG_REMOTE, so we might be > able to distinguish them and perform different handling. So all GUP will return -EFAULT when hitting guard pages unless we change something. In GUP we handle this in faultin_page(): if (ret & VM_FAULT_ERROR) { int err = vm_fault_to_errno(ret, flags); if (err) return err; BUG(); } And vm_fault_to_errno() is: static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags) { if (vm_fault & VM_FAULT_OOM) return -ENOMEM; if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE)) return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT; if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV)) return -EFAULT; return 0; } Again, I think if we wanted special handling here we'd need to probably propagate that fault from higher up, but yes we'd need to for one definitely not do so if it's remote but I worry about other cases. > > -- > Cheers, > > David / dhildenb > Overall while I sympathise with this, it feels dangerous and a pretty major change, because there'll be something somewhere that will break because it expects faults to be swallowed that we no longer do swallow. So I'd say it'd be something we should defer, but of course it's a highly user-facing change so how easy that would be I don't know. But I definitely don't think a 'introduce the ability to do cheap PROT_NONE guards' series is the place to also fundmentally change how user access page faults are handled within the kernel :)
On Wed, 23 Oct 2024 at 10:12, Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > +cc Linus as reference a commit of his below... > > On Wed, Oct 23, 2024 at 09:19:03AM +0200, David Hildenbrand wrote: > > On 23.10.24 08:24, Dmitry Vyukov wrote: > > > Hi Florian, Lorenzo, > > > > > > This looks great! > > Thanks! > > > > > > > What I am VERY interested in is if poisoned pages cause SIGSEGV even when > > > the access happens in the kernel. Namely, the syscall still returns EFAULT, > > > but also SIGSEGV is queued on return to user-space. > > Yeah we don't in any way. > > I think adding something like this would be a bit of its own project. I can totally understand this. > The fault andler for this is in handle_pte_marker() in mm/memory.c, where > we do the following: > > /* Hitting a guard page is always a fatal condition. */ > if (marker & PTE_MARKER_GUARD) > return VM_FAULT_SIGSEGV; > > So basically we pass this back to whoever invoked the fault. For uaccess we > end up in arch-specific code that eventually checks exception tables > etc. and for x86-64 that's kernelmode_fixup_or_oops(). > > There used to be a sig_on_uaccess_err in the x86-specific thread_struct > that let you propagate it but Linus pulled it out in commit 02b670c1f88e > ("x86/mm: Remove broken vsyscall emulation code from the page fault code") > where it was presumably used for vsyscall. > > Of course we could just get something much higher up the stack to send the > signal, but we'd need to be careful we weren't breaking anything doing > it... Can setting TIF_NOTIFY_RESUME and then doing the rest when returning to userspace help here? > I address GUP below. > > > > > > > Catching bad accesses in system calls is currently the weak spot for > > > all user-space bug detection tools (GWP-ASan, libefence, libefency, etc). > > > It's almost possible with userfaultfd, but catching faults in the kernel > > > requires admin capability, so not really an option for generic bug > > > detection tools (+inconvinience of userfaultfd setup/handler). > > > Intercepting all EFAULT from syscalls is not generally possible > > > (w/o ptrace, usually not an option as well), and EFAULT does not always > > > mean a bug. > > > > > > Triggering SIGSEGV even in syscalls would be not just a performance > > > optimization, but a new useful capability that would allow it to catch > > > more bugs. > > > > Right, we discussed that offline also as a possible extension to the > > userfaultfd SIGBUS mode. > > > > I did not look into that yet, but I was wonder if there could be cases where > > a different process could trigger that SIGSEGV, and how to (and if to) > > handle that. > > > > For example, ptrace (access_remote_vm()) -> GUP likely can trigger that. I > > think with userfaultfd() we will currently return -EFAULT, because we call > > get_user_page_vma_remote() that is not prepared for dropping the mmap lock. > > Possibly that is the right thing to do, but not sure :) That's a good corner case. I guess also process_vm_readv/writev. Not triggering the signal in these cases looks like the right thing to do. > > These "remote" faults set FOLL_REMOTE -> FAULT_FLAG_REMOTE, so we might be > > able to distinguish them and perform different handling. > > So all GUP will return -EFAULT when hitting guard pages unless we change > something. > > In GUP we handle this in faultin_page(): > > if (ret & VM_FAULT_ERROR) { > int err = vm_fault_to_errno(ret, flags); > > if (err) > return err; > BUG(); > } > > And vm_fault_to_errno() is: > > static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags) > { > if (vm_fault & VM_FAULT_OOM) > return -ENOMEM; > if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE)) > return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT; > if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV)) > return -EFAULT; > return 0; > } > > Again, I think if we wanted special handling here we'd need to probably > propagate that fault from higher up, but yes we'd need to for one > definitely not do so if it's remote but I worry about other cases. > > > > > -- > > Cheers, > > > > David / dhildenb > > > > Overall while I sympathise with this, it feels dangerous and a pretty major > change, because there'll be something somewhere that will break because it > expects faults to be swallowed that we no longer do swallow. > > So I'd say it'd be something we should defer, but of course it's a highly > user-facing change so how easy that would be I don't know. > > But I definitely don't think a 'introduce the ability to do cheap PROT_NONE > guards' series is the place to also fundmentally change how user access > page faults are handled within the kernel :) Will delivering signals on kernel access be a backwards compatible change? Or will we need a different API? MADV_GUARD_POISON_KERNEL? It's just somewhat painful to detect/update all userspace if we add this feature in future. Can we say signal delivery on kernel accesses is unspecified?
On 10/23/24 10:56, Dmitry Vyukov wrote: >> >> Overall while I sympathise with this, it feels dangerous and a pretty major >> change, because there'll be something somewhere that will break because it >> expects faults to be swallowed that we no longer do swallow. >> >> So I'd say it'd be something we should defer, but of course it's a highly >> user-facing change so how easy that would be I don't know. >> >> But I definitely don't think a 'introduce the ability to do cheap PROT_NONE >> guards' series is the place to also fundmentally change how user access >> page faults are handled within the kernel :) > > Will delivering signals on kernel access be a backwards compatible > change? Or will we need a different API? MADV_GUARD_POISON_KERNEL? > It's just somewhat painful to detect/update all userspace if we add > this feature in future. Can we say signal delivery on kernel accesses > is unspecified? Would adding signal delivery to guard PTEs only help enough the ASAN etc usecase? Wouldn't it be instead possible to add some prctl to opt-in the whole ASANized process to deliver all existing segfaults as signals instead of -EFAULT ?
On 23.10.24 11:06, Vlastimil Babka wrote: > On 10/23/24 10:56, Dmitry Vyukov wrote: >>> >>> Overall while I sympathise with this, it feels dangerous and a pretty major >>> change, because there'll be something somewhere that will break because it >>> expects faults to be swallowed that we no longer do swallow. >>> >>> So I'd say it'd be something we should defer, but of course it's a highly >>> user-facing change so how easy that would be I don't know. >>> >>> But I definitely don't think a 'introduce the ability to do cheap PROT_NONE >>> guards' series is the place to also fundmentally change how user access >>> page faults are handled within the kernel :) >> >> Will delivering signals on kernel access be a backwards compatible >> change? Or will we need a different API? MADV_GUARD_POISON_KERNEL? >> It's just somewhat painful to detect/update all userspace if we add >> this feature in future. Can we say signal delivery on kernel accesses >> is unspecified? > > Would adding signal delivery to guard PTEs only help enough the ASAN etc > usecase? Wouldn't it be instead possible to add some prctl to opt-in the > whole ASANized process to deliver all existing segfaults as signals instead > of -EFAULT ? Not sure if it is an "instead", you might have to deliver the signal in addition to letting the syscall fail (not that I would be an expert on signal delivery :D ). prctl sounds better, or some way to configure the behavior on VMA ranges; otherwise we would need yet another marker, which is not the end of the world but would make it slightly more confusing.
On Wed, 23 Oct 2024 at 11:06, Vlastimil Babka <vbabka@suse.cz> wrote: > > On 10/23/24 10:56, Dmitry Vyukov wrote: > >> > >> Overall while I sympathise with this, it feels dangerous and a pretty major > >> change, because there'll be something somewhere that will break because it > >> expects faults to be swallowed that we no longer do swallow. > >> > >> So I'd say it'd be something we should defer, but of course it's a highly > >> user-facing change so how easy that would be I don't know. > >> > >> But I definitely don't think a 'introduce the ability to do cheap PROT_NONE > >> guards' series is the place to also fundmentally change how user access > >> page faults are handled within the kernel :) > > > > Will delivering signals on kernel access be a backwards compatible > > change? Or will we need a different API? MADV_GUARD_POISON_KERNEL? > > It's just somewhat painful to detect/update all userspace if we add > > this feature in future. Can we say signal delivery on kernel accesses > > is unspecified? > > Would adding signal delivery to guard PTEs only help enough the ASAN etc > usecase? Wouldn't it be instead possible to add some prctl to opt-in the > whole ASANized process to deliver all existing segfaults as signals instead > of -EFAULT ? ASAN per se does not need this (it does not use page protection). However, if you mean bug detection tools in general, then, yes, that's what I had in mind. There are also things like stack guard pages in libc that would benefit from that as well. But I observed that some libraries intentionally use EFAULT to probe for memory readability, i.e. use some cheap syscall to probe memory before reading it. So changing behavior globally may not work.
On Wed, Oct 23, 2024 at 11:13:47AM +0200, David Hildenbrand wrote: > On 23.10.24 11:06, Vlastimil Babka wrote: > > On 10/23/24 10:56, Dmitry Vyukov wrote: > > > > > > > > Overall while I sympathise with this, it feels dangerous and a pretty major > > > > change, because there'll be something somewhere that will break because it > > > > expects faults to be swallowed that we no longer do swallow. > > > > > > > > So I'd say it'd be something we should defer, but of course it's a highly > > > > user-facing change so how easy that would be I don't know. > > > > > > > > But I definitely don't think a 'introduce the ability to do cheap PROT_NONE > > > > guards' series is the place to also fundmentally change how user access > > > > page faults are handled within the kernel :) > > > > > > Will delivering signals on kernel access be a backwards compatible > > > change? Or will we need a different API? MADV_GUARD_POISON_KERNEL? > > > It's just somewhat painful to detect/update all userspace if we add > > > this feature in future. Can we say signal delivery on kernel accesses > > > is unspecified? > > > > Would adding signal delivery to guard PTEs only help enough the ASAN etc > > usecase? Wouldn't it be instead possible to add some prctl to opt-in the > > whole ASANized process to deliver all existing segfaults as signals instead > > of -EFAULT ? > > Not sure if it is an "instead", you might have to deliver the signal in > addition to letting the syscall fail (not that I would be an expert on > signal delivery :D ). > > prctl sounds better, or some way to configure the behavior on VMA ranges; > otherwise we would need yet another marker, which is not the end of the > world but would make it slightly more confusing. > Yeah prctl() sounds sensible, and since we are explicitly adding a marker for guard pages here we can do this as a follow up too without breaking any userland expectations, i.e. 'new feature to make guard pages signal' is not going to contradict the default behaviour. So all makes sense to me, but I do think best as a follow up! :) > -- > Cheers, > > David / dhildenb >
On 23.10.24 11:18, Lorenzo Stoakes wrote: > On Wed, Oct 23, 2024 at 11:13:47AM +0200, David Hildenbrand wrote: >> On 23.10.24 11:06, Vlastimil Babka wrote: >>> On 10/23/24 10:56, Dmitry Vyukov wrote: >>>>> >>>>> Overall while I sympathise with this, it feels dangerous and a pretty major >>>>> change, because there'll be something somewhere that will break because it >>>>> expects faults to be swallowed that we no longer do swallow. >>>>> >>>>> So I'd say it'd be something we should defer, but of course it's a highly >>>>> user-facing change so how easy that would be I don't know. >>>>> >>>>> But I definitely don't think a 'introduce the ability to do cheap PROT_NONE >>>>> guards' series is the place to also fundmentally change how user access >>>>> page faults are handled within the kernel :) >>>> >>>> Will delivering signals on kernel access be a backwards compatible >>>> change? Or will we need a different API? MADV_GUARD_POISON_KERNEL? >>>> It's just somewhat painful to detect/update all userspace if we add >>>> this feature in future. Can we say signal delivery on kernel accesses >>>> is unspecified? >>> >>> Would adding signal delivery to guard PTEs only help enough the ASAN etc >>> usecase? Wouldn't it be instead possible to add some prctl to opt-in the >>> whole ASANized process to deliver all existing segfaults as signals instead >>> of -EFAULT ? >> >> Not sure if it is an "instead", you might have to deliver the signal in >> addition to letting the syscall fail (not that I would be an expert on >> signal delivery :D ). >> >> prctl sounds better, or some way to configure the behavior on VMA ranges; >> otherwise we would need yet another marker, which is not the end of the >> world but would make it slightly more confusing. >> > > Yeah prctl() sounds sensible, and since we are explicitly adding a marker > for guard pages here we can do this as a follow up too without breaking any > userland expectations, i.e. 'new feature to make guard pages signal' is not > going to contradict the default behaviour. > > So all makes sense to me, but I do think best as a follow up! :) Yeah, fully agreed. And my gut feeling is that it might not be that easy ... :) In the end, what we want is *some* notification that a guard PTE was accessed. Likely the notification must not necessarily completely synchronous (although it would be ideal) and it must not be a signal. Maybe having a different way to obtain that information from user space would work.
On Wed, 23 Oct 2024 at 11:29, David Hildenbrand <david@redhat.com> wrote: > > On 23.10.24 11:18, Lorenzo Stoakes wrote: > > On Wed, Oct 23, 2024 at 11:13:47AM +0200, David Hildenbrand wrote: > >> On 23.10.24 11:06, Vlastimil Babka wrote: > >>> On 10/23/24 10:56, Dmitry Vyukov wrote: > >>>>> > >>>>> Overall while I sympathise with this, it feels dangerous and a pretty major > >>>>> change, because there'll be something somewhere that will break because it > >>>>> expects faults to be swallowed that we no longer do swallow. > >>>>> > >>>>> So I'd say it'd be something we should defer, but of course it's a highly > >>>>> user-facing change so how easy that would be I don't know. > >>>>> > >>>>> But I definitely don't think a 'introduce the ability to do cheap PROT_NONE > >>>>> guards' series is the place to also fundmentally change how user access > >>>>> page faults are handled within the kernel :) > >>>> > >>>> Will delivering signals on kernel access be a backwards compatible > >>>> change? Or will we need a different API? MADV_GUARD_POISON_KERNEL? > >>>> It's just somewhat painful to detect/update all userspace if we add > >>>> this feature in future. Can we say signal delivery on kernel accesses > >>>> is unspecified? > >>> > >>> Would adding signal delivery to guard PTEs only help enough the ASAN etc > >>> usecase? Wouldn't it be instead possible to add some prctl to opt-in the > >>> whole ASANized process to deliver all existing segfaults as signals instead > >>> of -EFAULT ? > >> > >> Not sure if it is an "instead", you might have to deliver the signal in > >> addition to letting the syscall fail (not that I would be an expert on > >> signal delivery :D ). > >> > >> prctl sounds better, or some way to configure the behavior on VMA ranges; > >> otherwise we would need yet another marker, which is not the end of the > >> world but would make it slightly more confusing. > >> > > > > Yeah prctl() sounds sensible, and since we are explicitly adding a marker > > for guard pages here we can do this as a follow up too without breaking any > > userland expectations, i.e. 'new feature to make guard pages signal' is not > > going to contradict the default behaviour. > > > > So all makes sense to me, but I do think best as a follow up! :) > > Yeah, fully agreed. And my gut feeling is that it might not be that easy > ... :) > > In the end, what we want is *some* notification that a guard PTE was > accessed. Likely the notification must not necessarily completely > synchronous (although it would be ideal) and it must not be a signal. > > Maybe having a different way to obtain that information from user space > would work. For bug detection tools (like GWP-ASan [1]) it's essential to have useful stack traces. As such, having this signal be synchronous would be more useful. I don't see how one could get a useful stack trace (or other information like what's stashed away in ucontext like CPU registers) if this were asynchronous. [1] https://arxiv.org/pdf/2311.09394
On 23.10.24 13:31, Marco Elver wrote: > On Wed, 23 Oct 2024 at 11:29, David Hildenbrand <david@redhat.com> wrote: >> >> On 23.10.24 11:18, Lorenzo Stoakes wrote: >>> On Wed, Oct 23, 2024 at 11:13:47AM +0200, David Hildenbrand wrote: >>>> On 23.10.24 11:06, Vlastimil Babka wrote: >>>>> On 10/23/24 10:56, Dmitry Vyukov wrote: >>>>>>> >>>>>>> Overall while I sympathise with this, it feels dangerous and a pretty major >>>>>>> change, because there'll be something somewhere that will break because it >>>>>>> expects faults to be swallowed that we no longer do swallow. >>>>>>> >>>>>>> So I'd say it'd be something we should defer, but of course it's a highly >>>>>>> user-facing change so how easy that would be I don't know. >>>>>>> >>>>>>> But I definitely don't think a 'introduce the ability to do cheap PROT_NONE >>>>>>> guards' series is the place to also fundmentally change how user access >>>>>>> page faults are handled within the kernel :) >>>>>> >>>>>> Will delivering signals on kernel access be a backwards compatible >>>>>> change? Or will we need a different API? MADV_GUARD_POISON_KERNEL? >>>>>> It's just somewhat painful to detect/update all userspace if we add >>>>>> this feature in future. Can we say signal delivery on kernel accesses >>>>>> is unspecified? >>>>> >>>>> Would adding signal delivery to guard PTEs only help enough the ASAN etc >>>>> usecase? Wouldn't it be instead possible to add some prctl to opt-in the >>>>> whole ASANized process to deliver all existing segfaults as signals instead >>>>> of -EFAULT ? >>>> >>>> Not sure if it is an "instead", you might have to deliver the signal in >>>> addition to letting the syscall fail (not that I would be an expert on >>>> signal delivery :D ). >>>> >>>> prctl sounds better, or some way to configure the behavior on VMA ranges; >>>> otherwise we would need yet another marker, which is not the end of the >>>> world but would make it slightly more confusing. >>>> >>> >>> Yeah prctl() sounds sensible, and since we are explicitly adding a marker >>> for guard pages here we can do this as a follow up too without breaking any >>> userland expectations, i.e. 'new feature to make guard pages signal' is not >>> going to contradict the default behaviour. >>> >>> So all makes sense to me, but I do think best as a follow up! :) >> >> Yeah, fully agreed. And my gut feeling is that it might not be that easy >> ... :) >> >> In the end, what we want is *some* notification that a guard PTE was >> accessed. Likely the notification must not necessarily completely >> synchronous (although it would be ideal) and it must not be a signal. >> >> Maybe having a different way to obtain that information from user space >> would work. > > For bug detection tools (like GWP-ASan [1]) it's essential to have > useful stack traces. As such, having this signal be synchronous would > be more useful. I don't see how one could get a useful stack trace (or > other information like what's stashed away in ucontext like CPU > registers) if this were asynchronous. Yes, I know. But it would be better than not getting *any* notification except of some syscalls simply failing with -EFAULT, and not having an idea which address was even accessed. Maybe the signal injection is easier than I think, but I somehow doubt it ...
On Wed, Oct 23, 2024 at 01:36:10PM +0200, David Hildenbrand wrote: > On 23.10.24 13:31, Marco Elver wrote: > > On Wed, 23 Oct 2024 at 11:29, David Hildenbrand <david@redhat.com> wrote: > > > > > > On 23.10.24 11:18, Lorenzo Stoakes wrote: > > > > On Wed, Oct 23, 2024 at 11:13:47AM +0200, David Hildenbrand wrote: > > > > > On 23.10.24 11:06, Vlastimil Babka wrote: > > > > > > On 10/23/24 10:56, Dmitry Vyukov wrote: > > > > > > > > > > > > > > > > Overall while I sympathise with this, it feels dangerous and a pretty major > > > > > > > > change, because there'll be something somewhere that will break because it > > > > > > > > expects faults to be swallowed that we no longer do swallow. > > > > > > > > > > > > > > > > So I'd say it'd be something we should defer, but of course it's a highly > > > > > > > > user-facing change so how easy that would be I don't know. > > > > > > > > > > > > > > > > But I definitely don't think a 'introduce the ability to do cheap PROT_NONE > > > > > > > > guards' series is the place to also fundmentally change how user access > > > > > > > > page faults are handled within the kernel :) > > > > > > > > > > > > > > Will delivering signals on kernel access be a backwards compatible > > > > > > > change? Or will we need a different API? MADV_GUARD_POISON_KERNEL? > > > > > > > It's just somewhat painful to detect/update all userspace if we add > > > > > > > this feature in future. Can we say signal delivery on kernel accesses > > > > > > > is unspecified? > > > > > > > > > > > > Would adding signal delivery to guard PTEs only help enough the ASAN etc > > > > > > usecase? Wouldn't it be instead possible to add some prctl to opt-in the > > > > > > whole ASANized process to deliver all existing segfaults as signals instead > > > > > > of -EFAULT ? > > > > > > > > > > Not sure if it is an "instead", you might have to deliver the signal in > > > > > addition to letting the syscall fail (not that I would be an expert on > > > > > signal delivery :D ). > > > > > > > > > > prctl sounds better, or some way to configure the behavior on VMA ranges; > > > > > otherwise we would need yet another marker, which is not the end of the > > > > > world but would make it slightly more confusing. > > > > > > > > > > > > > Yeah prctl() sounds sensible, and since we are explicitly adding a marker > > > > for guard pages here we can do this as a follow up too without breaking any > > > > userland expectations, i.e. 'new feature to make guard pages signal' is not > > > > going to contradict the default behaviour. > > > > > > > > So all makes sense to me, but I do think best as a follow up! :) > > > > > > Yeah, fully agreed. And my gut feeling is that it might not be that easy > > > ... :) > > > > > > In the end, what we want is *some* notification that a guard PTE was > > > accessed. Likely the notification must not necessarily completely > > > synchronous (although it would be ideal) and it must not be a signal. > > > > > > Maybe having a different way to obtain that information from user space > > > would work. > > > > For bug detection tools (like GWP-ASan [1]) it's essential to have > > useful stack traces. As such, having this signal be synchronous would > > be more useful. I don't see how one could get a useful stack trace (or > > other information like what's stashed away in ucontext like CPU > > registers) if this were asynchronous. > > Yes, I know. But it would be better than not getting *any* notification > except of some syscalls simply failing with -EFAULT, and not having an idea > which address was even accessed. > > Maybe the signal injection is easier than I think, but I somehow doubt it > ... Yeah I'm afraid I don't think this series is a place where I can fundamentally change how something so sensitive works in the kernel. It's espeically super sensitive because this is a uAPI change and a wrong decision here could result in guard pages being broken out the gate and I really don't want to risk that. > > -- > Cheers, > > David / dhildenb >