[v2,0/5] implement lightweight guard pages

Message ID	cover.1729440856.git.lorenzo.stoakes@oracle.com (mailing list archive)
Headers	show Received: from mx0b-00069f02.pphosted.com (mx0b-00069f02.pphosted.com [205.220.177.32]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id CF304194A43; Sun, 20 Oct 2024 16:21:09 +0000 (UTC) From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> To: Andrew Morton <akpm@linux-foundation.org> Cc: Suren Baghdasaryan <surenb@google.com>, "Liam R . Howlett" <Liam.Howlett@oracle.com>, Matthew Wilcox <willy@infradead.org>, Vlastimil Babka <vbabka@suse.cz>, "Paul E . McKenney" <paulmck@kernel.org>, Jann Horn <jannh@google.com>, David Hildenbrand <david@redhat.com>, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Muchun Song <muchun.song@linux.dev>, Richard Henderson <richard.henderson@linaro.org>, Ivan Kokshaysky <ink@jurassic.park.msu.ru>, Matt Turner <mattst88@gmail.com>, Thomas Bogendoerfer <tsbogend@alpha.franken.de>, "James E . J . Bottomley" <James.Bottomley@HansenPartnership.com>, Helge Deller <deller@gmx.de>, Chris Zankel <chris@zankel.net>, Max Filippov <jcmvbkbc@gmail.com>, Arnd Bergmann <arnd@arndb.de>, linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, linux-arch@vger.kernel.org, Shuah Khan <shuah@kernel.org>, Christian Brauner <brauner@kernel.org>, linux-kselftest@vger.kernel.org, Sidhartha Kumar <sidhartha.kumar@oracle.com>, Jeff Xu <jeffxu@chromium.org>, Christoph Hellwig <hch@infradead.org>, linux-api@vger.kernel.org, John Hubbard <jhubbard@nvidia.com> Subject: [PATCH v2 0/5] implement lightweight guard pages Date: Sun, 20 Oct 2024 17:20:00 +0100 Message-ID: <cover.1729440856.git.lorenzo.stoakes@oracle.com> Content-Transfer-Encoding: 8bit Content-Type: text/plain Precedence: bulk MIME-Version: 1.0
Series	implement lightweight guard pages \| expand [v2,0/5] implement lightweight guard pages [v2,1/5] mm: pagewalk: add the ability to install PTEs [v2,2/5] mm: add PTE_MARKER_GUARD PTE marker [v2,3/5] mm: madvise: implement lightweight guard page mechanism [v2,4/5] tools: testing: update tools UAPI header for mman-common.h [v2,5/5] selftests/mm: add self tests for guard page feature

Lorenzo Stoakes Oct. 20, 2024, 4:20 p.m. UTC

Userland library functions such as allocators and threading implementations
often require regions of memory to act as 'guard pages' - mappings which,
when accessed, result in a fatal signal being sent to the accessing
process.

The current means by which these are implemented is via a PROT_NONE mmap()
mapping, which provides the required semantics however incur an overhead of
a VMA for each such region.

With a great many processes and threads, this can rapidly add up and incur
a significant memory penalty. It also has the added problem of preventing
merges that might otherwise be permitted.

This series takes a different approach - an idea suggested by Vlasimil
Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the
provenance becomes a little tricky to ascertain after this - please forgive
any omissions!)  - rather than locating the guard pages at the VMA layer,
instead placing them in page tables mapping the required ranges.

Early testing of the prototype version of this code suggests a 5 times
speed up in memory mapping invocations (in conjunction with use of
process_madvise()) and a 13% reduction in VMAs on an entirely idle android
system and unoptimised code.

We expect with optimisation and a loaded system with a larger number of
guard pages this could significantly increase, but in any case these
numbers are encouraging.

This way, rather than having separate VMAs specifying which parts of a
range are guard pages, instead we have a VMA spanning the entire range of
memory a user is permitted to access and including ranges which are to be
'guarded'.

After mapping this, a user can specify which parts of the range should
result in a fatal signal when accessed.

By restricting the ability to specify guard pages to memory mapped by
existing VMAs, we can rely on the mappings being torn down when the
mappings are ultimately unmapped and everything works simply as if the
memory were not faulted in, from the point of view of the containing VMAs.

This mechanism in effect poisons memory ranges similar to hardware memory
poisoning, only it is an entirely software-controlled form of poisoning.

Any poisoned region of memory is also able to 'unpoisoned', that is, to
have its poison markers removed.

The mechanism is implemented via madvise() behaviour - MADV_GUARD_POISON
which simply poisons ranges - and MADV_GUARD_UNPOISON - which clears this
poisoning.

Poisoning can be performed across multiple VMAs and any existing mappings
will be cleared, that is zapped, before installing the poisoned page table
mappings.

There is no concept of 'nested' poisoning, multiple attempts to poison a
range will, after the first poisoning, have no effect.

Importantly, unpoisoning of poisoned ranges has no effect on non-poisoned
memory, so a user can safely unpoison a range of memory and clear only
poison page table mappings leaving the rest intact.

The actual mechanism by which the page table entries are specified makes
use of existing logic - PTE markers, which are used for the userfaultfd
UFFDIO_POISON mechanism.

Unfortunately PTE_MARKER_POISONED is not suited for the guard page
mechanism as it results in VM_FAULT_HWPOISON semantics in the fault
handler, so we add our own specific PTE_MARKER_GUARD and adapt existing
logic to handle it.

We also extend the generic page walk mechanism to allow for installation of
PTEs (carefully restricted to memory management logic only to prevent
unwanted abuse).

We ensure that zapping performed by, for instance, MADV_DONTNEED, does not
remove guard poison markers, nor does forking (except when VM_WIPEONFORK is
specified for a VMA which implies a total removal of memory
characteristics).

It's important to note that the guard page implementation is emphatically
NOT a security feature, so a user can remove the poisoning if they wish. We
simply implement it in such a way as to provide the least surprising
behaviour.

An extensive set of self-tests are provided which ensure behaviour is as
expected and additionally self-documents expected behaviour of poisoned
ranges.

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Suggested-by: Jann Horn <jannh@google.com>
Suggested-by: David Hildenbrand <david@redhat.com>

v2
* The macros in kselftest_harness.h seem to be broken - __EXPECT() is
  terminated by '} while (0); OPTIONAL_HANDLER(_assert)' meaning it is not
  safe in single line if / else or for /which blocks, however working
  around this results in checkpatch producing invalid warnings, as reported
  by Shuah.
* Fixing these macros is out of scope for this series, so compromise and
  instead rewrite test blocks so as to use multiple lines by separating out
  a decl in most cases. This has the side effect of, for the most part,
  making things more readable.
* Heavily document the use of the volatile keyword - we can't avoid
  checkpatch complaining about this, so we explain it, as reported by
  Shuah.
* Updated commit message to highlight that we skip tests we lack
  permissions for, as reported by Shuah.
* Replaced a perror() with ksft_exit_fail_perror(), as reported by Shuah.
* Added user friendly messages to cases where tests are skipped due to lack
  of permissions, as reported by Shuah.
* Update the tool header to include the new MADV_GUARD_POISON/UNPOISON
  defines and directly include asm-generic/mman.h to get the
  platform-neutral versions to ensure we import them.
* Finally fixed Vlastimil's email address in Suggested-by tags from suze to
  suse, as reported by Vlastimil.
* Added linux-api to cc list, as reported by Vlastimil.

v1
* Un-RFC'd as appears no major objections to approach but rather debate on
  implementation.
* Fixed issue with arches which need mmu_context.h and
  tlbfush.h. header imports in pagewalker logic to be able to use
  update_mmu_cache() as reported by the kernel test bot.
* Added comments in page walker logic to clarify who can use
  ops->install_pte and why as well as adding a check_ops_valid() helper
  function, as suggested by Christoph.
* Pass false in full parameter in pte_clear_not_present_full() as suggested
  by Jann.
* Stopped erroneously requiring a write lock for the poison operation as
  suggested by Jann and Suren.
* Moved anon_vma_prepare() to the start of madvise_guard_poison() to be
  consistent with how this is used elsewhere in the kernel as suggested by
  Jann.
* Avoid returning -EAGAIN if we are raced on page faults, just keep looping
  and duck out if a fatal signal is pending or a conditional reschedule is
  needed, as suggested by Jann.
* Avoid needlessly splitting huge PUDs and PMDs by specifying
  ACTION_CONTINUE, as suggested by Jann.
https://lore.kernel.org/all/cover.1729196871.git.lorenzo.stoakes@oracle.com/

RFC
https://lore.kernel.org/all/cover.1727440966.git.lorenzo.stoakes@oracle.com/

Lorenzo Stoakes (5):
  mm: pagewalk: add the ability to install PTEs
  mm: add PTE_MARKER_GUARD PTE marker
  mm: madvise: implement lightweight guard page mechanism
  tools: testing: update tools UAPI header for mman-common.h
  selftests/mm: add self tests for guard page feature

 arch/alpha/include/uapi/asm/mman.h           |    3 +
 arch/mips/include/uapi/asm/mman.h            |    3 +
 arch/parisc/include/uapi/asm/mman.h          |    3 +
 arch/xtensa/include/uapi/asm/mman.h          |    3 +
 include/linux/mm_inline.h                    |    2 +-
 include/linux/pagewalk.h                     |   18 +-
 include/linux/swapops.h                      |   26 +-
 include/uapi/asm-generic/mman-common.h       |    3 +
 mm/hugetlb.c                                 |    3 +
 mm/internal.h                                |    6 +
 mm/madvise.c                                 |  168 +++
 mm/memory.c                                  |   18 +-
 mm/mprotect.c                                |    3 +-
 mm/mseal.c                                   |    1 +
 mm/pagewalk.c                                |  200 ++-
 tools/include/uapi/asm-generic/mman-common.h |    3 +
 tools/testing/selftests/mm/.gitignore        |    1 +
 tools/testing/selftests/mm/Makefile          |    1 +
 tools/testing/selftests/mm/guard-pages.c     | 1228 ++++++++++++++++++
 19 files changed, 1627 insertions(+), 66 deletions(-)
 create mode 100644 tools/testing/selftests/mm/guard-pages.c

--
2.47.0

Florian Weimer Oct. 20, 2024, 5:37 p.m. UTC | #1

* Lorenzo Stoakes:

> Early testing of the prototype version of this code suggests a 5 times
> speed up in memory mapping invocations (in conjunction with use of
> process_madvise()) and a 13% reduction in VMAs on an entirely idle android
> system and unoptimised code.
>
> We expect with optimisation and a loaded system with a larger number of
> guard pages this could significantly increase, but in any case these
> numbers are encouraging.
>
> This way, rather than having separate VMAs specifying which parts of a
> range are guard pages, instead we have a VMA spanning the entire range of
> memory a user is permitted to access and including ranges which are to be
> 'guarded'.
>
> After mapping this, a user can specify which parts of the range should
> result in a fatal signal when accessed.
>
> By restricting the ability to specify guard pages to memory mapped by
> existing VMAs, we can rely on the mappings being torn down when the
> mappings are ultimately unmapped and everything works simply as if the
> memory were not faulted in, from the point of view of the containing VMAs.

We have a glibc (so not Android) dynamic linker bug that asks us to
remove PROT_NONE mappings in mapped shared objects:

  Extra struct vm_area_struct with ---p created when PAGE_SIZE < max-page-size
  <https://sourceware.org/bugzilla/show_bug.cgi?id=31076>

It's slightly different from a guard page because our main goal is to
avoid other mappings to end up in those gaps, which has been shown to
cause odd application behavior in cases where it happens.  If I
understand the series correctly, the kernel would not automatically
attribute those PROT_NONE gaps to the previous or subsequent mapping.
We would have to extend one of the surrounding mapps and apply
MADV_POISON to that over-mapped part.  That doesn't seem too onerous.

Could the ELF loader in the kernel do the same thing for the main
executable and the program loader?

Lorenzo Stoakes Oct. 20, 2024, 7:45 p.m. UTC | #2

On Sun, Oct 20, 2024 at 07:37:54PM +0200, Florian Weimer wrote:
> * Lorenzo Stoakes:
>
> > Early testing of the prototype version of this code suggests a 5 times
> > speed up in memory mapping invocations (in conjunction with use of
> > process_madvise()) and a 13% reduction in VMAs on an entirely idle android
> > system and unoptimised code.
> >
> > We expect with optimisation and a loaded system with a larger number of
> > guard pages this could significantly increase, but in any case these
> > numbers are encouraging.
> >
> > This way, rather than having separate VMAs specifying which parts of a
> > range are guard pages, instead we have a VMA spanning the entire range of
> > memory a user is permitted to access and including ranges which are to be
> > 'guarded'.
> >
> > After mapping this, a user can specify which parts of the range should
> > result in a fatal signal when accessed.
> >
> > By restricting the ability to specify guard pages to memory mapped by
> > existing VMAs, we can rely on the mappings being torn down when the
> > mappings are ultimately unmapped and everything works simply as if the
> > memory were not faulted in, from the point of view of the containing VMAs.
>
> We have a glibc (so not Android) dynamic linker bug that asks us to
> remove PROT_NONE mappings in mapped shared objects:
>
>   Extra struct vm_area_struct with ---p created when PAGE_SIZE < max-page-size
>   <https://sourceware.org/bugzilla/show_bug.cgi?id=31076>
>
> It's slightly different from a guard page because our main goal is to
> avoid other mappings to end up in those gaps, which has been shown to
> cause odd application behavior in cases where it happens.  If I
> understand the series correctly, the kernel would not automatically
> attribute those PROT_NONE gaps to the previous or subsequent mapping.
> We would have to extend one of the surrounding mapps and apply
> MADV_POISON to that over-mapped part.  That doesn't seem too onerous.
>
> Could the ELF loader in the kernel do the same thing for the main
> executable and the program loader?

Currently this implementation is only available for private anonymous
memory. We may look at extending it to shmem and file-backed memory in the
future but there are a whole host of things to overcome to make that work
so it's one step at a time! :)

Dmitry Vyukov Oct. 23, 2024, 6:24 a.m. UTC | #3

Hi Florian, Lorenzo,

This looks great!

What I am VERY interested in is if poisoned pages cause SIGSEGV even when
the access happens in the kernel. Namely, the syscall still returns EFAULT,
but also SIGSEGV is queued on return to user-space.

Catching bad accesses in system calls is currently the weak spot for
all user-space bug detection tools (GWP-ASan, libefence, libefency, etc).
It's almost possible with userfaultfd, but catching faults in the kernel
requires admin capability, so not really an option for generic bug
detection tools (+inconvinience of userfaultfd setup/handler).
Intercepting all EFAULT from syscalls is not generally possible
(w/o ptrace, usually not an option as well), and EFAULT does not always
mean a bug.

Triggering SIGSEGV even in syscalls would be not just a performance
optimization, but a new useful capability that would allow it to catch
more bugs.

Thanks

David Hildenbrand Oct. 23, 2024, 7:19 a.m. UTC | #4

On 23.10.24 08:24, Dmitry Vyukov wrote:
> Hi Florian, Lorenzo,
> 
> This looks great!
> 
> What I am VERY interested in is if poisoned pages cause SIGSEGV even when
> the access happens in the kernel. Namely, the syscall still returns EFAULT,
> but also SIGSEGV is queued on return to user-space.
> 
> Catching bad accesses in system calls is currently the weak spot for
> all user-space bug detection tools (GWP-ASan, libefence, libefency, etc).
> It's almost possible with userfaultfd, but catching faults in the kernel
> requires admin capability, so not really an option for generic bug
> detection tools (+inconvinience of userfaultfd setup/handler).
> Intercepting all EFAULT from syscalls is not generally possible
> (w/o ptrace, usually not an option as well), and EFAULT does not always
> mean a bug.
> 
> Triggering SIGSEGV even in syscalls would be not just a performance
> optimization, but a new useful capability that would allow it to catch
> more bugs.

Right, we discussed that offline also as a possible extension to the 
userfaultfd SIGBUS mode.

I did not look into that yet, but I was wonder if there could be cases 
where a different process could trigger that SIGSEGV, and how to (and if 
to) handle that.

For example, ptrace (access_remote_vm()) -> GUP likely can trigger that. 
I think with userfaultfd() we will currently return -EFAULT, because we 
call get_user_page_vma_remote() that is not prepared for dropping the 
mmap lock. Possibly that is the right thing to do, but not sure :)

These "remote" faults set FOLL_REMOTE -> FAULT_FLAG_REMOTE, so we might 
be able to distinguish them and perform different handling.

Lorenzo Stoakes Oct. 23, 2024, 8:11 a.m. UTC | #5

+cc Linus as reference a commit of his below...

On Wed, Oct 23, 2024 at 09:19:03AM +0200, David Hildenbrand wrote:
> On 23.10.24 08:24, Dmitry Vyukov wrote:
> > Hi Florian, Lorenzo,
> >
> > This looks great!

Thanks!

> >
> > What I am VERY interested in is if poisoned pages cause SIGSEGV even when
> > the access happens in the kernel. Namely, the syscall still returns EFAULT,
> > but also SIGSEGV is queued on return to user-space.

Yeah we don't in any way.

I think adding something like this would be a bit of its own project.

The fault andler for this is in handle_pte_marker() in mm/memory.c, where
we do the following:

	/* Hitting a guard page is always a fatal condition. */
	if (marker & PTE_MARKER_GUARD)
		return VM_FAULT_SIGSEGV;

So basically we pass this back to whoever invoked the fault. For uaccess we
end up in arch-specific code that eventually checks exception tables
etc. and for x86-64 that's kernelmode_fixup_or_oops().

There used to be a sig_on_uaccess_err in the x86-specific thread_struct
that let you propagate it but Linus pulled it out in commit 02b670c1f88e
("x86/mm: Remove broken vsyscall emulation code from the page fault code")
where it was presumably used for vsyscall.

Of course we could just get something much higher up the stack to send the
signal, but we'd need to be careful we weren't breaking anything doing
it...

I address GUP below.

> >
> > Catching bad accesses in system calls is currently the weak spot for
> > all user-space bug detection tools (GWP-ASan, libefence, libefency, etc).
> > It's almost possible with userfaultfd, but catching faults in the kernel
> > requires admin capability, so not really an option for generic bug
> > detection tools (+inconvinience of userfaultfd setup/handler).
> > Intercepting all EFAULT from syscalls is not generally possible
> > (w/o ptrace, usually not an option as well), and EFAULT does not always
> > mean a bug.
> >
> > Triggering SIGSEGV even in syscalls would be not just a performance
> > optimization, but a new useful capability that would allow it to catch
> > more bugs.
>
> Right, we discussed that offline also as a possible extension to the
> userfaultfd SIGBUS mode.
>
> I did not look into that yet, but I was wonder if there could be cases where
> a different process could trigger that SIGSEGV, and how to (and if to)
> handle that.
>
> For example, ptrace (access_remote_vm()) -> GUP likely can trigger that. I
> think with userfaultfd() we will currently return -EFAULT, because we call
> get_user_page_vma_remote() that is not prepared for dropping the mmap lock.
> Possibly that is the right thing to do, but not sure :)
>
> These "remote" faults set FOLL_REMOTE -> FAULT_FLAG_REMOTE, so we might be
> able to distinguish them and perform different handling.

So all GUP will return -EFAULT when hitting guard pages unless we change
something.

In GUP we handle this in faultin_page():

	if (ret & VM_FAULT_ERROR) {
		int err = vm_fault_to_errno(ret, flags);

		if (err)
			return err;
		BUG();
	}

And vm_fault_to_errno() is:

static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
{
	if (vm_fault & VM_FAULT_OOM)
		return -ENOMEM;
	if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
		return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
	if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
		return -EFAULT;
	return 0;
}

Again, I think if we wanted special handling here we'd need to probably
propagate that fault from higher up, but yes we'd need to for one
definitely not do so if it's remote but I worry about other cases.

>
> --
> Cheers,
>
> David / dhildenb
>

Overall while I sympathise with this, it feels dangerous and a pretty major
change, because there'll be something somewhere that will break because it
expects faults to be swallowed that we no longer do swallow.

So I'd say it'd be something we should defer, but of course it's a highly
user-facing change so how easy that would be I don't know.

But I definitely don't think a 'introduce the ability to do cheap PROT_NONE
guards' series is the place to also fundmentally change how user access
page faults are handled within the kernel :)

Dmitry Vyukov Oct. 23, 2024, 8:56 a.m. UTC | #6

On Wed, 23 Oct 2024 at 10:12, Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> +cc Linus as reference a commit of his below...
>
> On Wed, Oct 23, 2024 at 09:19:03AM +0200, David Hildenbrand wrote:
> > On 23.10.24 08:24, Dmitry Vyukov wrote:
> > > Hi Florian, Lorenzo,
> > >
> > > This looks great!
>
> Thanks!
>
> > >
> > > What I am VERY interested in is if poisoned pages cause SIGSEGV even when
> > > the access happens in the kernel. Namely, the syscall still returns EFAULT,
> > > but also SIGSEGV is queued on return to user-space.
>
> Yeah we don't in any way.
>
> I think adding something like this would be a bit of its own project.

I can totally understand this.

> The fault andler for this is in handle_pte_marker() in mm/memory.c, where
> we do the following:
>
>         /* Hitting a guard page is always a fatal condition. */
>         if (marker & PTE_MARKER_GUARD)
>                 return VM_FAULT_SIGSEGV;
>
> So basically we pass this back to whoever invoked the fault. For uaccess we
> end up in arch-specific code that eventually checks exception tables
> etc. and for x86-64 that's kernelmode_fixup_or_oops().
>
> There used to be a sig_on_uaccess_err in the x86-specific thread_struct
> that let you propagate it but Linus pulled it out in commit 02b670c1f88e
> ("x86/mm: Remove broken vsyscall emulation code from the page fault code")
> where it was presumably used for vsyscall.
>
> Of course we could just get something much higher up the stack to send the
> signal, but we'd need to be careful we weren't breaking anything doing
> it...

Can setting TIF_NOTIFY_RESUME and then doing the rest when returning
to userspace help here?

> I address GUP below.
>
> > >
> > > Catching bad accesses in system calls is currently the weak spot for
> > > all user-space bug detection tools (GWP-ASan, libefence, libefency, etc).
> > > It's almost possible with userfaultfd, but catching faults in the kernel
> > > requires admin capability, so not really an option for generic bug
> > > detection tools (+inconvinience of userfaultfd setup/handler).
> > > Intercepting all EFAULT from syscalls is not generally possible
> > > (w/o ptrace, usually not an option as well), and EFAULT does not always
> > > mean a bug.
> > >
> > > Triggering SIGSEGV even in syscalls would be not just a performance
> > > optimization, but a new useful capability that would allow it to catch
> > > more bugs.
> >
> > Right, we discussed that offline also as a possible extension to the
> > userfaultfd SIGBUS mode.
> >
> > I did not look into that yet, but I was wonder if there could be cases where
> > a different process could trigger that SIGSEGV, and how to (and if to)
> > handle that.
> >
> > For example, ptrace (access_remote_vm()) -> GUP likely can trigger that. I
> > think with userfaultfd() we will currently return -EFAULT, because we call
> > get_user_page_vma_remote() that is not prepared for dropping the mmap lock.
> > Possibly that is the right thing to do, but not sure :)

That's a good corner case.
I guess also process_vm_readv/writev.
Not triggering the signal in these cases looks like the right thing to do.

> > These "remote" faults set FOLL_REMOTE -> FAULT_FLAG_REMOTE, so we might be
> > able to distinguish them and perform different handling.
>
> So all GUP will return -EFAULT when hitting guard pages unless we change
> something.
>
> In GUP we handle this in faultin_page():
>
>         if (ret & VM_FAULT_ERROR) {
>                 int err = vm_fault_to_errno(ret, flags);
>
>                 if (err)
>                         return err;
>                 BUG();
>         }
>
> And vm_fault_to_errno() is:
>
> static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
> {
>         if (vm_fault & VM_FAULT_OOM)
>                 return -ENOMEM;
>         if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
>                 return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
>         if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
>                 return -EFAULT;
>         return 0;
> }
>
> Again, I think if we wanted special handling here we'd need to probably
> propagate that fault from higher up, but yes we'd need to for one
> definitely not do so if it's remote but I worry about other cases.
>
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >
>
> Overall while I sympathise with this, it feels dangerous and a pretty major
> change, because there'll be something somewhere that will break because it
> expects faults to be swallowed that we no longer do swallow.
>
> So I'd say it'd be something we should defer, but of course it's a highly
> user-facing change so how easy that would be I don't know.
>
> But I definitely don't think a 'introduce the ability to do cheap PROT_NONE
> guards' series is the place to also fundmentally change how user access
> page faults are handled within the kernel :)

Will delivering signals on kernel access be a backwards compatible
change? Or will we need a different API? MADV_GUARD_POISON_KERNEL?
It's just somewhat painful to detect/update all userspace if we add
this feature in future. Can we say signal delivery on kernel accesses
is unspecified?

Vlastimil Babka Oct. 23, 2024, 9:06 a.m. UTC | #7

On 10/23/24 10:56, Dmitry Vyukov wrote:
>>
>> Overall while I sympathise with this, it feels dangerous and a pretty major
>> change, because there'll be something somewhere that will break because it
>> expects faults to be swallowed that we no longer do swallow.
>>
>> So I'd say it'd be something we should defer, but of course it's a highly
>> user-facing change so how easy that would be I don't know.
>>
>> But I definitely don't think a 'introduce the ability to do cheap PROT_NONE
>> guards' series is the place to also fundmentally change how user access
>> page faults are handled within the kernel :)
> 
> Will delivering signals on kernel access be a backwards compatible
> change? Or will we need a different API? MADV_GUARD_POISON_KERNEL?
> It's just somewhat painful to detect/update all userspace if we add
> this feature in future. Can we say signal delivery on kernel accesses
> is unspecified?

Would adding signal delivery to guard PTEs only help enough the ASAN etc
usecase? Wouldn't it be instead possible to add some prctl to opt-in the
whole ASANized process to deliver all existing segfaults as signals instead
of -EFAULT ?

David Hildenbrand Oct. 23, 2024, 9:13 a.m. UTC | #8

On 23.10.24 11:06, Vlastimil Babka wrote:
> On 10/23/24 10:56, Dmitry Vyukov wrote:
>>>
>>> Overall while I sympathise with this, it feels dangerous and a pretty major
>>> change, because there'll be something somewhere that will break because it
>>> expects faults to be swallowed that we no longer do swallow.
>>>
>>> So I'd say it'd be something we should defer, but of course it's a highly
>>> user-facing change so how easy that would be I don't know.
>>>
>>> But I definitely don't think a 'introduce the ability to do cheap PROT_NONE
>>> guards' series is the place to also fundmentally change how user access
>>> page faults are handled within the kernel :)
>>
>> Will delivering signals on kernel access be a backwards compatible
>> change? Or will we need a different API? MADV_GUARD_POISON_KERNEL?
>> It's just somewhat painful to detect/update all userspace if we add
>> this feature in future. Can we say signal delivery on kernel accesses
>> is unspecified?
> 
> Would adding signal delivery to guard PTEs only help enough the ASAN etc
> usecase? Wouldn't it be instead possible to add some prctl to opt-in the
> whole ASANized process to deliver all existing segfaults as signals instead
> of -EFAULT ?

Not sure if it is an "instead", you might have to deliver the signal in 
addition to letting the syscall fail (not that I would be an expert on 
signal delivery :D ).

prctl sounds better, or some way to configure the behavior on VMA 
ranges; otherwise we would need yet another marker, which is not the end 
of the world but would make it slightly more confusing.

Dmitry Vyukov Oct. 23, 2024, 9:17 a.m. UTC | #9

On Wed, 23 Oct 2024 at 11:06, Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 10/23/24 10:56, Dmitry Vyukov wrote:
> >>
> >> Overall while I sympathise with this, it feels dangerous and a pretty major
> >> change, because there'll be something somewhere that will break because it
> >> expects faults to be swallowed that we no longer do swallow.
> >>
> >> So I'd say it'd be something we should defer, but of course it's a highly
> >> user-facing change so how easy that would be I don't know.
> >>
> >> But I definitely don't think a 'introduce the ability to do cheap PROT_NONE
> >> guards' series is the place to also fundmentally change how user access
> >> page faults are handled within the kernel :)
> >
> > Will delivering signals on kernel access be a backwards compatible
> > change? Or will we need a different API? MADV_GUARD_POISON_KERNEL?
> > It's just somewhat painful to detect/update all userspace if we add
> > this feature in future. Can we say signal delivery on kernel accesses
> > is unspecified?
>
> Would adding signal delivery to guard PTEs only help enough the ASAN etc
> usecase? Wouldn't it be instead possible to add some prctl to opt-in the
> whole ASANized process to deliver all existing segfaults as signals instead
> of -EFAULT ?

ASAN per se does not need this (it does not use page protection).
However, if you mean bug detection tools in general, then, yes, that's
what I had in mind.
There are also things like stack guard pages in libc that would
benefit from that as well.

But I observed that some libraries intentionally use EFAULT to probe
for memory readability, i.e. use some cheap syscall to probe memory
before reading it. So changing behavior globally may not work.

Lorenzo Stoakes Oct. 23, 2024, 9:18 a.m. UTC | #10

On Wed, Oct 23, 2024 at 11:13:47AM +0200, David Hildenbrand wrote:
> On 23.10.24 11:06, Vlastimil Babka wrote:
> > On 10/23/24 10:56, Dmitry Vyukov wrote:
> > > >
> > > > Overall while I sympathise with this, it feels dangerous and a pretty major
> > > > change, because there'll be something somewhere that will break because it
> > > > expects faults to be swallowed that we no longer do swallow.
> > > >
> > > > So I'd say it'd be something we should defer, but of course it's a highly
> > > > user-facing change so how easy that would be I don't know.
> > > >
> > > > But I definitely don't think a 'introduce the ability to do cheap PROT_NONE
> > > > guards' series is the place to also fundmentally change how user access
> > > > page faults are handled within the kernel :)
> > >
> > > Will delivering signals on kernel access be a backwards compatible
> > > change? Or will we need a different API? MADV_GUARD_POISON_KERNEL?
> > > It's just somewhat painful to detect/update all userspace if we add
> > > this feature in future. Can we say signal delivery on kernel accesses
> > > is unspecified?
> >
> > Would adding signal delivery to guard PTEs only help enough the ASAN etc
> > usecase? Wouldn't it be instead possible to add some prctl to opt-in the
> > whole ASANized process to deliver all existing segfaults as signals instead
> > of -EFAULT ?
>
> Not sure if it is an "instead", you might have to deliver the signal in
> addition to letting the syscall fail (not that I would be an expert on
> signal delivery :D ).
>
> prctl sounds better, or some way to configure the behavior on VMA ranges;
> otherwise we would need yet another marker, which is not the end of the
> world but would make it slightly more confusing.
>

Yeah prctl() sounds sensible, and since we are explicitly adding a marker
for guard pages here we can do this as a follow up too without breaking any
userland expectations, i.e. 'new feature to make guard pages signal' is not
going to contradict the default behaviour.

So all makes sense to me, but I do think best as a follow up! :)

> --
> Cheers,
>
> David / dhildenb
>

David Hildenbrand Oct. 23, 2024, 9:29 a.m. UTC | #11

On 23.10.24 11:18, Lorenzo Stoakes wrote:
> On Wed, Oct 23, 2024 at 11:13:47AM +0200, David Hildenbrand wrote:
>> On 23.10.24 11:06, Vlastimil Babka wrote:
>>> On 10/23/24 10:56, Dmitry Vyukov wrote:
>>>>>
>>>>> Overall while I sympathise with this, it feels dangerous and a pretty major
>>>>> change, because there'll be something somewhere that will break because it
>>>>> expects faults to be swallowed that we no longer do swallow.
>>>>>
>>>>> So I'd say it'd be something we should defer, but of course it's a highly
>>>>> user-facing change so how easy that would be I don't know.
>>>>>
>>>>> But I definitely don't think a 'introduce the ability to do cheap PROT_NONE
>>>>> guards' series is the place to also fundmentally change how user access
>>>>> page faults are handled within the kernel :)
>>>>
>>>> Will delivering signals on kernel access be a backwards compatible
>>>> change? Or will we need a different API? MADV_GUARD_POISON_KERNEL?
>>>> It's just somewhat painful to detect/update all userspace if we add
>>>> this feature in future. Can we say signal delivery on kernel accesses
>>>> is unspecified?
>>>
>>> Would adding signal delivery to guard PTEs only help enough the ASAN etc
>>> usecase? Wouldn't it be instead possible to add some prctl to opt-in the
>>> whole ASANized process to deliver all existing segfaults as signals instead
>>> of -EFAULT ?
>>
>> Not sure if it is an "instead", you might have to deliver the signal in
>> addition to letting the syscall fail (not that I would be an expert on
>> signal delivery :D ).
>>
>> prctl sounds better, or some way to configure the behavior on VMA ranges;
>> otherwise we would need yet another marker, which is not the end of the
>> world but would make it slightly more confusing.
>>
> 
> Yeah prctl() sounds sensible, and since we are explicitly adding a marker
> for guard pages here we can do this as a follow up too without breaking any
> userland expectations, i.e. 'new feature to make guard pages signal' is not
> going to contradict the default behaviour.
> 
> So all makes sense to me, but I do think best as a follow up! :)

Yeah, fully agreed. And my gut feeling is that it might not be that easy 
... :)

In the end, what we want is *some* notification that a guard PTE was 
accessed. Likely the notification must not necessarily completely 
synchronous (although it would be ideal) and it must not be a signal.

Maybe having a different way to obtain that information from user space 
would work.

Marco Elver Oct. 23, 2024, 11:31 a.m. UTC | #12

On Wed, 23 Oct 2024 at 11:29, David Hildenbrand <david@redhat.com> wrote:
>
> On 23.10.24 11:18, Lorenzo Stoakes wrote:
> > On Wed, Oct 23, 2024 at 11:13:47AM +0200, David Hildenbrand wrote:
> >> On 23.10.24 11:06, Vlastimil Babka wrote:
> >>> On 10/23/24 10:56, Dmitry Vyukov wrote:
> >>>>>
> >>>>> Overall while I sympathise with this, it feels dangerous and a pretty major
> >>>>> change, because there'll be something somewhere that will break because it
> >>>>> expects faults to be swallowed that we no longer do swallow.
> >>>>>
> >>>>> So I'd say it'd be something we should defer, but of course it's a highly
> >>>>> user-facing change so how easy that would be I don't know.
> >>>>>
> >>>>> But I definitely don't think a 'introduce the ability to do cheap PROT_NONE
> >>>>> guards' series is the place to also fundmentally change how user access
> >>>>> page faults are handled within the kernel :)
> >>>>
> >>>> Will delivering signals on kernel access be a backwards compatible
> >>>> change? Or will we need a different API? MADV_GUARD_POISON_KERNEL?
> >>>> It's just somewhat painful to detect/update all userspace if we add
> >>>> this feature in future. Can we say signal delivery on kernel accesses
> >>>> is unspecified?
> >>>
> >>> Would adding signal delivery to guard PTEs only help enough the ASAN etc
> >>> usecase? Wouldn't it be instead possible to add some prctl to opt-in the
> >>> whole ASANized process to deliver all existing segfaults as signals instead
> >>> of -EFAULT ?
> >>
> >> Not sure if it is an "instead", you might have to deliver the signal in
> >> addition to letting the syscall fail (not that I would be an expert on
> >> signal delivery :D ).
> >>
> >> prctl sounds better, or some way to configure the behavior on VMA ranges;
> >> otherwise we would need yet another marker, which is not the end of the
> >> world but would make it slightly more confusing.
> >>
> >
> > Yeah prctl() sounds sensible, and since we are explicitly adding a marker
> > for guard pages here we can do this as a follow up too without breaking any
> > userland expectations, i.e. 'new feature to make guard pages signal' is not
> > going to contradict the default behaviour.
> >
> > So all makes sense to me, but I do think best as a follow up! :)
>
> Yeah, fully agreed. And my gut feeling is that it might not be that easy
> ... :)
>
> In the end, what we want is *some* notification that a guard PTE was
> accessed. Likely the notification must not necessarily completely
> synchronous (although it would be ideal) and it must not be a signal.
>
> Maybe having a different way to obtain that information from user space
> would work.

For bug detection tools (like GWP-ASan [1]) it's essential to have
useful stack traces. As such, having this signal be synchronous would
be more useful. I don't see how one could get a useful stack trace (or
other information like what's stashed away in ucontext like CPU
registers) if this were asynchronous.

[1] https://arxiv.org/pdf/2311.09394

David Hildenbrand Oct. 23, 2024, 11:36 a.m. UTC | #13

On 23.10.24 13:31, Marco Elver wrote:
> On Wed, 23 Oct 2024 at 11:29, David Hildenbrand <david@redhat.com> wrote:
>>
>> On 23.10.24 11:18, Lorenzo Stoakes wrote:
>>> On Wed, Oct 23, 2024 at 11:13:47AM +0200, David Hildenbrand wrote:
>>>> On 23.10.24 11:06, Vlastimil Babka wrote:
>>>>> On 10/23/24 10:56, Dmitry Vyukov wrote:
>>>>>>>
>>>>>>> Overall while I sympathise with this, it feels dangerous and a pretty major
>>>>>>> change, because there'll be something somewhere that will break because it
>>>>>>> expects faults to be swallowed that we no longer do swallow.
>>>>>>>
>>>>>>> So I'd say it'd be something we should defer, but of course it's a highly
>>>>>>> user-facing change so how easy that would be I don't know.
>>>>>>>
>>>>>>> But I definitely don't think a 'introduce the ability to do cheap PROT_NONE
>>>>>>> guards' series is the place to also fundmentally change how user access
>>>>>>> page faults are handled within the kernel :)
>>>>>>
>>>>>> Will delivering signals on kernel access be a backwards compatible
>>>>>> change? Or will we need a different API? MADV_GUARD_POISON_KERNEL?
>>>>>> It's just somewhat painful to detect/update all userspace if we add
>>>>>> this feature in future. Can we say signal delivery on kernel accesses
>>>>>> is unspecified?
>>>>>
>>>>> Would adding signal delivery to guard PTEs only help enough the ASAN etc
>>>>> usecase? Wouldn't it be instead possible to add some prctl to opt-in the
>>>>> whole ASANized process to deliver all existing segfaults as signals instead
>>>>> of -EFAULT ?
>>>>
>>>> Not sure if it is an "instead", you might have to deliver the signal in
>>>> addition to letting the syscall fail (not that I would be an expert on
>>>> signal delivery :D ).
>>>>
>>>> prctl sounds better, or some way to configure the behavior on VMA ranges;
>>>> otherwise we would need yet another marker, which is not the end of the
>>>> world but would make it slightly more confusing.
>>>>
>>>
>>> Yeah prctl() sounds sensible, and since we are explicitly adding a marker
>>> for guard pages here we can do this as a follow up too without breaking any
>>> userland expectations, i.e. 'new feature to make guard pages signal' is not
>>> going to contradict the default behaviour.
>>>
>>> So all makes sense to me, but I do think best as a follow up! :)
>>
>> Yeah, fully agreed. And my gut feeling is that it might not be that easy
>> ... :)
>>
>> In the end, what we want is *some* notification that a guard PTE was
>> accessed. Likely the notification must not necessarily completely
>> synchronous (although it would be ideal) and it must not be a signal.
>>
>> Maybe having a different way to obtain that information from user space
>> would work.
> 
> For bug detection tools (like GWP-ASan [1]) it's essential to have
> useful stack traces. As such, having this signal be synchronous would
> be more useful. I don't see how one could get a useful stack trace (or
> other information like what's stashed away in ucontext like CPU
> registers) if this were asynchronous.

Yes, I know. But it would be better than not getting *any* notification 
except of some syscalls simply failing with -EFAULT, and not having an 
idea which address was even accessed.

Maybe the signal injection is easier than I think, but I somehow doubt 
it ...

Lorenzo Stoakes Oct. 23, 2024, 11:40 a.m. UTC | #14

On Wed, Oct 23, 2024 at 01:36:10PM +0200, David Hildenbrand wrote:
> On 23.10.24 13:31, Marco Elver wrote:
> > On Wed, 23 Oct 2024 at 11:29, David Hildenbrand <david@redhat.com> wrote:
> > >
> > > On 23.10.24 11:18, Lorenzo Stoakes wrote:
> > > > On Wed, Oct 23, 2024 at 11:13:47AM +0200, David Hildenbrand wrote:
> > > > > On 23.10.24 11:06, Vlastimil Babka wrote:
> > > > > > On 10/23/24 10:56, Dmitry Vyukov wrote:
> > > > > > > >
> > > > > > > > Overall while I sympathise with this, it feels dangerous and a pretty major
> > > > > > > > change, because there'll be something somewhere that will break because it
> > > > > > > > expects faults to be swallowed that we no longer do swallow.
> > > > > > > >
> > > > > > > > So I'd say it'd be something we should defer, but of course it's a highly
> > > > > > > > user-facing change so how easy that would be I don't know.
> > > > > > > >
> > > > > > > > But I definitely don't think a 'introduce the ability to do cheap PROT_NONE
> > > > > > > > guards' series is the place to also fundmentally change how user access
> > > > > > > > page faults are handled within the kernel :)
> > > > > > >
> > > > > > > Will delivering signals on kernel access be a backwards compatible
> > > > > > > change? Or will we need a different API? MADV_GUARD_POISON_KERNEL?
> > > > > > > It's just somewhat painful to detect/update all userspace if we add
> > > > > > > this feature in future. Can we say signal delivery on kernel accesses
> > > > > > > is unspecified?
> > > > > >
> > > > > > Would adding signal delivery to guard PTEs only help enough the ASAN etc
> > > > > > usecase? Wouldn't it be instead possible to add some prctl to opt-in the
> > > > > > whole ASANized process to deliver all existing segfaults as signals instead
> > > > > > of -EFAULT ?
> > > > >
> > > > > Not sure if it is an "instead", you might have to deliver the signal in
> > > > > addition to letting the syscall fail (not that I would be an expert on
> > > > > signal delivery :D ).
> > > > >
> > > > > prctl sounds better, or some way to configure the behavior on VMA ranges;
> > > > > otherwise we would need yet another marker, which is not the end of the
> > > > > world but would make it slightly more confusing.
> > > > >
> > > >
> > > > Yeah prctl() sounds sensible, and since we are explicitly adding a marker
> > > > for guard pages here we can do this as a follow up too without breaking any
> > > > userland expectations, i.e. 'new feature to make guard pages signal' is not
> > > > going to contradict the default behaviour.
> > > >
> > > > So all makes sense to me, but I do think best as a follow up! :)
> > >
> > > Yeah, fully agreed. And my gut feeling is that it might not be that easy
> > > ... :)
> > >
> > > In the end, what we want is *some* notification that a guard PTE was
> > > accessed. Likely the notification must not necessarily completely
> > > synchronous (although it would be ideal) and it must not be a signal.
> > >
> > > Maybe having a different way to obtain that information from user space
> > > would work.
> >
> > For bug detection tools (like GWP-ASan [1]) it's essential to have
> > useful stack traces. As such, having this signal be synchronous would
> > be more useful. I don't see how one could get a useful stack trace (or
> > other information like what's stashed away in ucontext like CPU
> > registers) if this were asynchronous.
>
> Yes, I know. But it would be better than not getting *any* notification
> except of some syscalls simply failing with -EFAULT, and not having an idea
> which address was even accessed.
>
> Maybe the signal injection is easier than I think, but I somehow doubt it
> ...

Yeah I'm afraid I don't think this series is a place where I can
fundamentally change how something so sensitive works in the kernel.

It's espeically super sensitive because this is a uAPI change and a wrong
decision here could result in guard pages being broken out the gate and I
really don't want to risk that.

>
> --
> Cheers,
>
> David / dhildenb
>

[v2,0/5] implement lightweight guard pages

Message

Comments