mbox series

[0/4] mm: permit guard regions for file-backed/shmem mappings

Message ID cover.1739469950.git.lorenzo.stoakes@oracle.com (mailing list archive)
Headers show
Series mm: permit guard regions for file-backed/shmem mappings | expand

Message

Lorenzo Stoakes Feb. 13, 2025, 6:16 p.m. UTC
The guard regions feature was initially implemented to support anonymous
mappings only, excluding shmem.

This was done such as to introduce the feature carefully and incrementally
and to be conservative when considering the various caveats and corner
cases that are applicable to file-backed mappings but not to anonymous
ones.

Now this feature has landed in 6.13, it is time to revisit this and to
extend this functionality to file-backed and shmem mappings.

In order to make this maximally useful, and since one may map file-backed
mappings read-only (for instance ELF images), we also remove the
restriction on read-only mappings and permit the establishment of guard
regions in any non-hugetlb, non-mlock()'d mapping.

It is permissible to permit the establishment of guard regions in read-only
mappings because the guard regions only reduce access to the mapping, and
when removed simply reinstate the existing attributes of the underlying
VMA, meaning no access violations can occur.

While the change in kernel code introduced in this series is small, the
majority of the effort here is spent in extending the testing to assert
that the feature works correctly across numerous file-backed mapping
scenarios.

Every single guard region self-test performed against anonymous memory
(which is relevant and not anon-only) has now been updated to also be
performed against shmem and a mapping of a file in the working directory.

This confirms that all cases also function correctly for file-backed guard
regions.

In addition a number of other tests are added for specific file-backed
mapping scenarios.

There are a number of other concerns that one might have with regard to
guard regions, addressed below:

Readahead
~~~~~~~~~

Readahead is a process through which the page cache is populated on the
assumption that sequential reads will occur, thus amortising I/O and,
through a clever use of the PG_readahead folio flag establishing during
major fault and checked upon minor fault, provides for asynchronous I/O to
occur as dat is processed, reducing I/O stalls as data is faulted in.

Guard regions do not alter this mechanism which operations at the folio and
fault level, but do of course prevent the faulting of folios that would
otherwise be mapped.

In the instance of a major fault prior to a guard region, synchronous
readahead will occur including populating folios in the page cache which
the guard regions will, in the case of the mapping in question, prevent
access to.

In addition, if PG_readahead is placed in a folio that is now inaccessible,
this will prevent asynchronous readahead from occurring as it would
otherwise do.

However, there are mechanisms for heuristically resetting this within
readahead regardless, which will 'recover' correct readahead behaviour.

Readahead presumes sequential data access, the presence of a guard region
clearly indicates that, at least in the guard region, no such sequential
access will occur, as it cannot occur there.

So this should have very little impact on any real workload. The far more
important point is as to whether readahead causes incorrect or
inappropriate mapping of ranges disallowed by the presence of guard
regions - this is not the case, as readahead does not 'pre-fault' memory in
this fashion.

At any rate, any mechanism which would attempt to do so would hit the usual
page fault paths, which correctly handle PTE markers as with anonymous
mappings.

Fault-Around
~~~~~~~~~~~~

The fault-around logic, in a similar vein to readahead, attempts to improve
efficiency with regard to file-backed memory mappings, however it differs
in that it does not try to fetch folios into the page cache that are about
to be accessed, but rather pre-maps a range of folios around the faulting
address.

Guard regions making use of PTE markers makes this relatively trivial, as
this case is already handled - see filemap_map_folio_range() and
filemap_map_order0_folio() - in both instances, the solution is to simply
keep the established page table mappings and let the fault handler take
care of PTE markers, as per the comment:

	/*
	 * NOTE: If there're PTE markers, we'll leave them to be
	 * handled in the specific fault path, and it'll prohibit
	 * the fault-around logic.
	 */

This works, as establishing guard regions results in page table mappings
with PTE markers, and clearing them removes them.

Truncation
~~~~~~~~~~

File truncation will not eliminate existing guard regions, as the
truncation operation will ultimately zap the range via
unmap_mapping_range(), which specifically excludes PTE markers.

Zapping
~~~~~~~

Zapping is, as with anonymous mappings, handled by zap_nonpresent_ptes(),
which specifically deals with guard entries, leaving them intact except in
instances such as process teardown or munmap() where they need to be
removed.

Reclaim
~~~~~~~

When reclaim is performed on file-backed folios, it ultimately invokes
try_to_unmap_one() via the rmap. If the folio is non-large, then map_pte()
will ultimately abort the operation for the guard region mapping. If large,
then check_pte() will determine that this is a non-device private
entry/device-exclusive entry 'swap' PTE and thus abort the operation in
that instance.

Therefore, no odd things happen in the instance of reclaim being attempted
upon a file-backed guard region.

Hole Punching
~~~~~~~~~~~~~

This updates the page cache and ultimately invokes unmap_mapping_range(),
which explicitly leaves PTE markers in place.

Because the establishment of guard regions zapped any existing mappings to
file-backed folios, once the guard regions are removed then the
hole-punched region will be faulted in as usual and everything will behave
as expected.

Lorenzo Stoakes (4):
  mm: allow guard regions in file-backed and read-only mappings
  selftests/mm: rename guard-pages to guard-regions
  tools/selftests: expand all guard region tests to file-backed
  tools/selftests: add file/shmem-backed mapping guard region tests

 mm/madvise.c                                  |   8 +-
 tools/testing/selftests/mm/.gitignore         |   2 +-
 tools/testing/selftests/mm/Makefile           |   2 +-
 .../mm/{guard-pages.c => guard-regions.c}     | 921 ++++++++++++++++--
 4 files changed, 821 insertions(+), 112 deletions(-)
 rename tools/testing/selftests/mm/{guard-pages.c => guard-regions.c} (58%)

--
2.48.1

Comments

Vlastimil Babka Feb. 18, 2025, 12:12 p.m. UTC | #1
On 2/13/25 19:16, Lorenzo Stoakes wrote:
> The guard regions feature was initially implemented to support anonymous
> mappings only, excluding shmem.
> 
> This was done such as to introduce the feature carefully and incrementally
> and to be conservative when considering the various caveats and corner
> cases that are applicable to file-backed mappings but not to anonymous
> ones.
> 
> Now this feature has landed in 6.13, it is time to revisit this and to
> extend this functionality to file-backed and shmem mappings.
> 
> In order to make this maximally useful, and since one may map file-backed
> mappings read-only (for instance ELF images), we also remove the
> restriction on read-only mappings and permit the establishment of guard
> regions in any non-hugetlb, non-mlock()'d mapping.

Do you plan to address mlock later too? I guess somebody might find use for
those. Is there some fundamental issue or just that we need to define some
good semantics for corner cases? (i.e. if pages are already populated in the
mlocked area, discarding them by replacing with guard pte's goes against
that, so do we allow it or not?).

Otherwise nice discussion of all the potential issues here!
Lorenzo Stoakes Feb. 18, 2025, 1:05 p.m. UTC | #2
On Tue, Feb 18, 2025 at 01:12:05PM +0100, Vlastimil Babka wrote:
> On 2/13/25 19:16, Lorenzo Stoakes wrote:
> > The guard regions feature was initially implemented to support anonymous
> > mappings only, excluding shmem.
> >
> > This was done such as to introduce the feature carefully and incrementally
> > and to be conservative when considering the various caveats and corner
> > cases that are applicable to file-backed mappings but not to anonymous
> > ones.
> >
> > Now this feature has landed in 6.13, it is time to revisit this and to
> > extend this functionality to file-backed and shmem mappings.
> >
> > In order to make this maximally useful, and since one may map file-backed
> > mappings read-only (for instance ELF images), we also remove the
> > restriction on read-only mappings and permit the establishment of guard
> > regions in any non-hugetlb, non-mlock()'d mapping.
>
> Do you plan to address mlock later too? I guess somebody might find use for
> those. Is there some fundamental issue or just that we need to define some
> good semantics for corner cases? (i.e. if pages are already populated in the
> mlocked area, discarding them by replacing with guard pte's goes against
> that, so do we allow it or not?).

Yeah that's the fundamental issue with mlock, it does not interact with the
zapping part of MADV_GUARD_INSTALL, and that is why we disallow it (but not so
for MADV_GUARD_REMOVE, as if a VMA that contains guard regions is locked
_afterwards_ there will be no zapping).

We could potentially expose an equivalent, as there are for other flags, a
_LOCKED variant of the madvise() flag, like MADV_GUARD_INSTALL_LOCKED to make
this explicit.

That is probably the most sensible option, if there is a need for this!

>
> Otherwise nice discussion of all the potential issues here!
>

Thanks! :)
David Hildenbrand Feb. 18, 2025, 2:35 p.m. UTC | #3
On 18.02.25 14:05, Lorenzo Stoakes wrote:
> On Tue, Feb 18, 2025 at 01:12:05PM +0100, Vlastimil Babka wrote:
>> On 2/13/25 19:16, Lorenzo Stoakes wrote:
>>> The guard regions feature was initially implemented to support anonymous
>>> mappings only, excluding shmem.
>>>
>>> This was done such as to introduce the feature carefully and incrementally
>>> and to be conservative when considering the various caveats and corner
>>> cases that are applicable to file-backed mappings but not to anonymous
>>> ones.
>>>
>>> Now this feature has landed in 6.13, it is time to revisit this and to
>>> extend this functionality to file-backed and shmem mappings.
>>>
>>> In order to make this maximally useful, and since one may map file-backed
>>> mappings read-only (for instance ELF images), we also remove the
>>> restriction on read-only mappings and permit the establishment of guard
>>> regions in any non-hugetlb, non-mlock()'d mapping.
>>
>> Do you plan to address mlock later too? I guess somebody might find use for
>> those. Is there some fundamental issue or just that we need to define some
>> good semantics for corner cases? (i.e. if pages are already populated in the
>> mlocked area, discarding them by replacing with guard pte's goes against
>> that, so do we allow it or not?).
> 
> Yeah that's the fundamental issue with mlock, it does not interact with the
> zapping part of MADV_GUARD_INSTALL, and that is why we disallow it (but not so
> for MADV_GUARD_REMOVE, as if a VMA that contains guard regions is locked
> _afterwards_ there will be no zapping).
> 
> We could potentially expose an equivalent, as there are for other flags, a
> _LOCKED variant of the madvise() flag, like MADV_GUARD_INSTALL_LOCKED to make
> this explicit.
> 
> That is probably the most sensible option, if there is a need for this!

mlock is weird, because it assumes that memory will be faulted in in the whole VMA.

You'd likely have to populate + mlock the page when removing the guard.
Also not sure what happens if one does an mlock()/mlockall() after
already installing PTE markers.

__mm_populate() would skip whole VMAs in case populate_vma_page_range()
fails. And I would assume populate_vma_page_range() fails on the first
guard when it triggers a page fault.

OTOH, supporting the mlock-on-fault thingy should be easy. That's precisely where
MADV_DONTNEED_LOCKED originates from:

commit 9457056ac426e5ed0671356509c8dcce69f8dee0
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Thu Mar 24 18:14:12 2022 -0700

     mm: madvise: MADV_DONTNEED_LOCKED
     
     MADV_DONTNEED historically rejects mlocked ranges, but with MLOCK_ONFAULT
     and MCL_ONFAULT allowing to mlock without populating, there are valid use
     cases for depopulating locked ranges as well.


Adding support for that would be indeed nice.
Lorenzo Stoakes Feb. 18, 2025, 2:53 p.m. UTC | #4
On Tue, Feb 18, 2025 at 03:35:19PM +0100, David Hildenbrand wrote:
> On 18.02.25 14:05, Lorenzo Stoakes wrote:
> > On Tue, Feb 18, 2025 at 01:12:05PM +0100, Vlastimil Babka wrote:
> > > On 2/13/25 19:16, Lorenzo Stoakes wrote:
> > > > The guard regions feature was initially implemented to support anonymous
> > > > mappings only, excluding shmem.
> > > >
> > > > This was done such as to introduce the feature carefully and incrementally
> > > > and to be conservative when considering the various caveats and corner
> > > > cases that are applicable to file-backed mappings but not to anonymous
> > > > ones.
> > > >
> > > > Now this feature has landed in 6.13, it is time to revisit this and to
> > > > extend this functionality to file-backed and shmem mappings.
> > > >
> > > > In order to make this maximally useful, and since one may map file-backed
> > > > mappings read-only (for instance ELF images), we also remove the
> > > > restriction on read-only mappings and permit the establishment of guard
> > > > regions in any non-hugetlb, non-mlock()'d mapping.
> > >
> > > Do you plan to address mlock later too? I guess somebody might find use for
> > > those. Is there some fundamental issue or just that we need to define some
> > > good semantics for corner cases? (i.e. if pages are already populated in the
> > > mlocked area, discarding them by replacing with guard pte's goes against
> > > that, so do we allow it or not?).
> >
> > Yeah that's the fundamental issue with mlock, it does not interact with the
> > zapping part of MADV_GUARD_INSTALL, and that is why we disallow it (but not so
> > for MADV_GUARD_REMOVE, as if a VMA that contains guard regions is locked
> > _afterwards_ there will be no zapping).
> >
> > We could potentially expose an equivalent, as there are for other flags, a
> > _LOCKED variant of the madvise() flag, like MADV_GUARD_INSTALL_LOCKED to make
> > this explicit.
> >
> > That is probably the most sensible option, if there is a need for this!
>
> mlock is weird, because it assumes that memory will be faulted in in the whole VMA.
>
> You'd likely have to populate + mlock the page when removing the guard.

Right yeah that'd be super weird. And I don't want to add that logic.

> Also not sure what happens if one does an mlock()/mlockall() after
> already installing PTE markers.

The existing logic already handles non-present cases by skipping them, in
mlock_pte_range():

	for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
		ptent = ptep_get(pte);
		if (!pte_present(ptent))
			continue;

		...
	}

Which covers off guard regions. Removing the guard regions after this will
leave you in a weird situation where these entries will be zapped... maybe
we need a patch to make MADV_GUARD_REMOVE check VM_LOCKED and in this case
also populate?

Actually I think the simpler option is to just disallow MADV_GUARD_REMOVE
if you since locked the range? The code currently allows this on the
proviso that 'you aren't zapping locked mappings' but leaves the VMA in a
state such that some entries would not be locked.

It'd be pretty weird to lock guard regions like this.

Having said all that, given what you say below, maybe it's not an issue
after all?...

>
> __mm_populate() would skip whole VMAs in case populate_vma_page_range()
> fails. And I would assume populate_vma_page_range() fails on the first
> guard when it triggers a page fault.
>
> OTOH, supporting the mlock-on-fault thingy should be easy. That's precisely where
> MADV_DONTNEED_LOCKED originates from:
>
> commit 9457056ac426e5ed0671356509c8dcce69f8dee0
> Author: Johannes Weiner <hannes@cmpxchg.org>
> Date:   Thu Mar 24 18:14:12 2022 -0700
>
>     mm: madvise: MADV_DONTNEED_LOCKED
>     MADV_DONTNEED historically rejects mlocked ranges, but with MLOCK_ONFAULT
>     and MCL_ONFAULT allowing to mlock without populating, there are valid use
>     cases for depopulating locked ranges as well.

...Hm this seems to imply the current guard remove stuff isn't quite so
bad, so maybe the assumption that VM_LOCKED implies 'everything is
populated' isn't quite as stringent then.

The restriction is as simple as:

		if (behavior != MADV_DONTNEED_LOCKED)
			forbidden |= VM_LOCKED;

>
>
> Adding support for that would be indeed nice.

I mean it's sort of maybe understandable why you'd want to MADV_DONTNEED
locked ranges, but I really don't understand why you'd want to add guard
regions to mlock()'ed regions?

Then again we're currently asymmetric as you can add them _before_
mlock()'ing...

>
> --
> Cheers,
>
> David / dhildenb
>
David Hildenbrand Feb. 18, 2025, 3:20 p.m. UTC | #5
> Right yeah that'd be super weird. And I don't want to add that logic.
> 
>> Also not sure what happens if one does an mlock()/mlockall() after
>> already installing PTE markers.
> 
> The existing logic already handles non-present cases by skipping them, in
> mlock_pte_range():
> 
> 	for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
> 		ptent = ptep_get(pte);
> 		if (!pte_present(ptent))
> 			continue;
> 
> 		...
> 	}

I *think* that code only updates already-mapped folios, to properly call 
mlock_folio()/munlock_folio().

It is not the code that populates pages on mlock()/mlockall(). I think 
all that goes via mm_populate()/__mm_populate(), where "ordinary GUP" 
should apply.

See populate_vma_page_range(), especially also the VM_LOCKONFAULT handling.

> 
> Which covers off guard regions. Removing the guard regions after this will
> leave you in a weird situation where these entries will be zapped... maybe
> we need a patch to make MADV_GUARD_REMOVE check VM_LOCKED and in this case
> also populate?

Maybe? Or we say that it behaves like MADV_DONTNEED_LOCKED.

> 
> Actually I think the simpler option is to just disallow MADV_GUARD_REMOVE
> if you since locked the range? The code currently allows this on the
> proviso that 'you aren't zapping locked mappings' but leaves the VMA in a
> state such that some entries would not be locked.
> 
> It'd be pretty weird to lock guard regions like this.
> 
> Having said all that, given what you say below, maybe it's not an issue
> after all?...
> 
>>
>> __mm_populate() would skip whole VMAs in case populate_vma_page_range()
>> fails. And I would assume populate_vma_page_range() fails on the first
>> guard when it triggers a page fault.
>>
>> OTOH, supporting the mlock-on-fault thingy should be easy. That's precisely where
>> MADV_DONTNEED_LOCKED originates from:
>>
>> commit 9457056ac426e5ed0671356509c8dcce69f8dee0
>> Author: Johannes Weiner <hannes@cmpxchg.org>
>> Date:   Thu Mar 24 18:14:12 2022 -0700
>>
>>      mm: madvise: MADV_DONTNEED_LOCKED
>>      MADV_DONTNEED historically rejects mlocked ranges, but with MLOCK_ONFAULT
>>      and MCL_ONFAULT allowing to mlock without populating, there are valid use
>>      cases for depopulating locked ranges as well.
> 
> ...Hm this seems to imply the current guard remove stuff isn't quite so
> bad, so maybe the assumption that VM_LOCKED implies 'everything is
> populated' isn't quite as stringent then.

Right, with MCL_ONFAULT at least. Without MCL_ONFAULT, the assumption is 
that everything is populated (unless, apparently one uses 
MADV_DONTNEED_LOCKED or population failed, maybe).

VM_LOCKONFAULT seems to be the sane case. I wonder why 
MADV_DONTNEED_LOCKED didn't explicitly check for that one ... maybe 
there is a history to that.

> 
> The restriction is as simple as:
> 
> 		if (behavior != MADV_DONTNEED_LOCKED)
> 			forbidden |= VM_LOCKED;
> 
>>
>>
>> Adding support for that would be indeed nice.
> 
> I mean it's sort of maybe understandable why you'd want to MADV_DONTNEED
> locked ranges, but I really don't understand why you'd want to add guard
> regions to mlock()'ed regions?

Somme apps use mlockall(), and it might be nice to just be able to use 
guard pages as if "Nothing happened".

E.g., QEMU has the option to use mlockall().

> 
> Then again we're currently asymmetric as you can add them _before_
> mlock()'ing...

Right.
Lorenzo Stoakes Feb. 18, 2025, 4:43 p.m. UTC | #6
On Tue, Feb 18, 2025 at 04:20:18PM +0100, David Hildenbrand wrote:
> > Right yeah that'd be super weird. And I don't want to add that logic.
> >
> > > Also not sure what happens if one does an mlock()/mlockall() after
> > > already installing PTE markers.
> >
> > The existing logic already handles non-present cases by skipping them, in
> > mlock_pte_range():
> >
> > 	for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
> > 		ptent = ptep_get(pte);
> > 		if (!pte_present(ptent))
> > 			continue;
> >
> > 		...
> > 	}
>
> I *think* that code only updates already-mapped folios, to properly call
> mlock_folio()/munlock_folio().

Guard regions _are_ 'already mapped' :) so it leaves them in place.

do_mlock() -> apply_vma_lock_flags() -> mlock_fixup() -> mlock_vma_pages_range()
implies this will be invoked.

>
> It is not the code that populates pages on mlock()/mlockall(). I think all
> that goes via mm_populate()/__mm_populate(), where "ordinary GUP" should
> apply.

OK I want to correct what I said earlier.

Installing a guard region then attempting mlock() will result in an error. The
populate will -EFAULT and stop at the guard region, which causes mlock() to
error out.

This is a partial failure, so the VMA is split and has VM_LOCKED applied, but
the populate halts at the guard region.

This is ok as per previous discussion on aggregate operation failure, there can
be no expectation of 'unwinding' of partially successful operations that form
part of a requested aggregate one.

However, given there's stuff to clean up, and on error a user _may_ wish to then
remove guard regions and try again, I guess there's no harm in keeping the code
as it is where we allow MADV_GUARD_REMOVE even if VM_LOCKED is in place.

>
> See populate_vma_page_range(), especially also the VM_LOCKONFAULT handling.

Yeah that code is horrible, you just reminded me of it... 'rightly or wrongly'
yeah wrongly, very wrongly...

>
> >
> > Which covers off guard regions. Removing the guard regions after this will
> > leave you in a weird situation where these entries will be zapped... maybe
> > we need a patch to make MADV_GUARD_REMOVE check VM_LOCKED and in this case
> > also populate?
>
> Maybe? Or we say that it behaves like MADV_DONTNEED_LOCKED.

See above, no we should not :P this is only good for cleanup after mlock()
failure, although no sane program should really be trying to do this, a sane
program would give up here (and it's a _programmatic error_ to try to mlock() a
range with guard regions).

>
> >
> > Actually I think the simpler option is to just disallow MADV_GUARD_REMOVE
> > if you since locked the range? The code currently allows this on the
> > proviso that 'you aren't zapping locked mappings' but leaves the VMA in a
> > state such that some entries would not be locked.
> >
> > It'd be pretty weird to lock guard regions like this.
> >
> > Having said all that, given what you say below, maybe it's not an issue
> > after all?...
> >
> > >
> > > __mm_populate() would skip whole VMAs in case populate_vma_page_range()
> > > fails. And I would assume populate_vma_page_range() fails on the first
> > > guard when it triggers a page fault.
> > >
> > > OTOH, supporting the mlock-on-fault thingy should be easy. That's precisely where
> > > MADV_DONTNEED_LOCKED originates from:
> > >
> > > commit 9457056ac426e5ed0671356509c8dcce69f8dee0
> > > Author: Johannes Weiner <hannes@cmpxchg.org>
> > > Date:   Thu Mar 24 18:14:12 2022 -0700
> > >
> > >      mm: madvise: MADV_DONTNEED_LOCKED
> > >      MADV_DONTNEED historically rejects mlocked ranges, but with MLOCK_ONFAULT
> > >      and MCL_ONFAULT allowing to mlock without populating, there are valid use
> > >      cases for depopulating locked ranges as well.
> >
> > ...Hm this seems to imply the current guard remove stuff isn't quite so
> > bad, so maybe the assumption that VM_LOCKED implies 'everything is
> > populated' isn't quite as stringent then.
>
> Right, with MCL_ONFAULT at least. Without MCL_ONFAULT, the assumption is
> that everything is populated (unless, apparently one uses
> MADV_DONTNEED_LOCKED or population failed, maybe).
>
> VM_LOCKONFAULT seems to be the sane case. I wonder why MADV_DONTNEED_LOCKED
> didn't explicitly check for that one ... maybe there is a history to that.

Yeah weird.

>
> >
> > The restriction is as simple as:
> >
> > 		if (behavior != MADV_DONTNEED_LOCKED)
> > 			forbidden |= VM_LOCKED;
> >
> > >
> > >
> > > Adding support for that would be indeed nice.
> >
> > I mean it's sort of maybe understandable why you'd want to MADV_DONTNEED
> > locked ranges, but I really don't understand why you'd want to add guard
> > regions to mlock()'ed regions?
>
> Somme apps use mlockall(), and it might be nice to just be able to use guard
> pages as if "Nothing happened".

Sadly I think not given above :P

>
> E.g., QEMU has the option to use mlockall().
>
> >
> > Then again we're currently asymmetric as you can add them _before_
> > mlock()'ing...
>
> Right.
>
> --
> Cheers,
>
> David / dhildenb
>

I think the _LOCKED idea is therefore kaput, because it just won't work
properly because populating guard regions fails.

It fails because it tries to 'touch' the memory, but 'touching' guard
region memory causes a segfault. This kind of breaks the idea of
mlock()'ing guard regions.

I think adding workarounds to make this possible in any way is not really
worth it (and would probably be pretty gross).

We already document that 'mlock()ing lightweight guard regions will fail'
as per man page so this is all in line with that.
David Hildenbrand Feb. 18, 2025, 5:14 p.m. UTC | #7
On 18.02.25 17:43, Lorenzo Stoakes wrote:
> On Tue, Feb 18, 2025 at 04:20:18PM +0100, David Hildenbrand wrote:
>>> Right yeah that'd be super weird. And I don't want to add that logic.
>>>
>>>> Also not sure what happens if one does an mlock()/mlockall() after
>>>> already installing PTE markers.
>>>
>>> The existing logic already handles non-present cases by skipping them, in
>>> mlock_pte_range():
>>>
>>> 	for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
>>> 		ptent = ptep_get(pte);
>>> 		if (!pte_present(ptent))
>>> 			continue;
>>>
>>> 		...
>>> 	}
>>
>> I *think* that code only updates already-mapped folios, to properly call
>> mlock_folio()/munlock_folio().
> 
> Guard regions _are_ 'already mapped' :) so it leaves them in place.

"mapped folios" -- there is no folio mapped. Yes, the VMA is in place.

> 
> do_mlock() -> apply_vma_lock_flags() -> mlock_fixup() -> mlock_vma_pages_range()
> implies this will be invoked.

Yes, to process any already mapped folios, to then continue population 
later.

> 
>>
>> It is not the code that populates pages on mlock()/mlockall(). I think all
>> that goes via mm_populate()/__mm_populate(), where "ordinary GUP" should
>> apply.
> 
> OK I want to correct what I said earlier.
> 
> Installing a guard region then attempting mlock() will result in an error. The
> populate will -EFAULT and stop at the guard region, which causes mlock() to
> error out.

Right, that's my expectation.

> 
> This is a partial failure, so the VMA is split and has VM_LOCKED applied, but
> the populate halts at the guard region.
> 
> This is ok as per previous discussion on aggregate operation failure, there can
> be no expectation of 'unwinding' of partially successful operations that form
> part of a requested aggregate one.
> 
> However, given there's stuff to clean up, and on error a user _may_ wish to then
> remove guard regions and try again, I guess there's no harm in keeping the code
> as it is where we allow MADV_GUARD_REMOVE even if VM_LOCKED is in place.

Likely yes; it's all weird code.

> 
>>
>> See populate_vma_page_range(), especially also the VM_LOCKONFAULT handling.
> 
> Yeah that code is horrible, you just reminded me of it... 'rightly or wrongly'
> yeah wrongly, very wrongly...
> 
>>
>>>
>>> Which covers off guard regions. Removing the guard regions after this will
>>> leave you in a weird situation where these entries will be zapped... maybe
>>> we need a patch to make MADV_GUARD_REMOVE check VM_LOCKED and in this case
>>> also populate?
>>
>> Maybe? Or we say that it behaves like MADV_DONTNEED_LOCKED.
> 
> See above, no we should not :P this is only good for cleanup after mlock()
> failure, although no sane program should really be trying to do this, a sane
> program would give up here (and it's a _programmatic error_ to try to mlock() a
> range with guard regions).
 >>>> Somme apps use mlockall(), and it might be nice to just be able to 
use guard
>> pages as if "Nothing happened".
> 
> Sadly I think not given above :P

QEMU, for example, will issue an mlockall(MCL_CURRENT | MCL_FUTURE); 
when requested to then exit(); if it fails.

Assume glibc or any lib uses it, QEMU would have no real way of figuring 
that out or instructing offending libraries to disabled that, at least 
for now  ...

... turning RT VMs less usable if any library uses guard regions. :(

There is upcoming support for MCL_ONFAULT in QEMU [1] (see below).

[1] https://lkml.kernel.org/r/20250212173823.214429-3-peterx@redhat.com

> 
>>
>> E.g., QEMU has the option to use mlockall().
>>
>>>
>>> Then again we're currently asymmetric as you can add them _before_
>>> mlock()'ing...
>>
>> Right.
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>>
> 
> I think the _LOCKED idea is therefore kaput, because it just won't work
> properly because populating guard regions fails.

Right, I think basic VM_LOCKED is out of the picture. VM_LOCKONFAULT 
might be interesting, because we are skipping the population stage.

> 
> It fails because it tries to 'touch' the memory, but 'touching' guard
> region memory causes a segfault. This kind of breaks the idea of
> mlock()'ing guard regions.
> 
> I think adding workarounds to make this possible in any way is not really
> worth it (and would probably be pretty gross).
> 
> We already document that 'mlock()ing lightweight guard regions will fail'
> as per man page so this is all in line with that.

Right, and I claim that supporting VM_LOCKONFAULT might likely be as 
easy as allowing install/remove of guard regions when that flag is set.
Lorenzo Stoakes Feb. 18, 2025, 5:20 p.m. UTC | #8
On Tue, Feb 18, 2025 at 06:14:00PM +0100, David Hildenbrand wrote:
> On 18.02.25 17:43, Lorenzo Stoakes wrote:
> > On Tue, Feb 18, 2025 at 04:20:18PM +0100, David Hildenbrand wrote:
> > > > Right yeah that'd be super weird. And I don't want to add that logic.
> > > >
> > > > > Also not sure what happens if one does an mlock()/mlockall() after
> > > > > already installing PTE markers.
> > > >
> > > > The existing logic already handles non-present cases by skipping them, in
> > > > mlock_pte_range():
> > > >
> > > > 	for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
> > > > 		ptent = ptep_get(pte);
> > > > 		if (!pte_present(ptent))
> > > > 			continue;
> > > >
> > > > 		...
> > > > 	}
> > >
> > > I *think* that code only updates already-mapped folios, to properly call
> > > mlock_folio()/munlock_folio().
> >
> > Guard regions _are_ 'already mapped' :) so it leaves them in place.
>
> "mapped folios" -- there is no folio mapped. Yes, the VMA is in place.

We're engaging in a moot discussion on this I think but I mean it appears
to operate by walking page tables if they are populated, which they will be
for guard regions, but when it finds it's non-present it will skip.

This amounts to the same thing as not doing anything, obviously.

>
> >
> > do_mlock() -> apply_vma_lock_flags() -> mlock_fixup() -> mlock_vma_pages_range()
> > implies this will be invoked.
>
> Yes, to process any already mapped folios, to then continue population
> later.
>
> >
> > >
> > > It is not the code that populates pages on mlock()/mlockall(). I think all
> > > that goes via mm_populate()/__mm_populate(), where "ordinary GUP" should
> > > apply.
> >
> > OK I want to correct what I said earlier.
> >
> > Installing a guard region then attempting mlock() will result in an error. The
> > populate will -EFAULT and stop at the guard region, which causes mlock() to
> > error out.
>
> Right, that's my expectation.

OK good!

>
> >
> > This is a partial failure, so the VMA is split and has VM_LOCKED applied, but
> > the populate halts at the guard region.
> >
> > This is ok as per previous discussion on aggregate operation failure, there can
> > be no expectation of 'unwinding' of partially successful operations that form
> > part of a requested aggregate one.
> >
> > However, given there's stuff to clean up, and on error a user _may_ wish to then
> > remove guard regions and try again, I guess there's no harm in keeping the code
> > as it is where we allow MADV_GUARD_REMOVE even if VM_LOCKED is in place.
>
> Likely yes; it's all weird code.
>
> >
> > >
> > > See populate_vma_page_range(), especially also the VM_LOCKONFAULT handling.
> >
> > Yeah that code is horrible, you just reminded me of it... 'rightly or wrongly'
> > yeah wrongly, very wrongly...
> >
> > >
> > > >
> > > > Which covers off guard regions. Removing the guard regions after this will
> > > > leave you in a weird situation where these entries will be zapped... maybe
> > > > we need a patch to make MADV_GUARD_REMOVE check VM_LOCKED and in this case
> > > > also populate?
> > >
> > > Maybe? Or we say that it behaves like MADV_DONTNEED_LOCKED.
> >
> > See above, no we should not :P this is only good for cleanup after mlock()
> > failure, although no sane program should really be trying to do this, a sane
> > program would give up here (and it's a _programmatic error_ to try to mlock() a
> > range with guard regions).
> >>>> Somme apps use mlockall(), and it might be nice to just be able to use
> guard
> > > pages as if "Nothing happened".
> >
> > Sadly I think not given above :P
>
> QEMU, for example, will issue an mlockall(MCL_CURRENT | MCL_FUTURE); when
> requested to then exit(); if it fails.

Hm under what circumstances? I use qemu extensively to test this stuff with
no issues. Unless you mean it's using it in the 'host' code somehow.

>
> Assume glibc or any lib uses it, QEMU would have no real way of figuring
> that out or instructing offending libraries to disabled that, at least for
> now  ...
>
> ... turning RT VMs less usable if any library uses guard regions. :(
>

This seems really stupid, to be honest. Unfortunately there's no way around
this, if software does stupid things then they get stupid prizes. There are
other ways mlock() and faulting in can fail too.

> There is upcoming support for MCL_ONFAULT in QEMU [1] (see below).

Good.

>
> [1] https://lkml.kernel.org/r/20250212173823.214429-3-peterx@redhat.com
>
> >
> > >
> > > E.g., QEMU has the option to use mlockall().
> > >
> > > >
> > > > Then again we're currently asymmetric as you can add them _before_
> > > > mlock()'ing...
> > >
> > > Right.
> > >
> > > --
> > > Cheers,
> > >
> > > David / dhildenb
> > >
> >
> > I think the _LOCKED idea is therefore kaput, because it just won't work
> > properly because populating guard regions fails.
>
> Right, I think basic VM_LOCKED is out of the picture. VM_LOCKONFAULT might
> be interesting, because we are skipping the population stage.
>
> >
> > It fails because it tries to 'touch' the memory, but 'touching' guard
> > region memory causes a segfault. This kind of breaks the idea of
> > mlock()'ing guard regions.
> >
> > I think adding workarounds to make this possible in any way is not really
> > worth it (and would probably be pretty gross).
> >
> > We already document that 'mlock()ing lightweight guard regions will fail'
> > as per man page so this is all in line with that.
>
> Right, and I claim that supporting VM_LOCKONFAULT might likely be as easy as
> allowing install/remove of guard regions when that flag is set.

We already allow this flag! VM_LOCKED and VM_HUGETLB are the only flags we
disallow.

>
> --
> Cheers,
>
> David / dhildenb
>
David Hildenbrand Feb. 18, 2025, 5:25 p.m. UTC | #9
>>
>> QEMU, for example, will issue an mlockall(MCL_CURRENT | MCL_FUTURE); when
>> requested to then exit(); if it fails.
> 
> Hm under what circumstances? I use qemu extensively to test this stuff with
> no issues. Unless you mean it's using it in the 'host' code somehow.


-overcommit mem-lock=on

or (legacy)

-realtime mlock=on

I think.

[...]

>>>
>>> It fails because it tries to 'touch' the memory, but 'touching' guard
>>> region memory causes a segfault. This kind of breaks the idea of
>>> mlock()'ing guard regions.
>>>
>>> I think adding workarounds to make this possible in any way is not really
>>> worth it (and would probably be pretty gross).
>>>
>>> We already document that 'mlock()ing lightweight guard regions will fail'
>>> as per man page so this is all in line with that.
>>
>> Right, and I claim that supporting VM_LOCKONFAULT might likely be as easy as
>> allowing install/remove of guard regions when that flag is set.
> 
> We already allow this flag! VM_LOCKED and VM_HUGETLB are the only flags we
> disallow.


See mlock2();

SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
{
	vm_flags_t vm_flags = VM_LOCKED;

	if (flags & ~MLOCK_ONFAULT)
		return -EINVAL;

	if (flags & MLOCK_ONFAULT)
		vm_flags |= VM_LOCKONFAULT;

	return do_mlock(start, len, vm_flags);
}


VM_LOCKONFAULT always as VM_LOCKED set as well.
Lorenzo Stoakes Feb. 18, 2025, 5:28 p.m. UTC | #10
On Tue, Feb 18, 2025 at 06:25:35PM +0100, David Hildenbrand wrote:
> > >
> > > QEMU, for example, will issue an mlockall(MCL_CURRENT | MCL_FUTURE); when
> > > requested to then exit(); if it fails.
> >
> > Hm under what circumstances? I use qemu extensively to test this stuff with
> > no issues. Unless you mean it's using it in the 'host' code somehow.
>
>
> -overcommit mem-lock=on
>
> or (legacy)
>
> -realtime mlock=on
>
> I think.
>
> [...]

Thanks

>
> > > >
> > > > It fails because it tries to 'touch' the memory, but 'touching' guard
> > > > region memory causes a segfault. This kind of breaks the idea of
> > > > mlock()'ing guard regions.
> > > >
> > > > I think adding workarounds to make this possible in any way is not really
> > > > worth it (and would probably be pretty gross).
> > > >
> > > > We already document that 'mlock()ing lightweight guard regions will fail'
> > > > as per man page so this is all in line with that.
> > >
> > > Right, and I claim that supporting VM_LOCKONFAULT might likely be as easy as
> > > allowing install/remove of guard regions when that flag is set.
> >
> > We already allow this flag! VM_LOCKED and VM_HUGETLB are the only flags we
> > disallow.
>
>
> See mlock2();
>
> SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
> {
> 	vm_flags_t vm_flags = VM_LOCKED;
>
> 	if (flags & ~MLOCK_ONFAULT)
> 		return -EINVAL;
>
> 	if (flags & MLOCK_ONFAULT)
> 		vm_flags |= VM_LOCKONFAULT;
>
> 	return do_mlock(start, len, vm_flags);
> }
>
>
> VM_LOCKONFAULT always as VM_LOCKED set as well.

OK cool, that makes sense.

As with much kernel stuff, I knew this in the past. Then I forgot. Then I knew
again, then... :P if only somebody would write it down in a book...

Yeah then that makes sense to check explicitly for (VM_LOCKED | VM_LOCKONFAULT)
in any MADV_GUARD_INSTALL_LOCKED variant as obviously this would be passively
excluded right now.

>
> --
> Cheers,
>
> David / dhildenb
>
David Hildenbrand Feb. 18, 2025, 5:31 p.m. UTC | #11
>> See mlock2();
>>
>> SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
>> {
>> 	vm_flags_t vm_flags = VM_LOCKED;
>>
>> 	if (flags & ~MLOCK_ONFAULT)
>> 		return -EINVAL;
>>
>> 	if (flags & MLOCK_ONFAULT)
>> 		vm_flags |= VM_LOCKONFAULT;
>>
>> 	return do_mlock(start, len, vm_flags);
>> }
>>
>>
>> VM_LOCKONFAULT always as VM_LOCKED set as well.
> 
> OK cool, that makes sense.
> 
> As with much kernel stuff, I knew this in the past. Then I forgot. Then I knew
> again, then... :P if only somebody would write it down in a book...

And if there would be such a book in the makings, I would consider 
hurrying up and preordering it ;)
Kalesh Singh Feb. 19, 2025, 8:25 a.m. UTC | #12
On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> The guard regions feature was initially implemented to support anonymous
> mappings only, excluding shmem.
>
> This was done such as to introduce the feature carefully and incrementally
> and to be conservative when considering the various caveats and corner
> cases that are applicable to file-backed mappings but not to anonymous
> ones.
>
> Now this feature has landed in 6.13, it is time to revisit this and to
> extend this functionality to file-backed and shmem mappings.
>
> In order to make this maximally useful, and since one may map file-backed
> mappings read-only (for instance ELF images), we also remove the
> restriction on read-only mappings and permit the establishment of guard
> regions in any non-hugetlb, non-mlock()'d mapping.

Hi Lorenzo,

Thank you for your work on this.

Have we thought about how guard regions are represented in /proc/*/[s]maps?

In the field, I've found that many applications read the ranges from
/proc/self/[s]maps to determine what they can access (usually related
to obfuscation techniques). If they don't know of the guard regions it
would cause them to crash; I think that we'll need similar entries to
PROT_NONE (---p) for these, and generally to maintain consistency
between the behavior and what is being said from /proc/*/[s]maps.

-- Kalesh

>
> It is permissible to permit the establishment of guard regions in read-only
> mappings because the guard regions only reduce access to the mapping, and
> when removed simply reinstate the existing attributes of the underlying
> VMA, meaning no access violations can occur.
>
> While the change in kernel code introduced in this series is small, the
> majority of the effort here is spent in extending the testing to assert
> that the feature works correctly across numerous file-backed mapping
> scenarios.
>
> Every single guard region self-test performed against anonymous memory
> (which is relevant and not anon-only) has now been updated to also be
> performed against shmem and a mapping of a file in the working directory.
>
> This confirms that all cases also function correctly for file-backed guard
> regions.
>
> In addition a number of other tests are added for specific file-backed
> mapping scenarios.
>
> There are a number of other concerns that one might have with regard to
> guard regions, addressed below:
>
> Readahead
> ~~~~~~~~~
>
> Readahead is a process through which the page cache is populated on the
> assumption that sequential reads will occur, thus amortising I/O and,
> through a clever use of the PG_readahead folio flag establishing during
> major fault and checked upon minor fault, provides for asynchronous I/O to
> occur as dat is processed, reducing I/O stalls as data is faulted in.
>
> Guard regions do not alter this mechanism which operations at the folio and
> fault level, but do of course prevent the faulting of folios that would
> otherwise be mapped.
>
> In the instance of a major fault prior to a guard region, synchronous
> readahead will occur including populating folios in the page cache which
> the guard regions will, in the case of the mapping in question, prevent
> access to.
>
> In addition, if PG_readahead is placed in a folio that is now inaccessible,
> this will prevent asynchronous readahead from occurring as it would
> otherwise do.
>
> However, there are mechanisms for heuristically resetting this within
> readahead regardless, which will 'recover' correct readahead behaviour.
>
> Readahead presumes sequential data access, the presence of a guard region
> clearly indicates that, at least in the guard region, no such sequential
> access will occur, as it cannot occur there.
>
> So this should have very little impact on any real workload. The far more
> important point is as to whether readahead causes incorrect or
> inappropriate mapping of ranges disallowed by the presence of guard
> regions - this is not the case, as readahead does not 'pre-fault' memory in
> this fashion.
>
> At any rate, any mechanism which would attempt to do so would hit the usual
> page fault paths, which correctly handle PTE markers as with anonymous
> mappings.
>
> Fault-Around
> ~~~~~~~~~~~~
>
> The fault-around logic, in a similar vein to readahead, attempts to improve
> efficiency with regard to file-backed memory mappings, however it differs
> in that it does not try to fetch folios into the page cache that are about
> to be accessed, but rather pre-maps a range of folios around the faulting
> address.
>
> Guard regions making use of PTE markers makes this relatively trivial, as
> this case is already handled - see filemap_map_folio_range() and
> filemap_map_order0_folio() - in both instances, the solution is to simply
> keep the established page table mappings and let the fault handler take
> care of PTE markers, as per the comment:
>
>         /*
>          * NOTE: If there're PTE markers, we'll leave them to be
>          * handled in the specific fault path, and it'll prohibit
>          * the fault-around logic.
>          */
>
> This works, as establishing guard regions results in page table mappings
> with PTE markers, and clearing them removes them.
>
> Truncation
> ~~~~~~~~~~
>
> File truncation will not eliminate existing guard regions, as the
> truncation operation will ultimately zap the range via
> unmap_mapping_range(), which specifically excludes PTE markers.
>
> Zapping
> ~~~~~~~
>
> Zapping is, as with anonymous mappings, handled by zap_nonpresent_ptes(),
> which specifically deals with guard entries, leaving them intact except in
> instances such as process teardown or munmap() where they need to be
> removed.
>
> Reclaim
> ~~~~~~~
>
> When reclaim is performed on file-backed folios, it ultimately invokes
> try_to_unmap_one() via the rmap. If the folio is non-large, then map_pte()
> will ultimately abort the operation for the guard region mapping. If large,
> then check_pte() will determine that this is a non-device private
> entry/device-exclusive entry 'swap' PTE and thus abort the operation in
> that instance.
>
> Therefore, no odd things happen in the instance of reclaim being attempted
> upon a file-backed guard region.
>
> Hole Punching
> ~~~~~~~~~~~~~
>
> This updates the page cache and ultimately invokes unmap_mapping_range(),
> which explicitly leaves PTE markers in place.
>
> Because the establishment of guard regions zapped any existing mappings to
> file-backed folios, once the guard regions are removed then the
> hole-punched region will be faulted in as usual and everything will behave
> as expected.
>
> Lorenzo Stoakes (4):
>   mm: allow guard regions in file-backed and read-only mappings
>   selftests/mm: rename guard-pages to guard-regions
>   tools/selftests: expand all guard region tests to file-backed
>   tools/selftests: add file/shmem-backed mapping guard region tests
>
>  mm/madvise.c                                  |   8 +-
>  tools/testing/selftests/mm/.gitignore         |   2 +-
>  tools/testing/selftests/mm/Makefile           |   2 +-
>  .../mm/{guard-pages.c => guard-regions.c}     | 921 ++++++++++++++++--
>  4 files changed, 821 insertions(+), 112 deletions(-)
>  rename tools/testing/selftests/mm/{guard-pages.c => guard-regions.c} (58%)
>
> --
> 2.48.1
Kalesh Singh Feb. 19, 2025, 8:35 a.m. UTC | #13
On Wed, Feb 19, 2025 at 12:25 AM Kalesh Singh <kaleshsingh@google.com> wrote:
>
> On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > The guard regions feature was initially implemented to support anonymous
> > mappings only, excluding shmem.
> >
> > This was done such as to introduce the feature carefully and incrementally
> > and to be conservative when considering the various caveats and corner
> > cases that are applicable to file-backed mappings but not to anonymous
> > ones.
> >
> > Now this feature has landed in 6.13, it is time to revisit this and to
> > extend this functionality to file-backed and shmem mappings.
> >
> > In order to make this maximally useful, and since one may map file-backed
> > mappings read-only (for instance ELF images), we also remove the
> > restriction on read-only mappings and permit the establishment of guard
> > regions in any non-hugetlb, non-mlock()'d mapping.
>
> Hi Lorenzo,
>
> Thank you for your work on this.
>
> Have we thought about how guard regions are represented in /proc/*/[s]maps?
>
> In the field, I've found that many applications read the ranges from
> /proc/self/[s]maps to determine what they can access (usually related
> to obfuscation techniques). If they don't know of the guard regions it
> would cause them to crash; I think that we'll need similar entries to
> PROT_NONE (---p) for these, and generally to maintain consistency
> between the behavior and what is being said from /proc/*/[s]maps.

To clarify why the applications may not be aware of their guard
regions -- in the case of the ELF mappings these PROT_NONE (guard
regions) would be installed by the dynamic loader; or may be inherited
from the parent (zygote in Android's case).

>
> -- Kalesh
>
> >
> > It is permissible to permit the establishment of guard regions in read-only
> > mappings because the guard regions only reduce access to the mapping, and
> > when removed simply reinstate the existing attributes of the underlying
> > VMA, meaning no access violations can occur.
> >
> > While the change in kernel code introduced in this series is small, the
> > majority of the effort here is spent in extending the testing to assert
> > that the feature works correctly across numerous file-backed mapping
> > scenarios.
> >
> > Every single guard region self-test performed against anonymous memory
> > (which is relevant and not anon-only) has now been updated to also be
> > performed against shmem and a mapping of a file in the working directory.
> >
> > This confirms that all cases also function correctly for file-backed guard
> > regions.
> >
> > In addition a number of other tests are added for specific file-backed
> > mapping scenarios.
> >
> > There are a number of other concerns that one might have with regard to
> > guard regions, addressed below:
> >
> > Readahead
> > ~~~~~~~~~
> >
> > Readahead is a process through which the page cache is populated on the
> > assumption that sequential reads will occur, thus amortising I/O and,
> > through a clever use of the PG_readahead folio flag establishing during
> > major fault and checked upon minor fault, provides for asynchronous I/O to
> > occur as dat is processed, reducing I/O stalls as data is faulted in.
> >
> > Guard regions do not alter this mechanism which operations at the folio and
> > fault level, but do of course prevent the faulting of folios that would
> > otherwise be mapped.
> >
> > In the instance of a major fault prior to a guard region, synchronous
> > readahead will occur including populating folios in the page cache which
> > the guard regions will, in the case of the mapping in question, prevent
> > access to.
> >
> > In addition, if PG_readahead is placed in a folio that is now inaccessible,
> > this will prevent asynchronous readahead from occurring as it would
> > otherwise do.
> >
> > However, there are mechanisms for heuristically resetting this within
> > readahead regardless, which will 'recover' correct readahead behaviour.
> >
> > Readahead presumes sequential data access, the presence of a guard region
> > clearly indicates that, at least in the guard region, no such sequential
> > access will occur, as it cannot occur there.
> >
> > So this should have very little impact on any real workload. The far more
> > important point is as to whether readahead causes incorrect or
> > inappropriate mapping of ranges disallowed by the presence of guard
> > regions - this is not the case, as readahead does not 'pre-fault' memory in
> > this fashion.
> >
> > At any rate, any mechanism which would attempt to do so would hit the usual
> > page fault paths, which correctly handle PTE markers as with anonymous
> > mappings.
> >
> > Fault-Around
> > ~~~~~~~~~~~~
> >
> > The fault-around logic, in a similar vein to readahead, attempts to improve
> > efficiency with regard to file-backed memory mappings, however it differs
> > in that it does not try to fetch folios into the page cache that are about
> > to be accessed, but rather pre-maps a range of folios around the faulting
> > address.
> >
> > Guard regions making use of PTE markers makes this relatively trivial, as
> > this case is already handled - see filemap_map_folio_range() and
> > filemap_map_order0_folio() - in both instances, the solution is to simply
> > keep the established page table mappings and let the fault handler take
> > care of PTE markers, as per the comment:
> >
> >         /*
> >          * NOTE: If there're PTE markers, we'll leave them to be
> >          * handled in the specific fault path, and it'll prohibit
> >          * the fault-around logic.
> >          */
> >
> > This works, as establishing guard regions results in page table mappings
> > with PTE markers, and clearing them removes them.
> >
> > Truncation
> > ~~~~~~~~~~
> >
> > File truncation will not eliminate existing guard regions, as the
> > truncation operation will ultimately zap the range via
> > unmap_mapping_range(), which specifically excludes PTE markers.
> >
> > Zapping
> > ~~~~~~~
> >
> > Zapping is, as with anonymous mappings, handled by zap_nonpresent_ptes(),
> > which specifically deals with guard entries, leaving them intact except in
> > instances such as process teardown or munmap() where they need to be
> > removed.
> >
> > Reclaim
> > ~~~~~~~
> >
> > When reclaim is performed on file-backed folios, it ultimately invokes
> > try_to_unmap_one() via the rmap. If the folio is non-large, then map_pte()
> > will ultimately abort the operation for the guard region mapping. If large,
> > then check_pte() will determine that this is a non-device private
> > entry/device-exclusive entry 'swap' PTE and thus abort the operation in
> > that instance.
> >
> > Therefore, no odd things happen in the instance of reclaim being attempted
> > upon a file-backed guard region.
> >
> > Hole Punching
> > ~~~~~~~~~~~~~
> >
> > This updates the page cache and ultimately invokes unmap_mapping_range(),
> > which explicitly leaves PTE markers in place.
> >
> > Because the establishment of guard regions zapped any existing mappings to
> > file-backed folios, once the guard regions are removed then the
> > hole-punched region will be faulted in as usual and everything will behave
> > as expected.
> >
> > Lorenzo Stoakes (4):
> >   mm: allow guard regions in file-backed and read-only mappings
> >   selftests/mm: rename guard-pages to guard-regions
> >   tools/selftests: expand all guard region tests to file-backed
> >   tools/selftests: add file/shmem-backed mapping guard region tests
> >
> >  mm/madvise.c                                  |   8 +-
> >  tools/testing/selftests/mm/.gitignore         |   2 +-
> >  tools/testing/selftests/mm/Makefile           |   2 +-
> >  .../mm/{guard-pages.c => guard-regions.c}     | 921 ++++++++++++++++--
> >  4 files changed, 821 insertions(+), 112 deletions(-)
> >  rename tools/testing/selftests/mm/{guard-pages.c => guard-regions.c} (58%)
> >
> > --
> > 2.48.1
Lorenzo Stoakes Feb. 19, 2025, 9:03 a.m. UTC | #14
On Wed, Feb 19, 2025 at 12:25:51AM -0800, Kalesh Singh wrote:
> On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > The guard regions feature was initially implemented to support anonymous
> > mappings only, excluding shmem.
> >
> > This was done such as to introduce the feature carefully and incrementally
> > and to be conservative when considering the various caveats and corner
> > cases that are applicable to file-backed mappings but not to anonymous
> > ones.
> >
> > Now this feature has landed in 6.13, it is time to revisit this and to
> > extend this functionality to file-backed and shmem mappings.
> >
> > In order to make this maximally useful, and since one may map file-backed
> > mappings read-only (for instance ELF images), we also remove the
> > restriction on read-only mappings and permit the establishment of guard
> > regions in any non-hugetlb, non-mlock()'d mapping.
>
> Hi Lorenzo,
>
> Thank you for your work on this.

You're welcome.

>
> Have we thought about how guard regions are represented in /proc/*/[s]maps?

This is off-topic here but... Yes, extensively. No they do not appear
there.

I thought you had attended LPC and my talk where I mentioned this
purposefully as a drawback?

I went out of my way to advertise this limitation at the LPC talk, in the
original series, etc. so it's a little disappointing that this is being
brought up so late, but nobody else has raised objections to this issue so
I think in general it's not a limitation that matters in practice.

>
> In the field, I've found that many applications read the ranges from
> /proc/self/[s]maps to determine what they can access (usually related
> to obfuscation techniques). If they don't know of the guard regions it
> would cause them to crash; I think that we'll need similar entries to
> PROT_NONE (---p) for these, and generally to maintain consistency
> between the behavior and what is being said from /proc/*/[s]maps.

No, we cannot have these, sorry.

Firstly /proc/$pid/[s]maps describes VMAs. The entire purpose of this
feature is to avoid having to accumulate VMAs for regions which are not
intended to be accessible.

Secondly, there is no practical means for this to be accomplished in
/proc/$pid/maps in _any_ way - as no metadata relating to a VMA indicates
they have guard regions.

This is intentional, because setting such metadata is simply not practical
- why? Because when you try to split the VMA, how do you know which bit
gets the metadata and which doesn't? You can't without _reading page
tables_.

/proc/$pid/smaps _does_ read page tables, but we can't start pretending
VMAs exist when they don't, this would be completely inaccurate, would
break assumptions for things like mremap (which require a single VMA) and
would be unworkable.

The best that _could_ be achieved is to have a marker in /proc/$pid/smaps
saying 'hey this region has guard regions somewhere'.

But I haven't seen any demand for this and presumably this wouldn't help
your imagined program?

I don't really understand your use case though, what programs would read
/proc/maps, then... try to use /proc/$pid/mem or whatnot to arbitrarily
read regions? Such applications would be in danger of SIGBUS in any case if
they were to read invalid portions of file-backed mappings, and have no way
of knowing this, so they seem fundamentally broken as it is?

>
> -- Kalesh
>
> >
> > It is permissible to permit the establishment of guard regions in read-only
> > mappings because the guard regions only reduce access to the mapping, and
> > when removed simply reinstate the existing attributes of the underlying
> > VMA, meaning no access violations can occur.
> >
> > While the change in kernel code introduced in this series is small, the
> > majority of the effort here is spent in extending the testing to assert
> > that the feature works correctly across numerous file-backed mapping
> > scenarios.
> >
> > Every single guard region self-test performed against anonymous memory
> > (which is relevant and not anon-only) has now been updated to also be
> > performed against shmem and a mapping of a file in the working directory.
> >
> > This confirms that all cases also function correctly for file-backed guard
> > regions.
> >
> > In addition a number of other tests are added for specific file-backed
> > mapping scenarios.
> >
> > There are a number of other concerns that one might have with regard to
> > guard regions, addressed below:
> >
> > Readahead
> > ~~~~~~~~~
> >
> > Readahead is a process through which the page cache is populated on the
> > assumption that sequential reads will occur, thus amortising I/O and,
> > through a clever use of the PG_readahead folio flag establishing during
> > major fault and checked upon minor fault, provides for asynchronous I/O to
> > occur as dat is processed, reducing I/O stalls as data is faulted in.
> >
> > Guard regions do not alter this mechanism which operations at the folio and
> > fault level, but do of course prevent the faulting of folios that would
> > otherwise be mapped.
> >
> > In the instance of a major fault prior to a guard region, synchronous
> > readahead will occur including populating folios in the page cache which
> > the guard regions will, in the case of the mapping in question, prevent
> > access to.
> >
> > In addition, if PG_readahead is placed in a folio that is now inaccessible,
> > this will prevent asynchronous readahead from occurring as it would
> > otherwise do.
> >
> > However, there are mechanisms for heuristically resetting this within
> > readahead regardless, which will 'recover' correct readahead behaviour.
> >
> > Readahead presumes sequential data access, the presence of a guard region
> > clearly indicates that, at least in the guard region, no such sequential
> > access will occur, as it cannot occur there.
> >
> > So this should have very little impact on any real workload. The far more
> > important point is as to whether readahead causes incorrect or
> > inappropriate mapping of ranges disallowed by the presence of guard
> > regions - this is not the case, as readahead does not 'pre-fault' memory in
> > this fashion.
> >
> > At any rate, any mechanism which would attempt to do so would hit the usual
> > page fault paths, which correctly handle PTE markers as with anonymous
> > mappings.
> >
> > Fault-Around
> > ~~~~~~~~~~~~
> >
> > The fault-around logic, in a similar vein to readahead, attempts to improve
> > efficiency with regard to file-backed memory mappings, however it differs
> > in that it does not try to fetch folios into the page cache that are about
> > to be accessed, but rather pre-maps a range of folios around the faulting
> > address.
> >
> > Guard regions making use of PTE markers makes this relatively trivial, as
> > this case is already handled - see filemap_map_folio_range() and
> > filemap_map_order0_folio() - in both instances, the solution is to simply
> > keep the established page table mappings and let the fault handler take
> > care of PTE markers, as per the comment:
> >
> >         /*
> >          * NOTE: If there're PTE markers, we'll leave them to be
> >          * handled in the specific fault path, and it'll prohibit
> >          * the fault-around logic.
> >          */
> >
> > This works, as establishing guard regions results in page table mappings
> > with PTE markers, and clearing them removes them.
> >
> > Truncation
> > ~~~~~~~~~~
> >
> > File truncation will not eliminate existing guard regions, as the
> > truncation operation will ultimately zap the range via
> > unmap_mapping_range(), which specifically excludes PTE markers.
> >
> > Zapping
> > ~~~~~~~
> >
> > Zapping is, as with anonymous mappings, handled by zap_nonpresent_ptes(),
> > which specifically deals with guard entries, leaving them intact except in
> > instances such as process teardown or munmap() where they need to be
> > removed.
> >
> > Reclaim
> > ~~~~~~~
> >
> > When reclaim is performed on file-backed folios, it ultimately invokes
> > try_to_unmap_one() via the rmap. If the folio is non-large, then map_pte()
> > will ultimately abort the operation for the guard region mapping. If large,
> > then check_pte() will determine that this is a non-device private
> > entry/device-exclusive entry 'swap' PTE and thus abort the operation in
> > that instance.
> >
> > Therefore, no odd things happen in the instance of reclaim being attempted
> > upon a file-backed guard region.
> >
> > Hole Punching
> > ~~~~~~~~~~~~~
> >
> > This updates the page cache and ultimately invokes unmap_mapping_range(),
> > which explicitly leaves PTE markers in place.
> >
> > Because the establishment of guard regions zapped any existing mappings to
> > file-backed folios, once the guard regions are removed then the
> > hole-punched region will be faulted in as usual and everything will behave
> > as expected.
> >
> > Lorenzo Stoakes (4):
> >   mm: allow guard regions in file-backed and read-only mappings
> >   selftests/mm: rename guard-pages to guard-regions
> >   tools/selftests: expand all guard region tests to file-backed
> >   tools/selftests: add file/shmem-backed mapping guard region tests
> >
> >  mm/madvise.c                                  |   8 +-
> >  tools/testing/selftests/mm/.gitignore         |   2 +-
> >  tools/testing/selftests/mm/Makefile           |   2 +-
> >  .../mm/{guard-pages.c => guard-regions.c}     | 921 ++++++++++++++++--
> >  4 files changed, 821 insertions(+), 112 deletions(-)
> >  rename tools/testing/selftests/mm/{guard-pages.c => guard-regions.c} (58%)
> >
> > --
> > 2.48.1
Lorenzo Stoakes Feb. 19, 2025, 9:15 a.m. UTC | #15
On Wed, Feb 19, 2025 at 12:35:12AM -0800, Kalesh Singh wrote:
> On Wed, Feb 19, 2025 at 12:25 AM Kalesh Singh <kaleshsingh@google.com> wrote:
> >
> > On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > The guard regions feature was initially implemented to support anonymous
> > > mappings only, excluding shmem.
> > >
> > > This was done such as to introduce the feature carefully and incrementally
> > > and to be conservative when considering the various caveats and corner
> > > cases that are applicable to file-backed mappings but not to anonymous
> > > ones.
> > >
> > > Now this feature has landed in 6.13, it is time to revisit this and to
> > > extend this functionality to file-backed and shmem mappings.
> > >
> > > In order to make this maximally useful, and since one may map file-backed
> > > mappings read-only (for instance ELF images), we also remove the
> > > restriction on read-only mappings and permit the establishment of guard
> > > regions in any non-hugetlb, non-mlock()'d mapping.
> >
> > Hi Lorenzo,
> >
> > Thank you for your work on this.
> >
> > Have we thought about how guard regions are represented in /proc/*/[s]maps?
> >
> > In the field, I've found that many applications read the ranges from
> > /proc/self/[s]maps to determine what they can access (usually related
> > to obfuscation techniques). If they don't know of the guard regions it
> > would cause them to crash; I think that we'll need similar entries to
> > PROT_NONE (---p) for these, and generally to maintain consistency
> > between the behavior and what is being said from /proc/*/[s]maps.
>
> To clarify why the applications may not be aware of their guard
> regions -- in the case of the ELF mappings these PROT_NONE (guard
> regions) would be installed by the dynamic loader; or may be inherited
> from the parent (zygote in Android's case).

I'm still really confused as to how you propose to use guard regions in
this scenario. If it's trying to 'clone' some set of mappings somehow then
guard regions are just not compatible with your use case at all.

I'm not sure what would be honestly... you maybe could find a way to modify
the loader to interpret PROT_NONE mappings as a cue to use guard regions
instead if that worked?

You can't both not have a VMA (something that describes mapping with common
attributes) and have page table state along with somehow having something
that describes a mapping with said page table state :)

It's a sort of non-such-beast.

Everything is a trade-off, guard regions provide the option to retain state
regarding faulting behaviour in page tables rather the VMAs, entirely
opt-in and at the behest of the user.

Now extended to file-backed and shmem!

While I hoped it would help your scenario based on your talk at LPC, this
is very clearly an important change for guard regions (for shmem most
notably) and removes an important limitation and exposes the possibility of
using it for things such as ELF mappings for those who don't also require
specific [s]maps state (which explicitly NEEDS VMA metadata state).

I suspect you can find a way of making this work in your scenario
though.

An alternative if you're willing to wait for page table state to be
acquired is to expose a new /proc/$pid/... endpoint that explicitly outputs
guard regions.

Would this solve your problem...? I had always considered implementing this
if there was a demand.

Maybe /proc/$pid/gmaps? :P 'g' could mean guard regions or... ;)

>
> >
> > -- Kalesh
> >
> > >
> > > It is permissible to permit the establishment of guard regions in read-only
> > > mappings because the guard regions only reduce access to the mapping, and
> > > when removed simply reinstate the existing attributes of the underlying
> > > VMA, meaning no access violations can occur.
> > >
> > > While the change in kernel code introduced in this series is small, the
> > > majority of the effort here is spent in extending the testing to assert
> > > that the feature works correctly across numerous file-backed mapping
> > > scenarios.
> > >
> > > Every single guard region self-test performed against anonymous memory
> > > (which is relevant and not anon-only) has now been updated to also be
> > > performed against shmem and a mapping of a file in the working directory.
> > >
> > > This confirms that all cases also function correctly for file-backed guard
> > > regions.
> > >
> > > In addition a number of other tests are added for specific file-backed
> > > mapping scenarios.
> > >
> > > There are a number of other concerns that one might have with regard to
> > > guard regions, addressed below:
> > >
> > > Readahead
> > > ~~~~~~~~~
> > >
> > > Readahead is a process through which the page cache is populated on the
> > > assumption that sequential reads will occur, thus amortising I/O and,
> > > through a clever use of the PG_readahead folio flag establishing during
> > > major fault and checked upon minor fault, provides for asynchronous I/O to
> > > occur as dat is processed, reducing I/O stalls as data is faulted in.
> > >
> > > Guard regions do not alter this mechanism which operations at the folio and
> > > fault level, but do of course prevent the faulting of folios that would
> > > otherwise be mapped.
> > >
> > > In the instance of a major fault prior to a guard region, synchronous
> > > readahead will occur including populating folios in the page cache which
> > > the guard regions will, in the case of the mapping in question, prevent
> > > access to.
> > >
> > > In addition, if PG_readahead is placed in a folio that is now inaccessible,
> > > this will prevent asynchronous readahead from occurring as it would
> > > otherwise do.
> > >
> > > However, there are mechanisms for heuristically resetting this within
> > > readahead regardless, which will 'recover' correct readahead behaviour.
> > >
> > > Readahead presumes sequential data access, the presence of a guard region
> > > clearly indicates that, at least in the guard region, no such sequential
> > > access will occur, as it cannot occur there.
> > >
> > > So this should have very little impact on any real workload. The far more
> > > important point is as to whether readahead causes incorrect or
> > > inappropriate mapping of ranges disallowed by the presence of guard
> > > regions - this is not the case, as readahead does not 'pre-fault' memory in
> > > this fashion.
> > >
> > > At any rate, any mechanism which would attempt to do so would hit the usual
> > > page fault paths, which correctly handle PTE markers as with anonymous
> > > mappings.
> > >
> > > Fault-Around
> > > ~~~~~~~~~~~~
> > >
> > > The fault-around logic, in a similar vein to readahead, attempts to improve
> > > efficiency with regard to file-backed memory mappings, however it differs
> > > in that it does not try to fetch folios into the page cache that are about
> > > to be accessed, but rather pre-maps a range of folios around the faulting
> > > address.
> > >
> > > Guard regions making use of PTE markers makes this relatively trivial, as
> > > this case is already handled - see filemap_map_folio_range() and
> > > filemap_map_order0_folio() - in both instances, the solution is to simply
> > > keep the established page table mappings and let the fault handler take
> > > care of PTE markers, as per the comment:
> > >
> > >         /*
> > >          * NOTE: If there're PTE markers, we'll leave them to be
> > >          * handled in the specific fault path, and it'll prohibit
> > >          * the fault-around logic.
> > >          */
> > >
> > > This works, as establishing guard regions results in page table mappings
> > > with PTE markers, and clearing them removes them.
> > >
> > > Truncation
> > > ~~~~~~~~~~
> > >
> > > File truncation will not eliminate existing guard regions, as the
> > > truncation operation will ultimately zap the range via
> > > unmap_mapping_range(), which specifically excludes PTE markers.
> > >
> > > Zapping
> > > ~~~~~~~
> > >
> > > Zapping is, as with anonymous mappings, handled by zap_nonpresent_ptes(),
> > > which specifically deals with guard entries, leaving them intact except in
> > > instances such as process teardown or munmap() where they need to be
> > > removed.
> > >
> > > Reclaim
> > > ~~~~~~~
> > >
> > > When reclaim is performed on file-backed folios, it ultimately invokes
> > > try_to_unmap_one() via the rmap. If the folio is non-large, then map_pte()
> > > will ultimately abort the operation for the guard region mapping. If large,
> > > then check_pte() will determine that this is a non-device private
> > > entry/device-exclusive entry 'swap' PTE and thus abort the operation in
> > > that instance.
> > >
> > > Therefore, no odd things happen in the instance of reclaim being attempted
> > > upon a file-backed guard region.
> > >
> > > Hole Punching
> > > ~~~~~~~~~~~~~
> > >
> > > This updates the page cache and ultimately invokes unmap_mapping_range(),
> > > which explicitly leaves PTE markers in place.
> > >
> > > Because the establishment of guard regions zapped any existing mappings to
> > > file-backed folios, once the guard regions are removed then the
> > > hole-punched region will be faulted in as usual and everything will behave
> > > as expected.
> > >
> > > Lorenzo Stoakes (4):
> > >   mm: allow guard regions in file-backed and read-only mappings
> > >   selftests/mm: rename guard-pages to guard-regions
> > >   tools/selftests: expand all guard region tests to file-backed
> > >   tools/selftests: add file/shmem-backed mapping guard region tests
> > >
> > >  mm/madvise.c                                  |   8 +-
> > >  tools/testing/selftests/mm/.gitignore         |   2 +-
> > >  tools/testing/selftests/mm/Makefile           |   2 +-
> > >  .../mm/{guard-pages.c => guard-regions.c}     | 921 ++++++++++++++++--
> > >  4 files changed, 821 insertions(+), 112 deletions(-)
> > >  rename tools/testing/selftests/mm/{guard-pages.c => guard-regions.c} (58%)
> > >
> > > --
> > > 2.48.1
David Hildenbrand Feb. 19, 2025, 9:15 a.m. UTC | #16
On 19.02.25 10:03, Lorenzo Stoakes wrote:
> On Wed, Feb 19, 2025 at 12:25:51AM -0800, Kalesh Singh wrote:
>> On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes
>> <lorenzo.stoakes@oracle.com> wrote:
>>>
>>> The guard regions feature was initially implemented to support anonymous
>>> mappings only, excluding shmem.
>>>
>>> This was done such as to introduce the feature carefully and incrementally
>>> and to be conservative when considering the various caveats and corner
>>> cases that are applicable to file-backed mappings but not to anonymous
>>> ones.
>>>
>>> Now this feature has landed in 6.13, it is time to revisit this and to
>>> extend this functionality to file-backed and shmem mappings.
>>>
>>> In order to make this maximally useful, and since one may map file-backed
>>> mappings read-only (for instance ELF images), we also remove the
>>> restriction on read-only mappings and permit the establishment of guard
>>> regions in any non-hugetlb, non-mlock()'d mapping.
>>
>> Hi Lorenzo,
>>
>> Thank you for your work on this.
> 
> You're welcome.
> 
>>
>> Have we thought about how guard regions are represented in /proc/*/[s]maps?
> 
> This is off-topic here but... Yes, extensively. No they do not appear
> there.
> 
> I thought you had attended LPC and my talk where I mentioned this
> purposefully as a drawback?
> 
> I went out of my way to advertise this limitation at the LPC talk, in the
> original series, etc. so it's a little disappointing that this is being
> brought up so late, but nobody else has raised objections to this issue so
> I think in general it's not a limitation that matters in practice.
> 
>>
>> In the field, I've found that many applications read the ranges from
>> /proc/self/[s]maps to determine what they can access (usually related
>> to obfuscation techniques). If they don't know of the guard regions it
>> would cause them to crash; I think that we'll need similar entries to
>> PROT_NONE (---p) for these, and generally to maintain consistency
>> between the behavior and what is being said from /proc/*/[s]maps.
> 
> No, we cannot have these, sorry.
> 
> Firstly /proc/$pid/[s]maps describes VMAs. The entire purpose of this
> feature is to avoid having to accumulate VMAs for regions which are not
> intended to be accessible.
> 
> Secondly, there is no practical means for this to be accomplished in
> /proc/$pid/maps in _any_ way - as no metadata relating to a VMA indicates
> they have guard regions.
> 
> This is intentional, because setting such metadata is simply not practical
> - why? Because when you try to split the VMA, how do you know which bit
> gets the metadata and which doesn't? You can't without _reading page
> tables_.
> 
> /proc/$pid/smaps _does_ read page tables, but we can't start pretending
> VMAs exist when they don't, this would be completely inaccurate, would
> break assumptions for things like mremap (which require a single VMA) and
> would be unworkable.
> 
> The best that _could_ be achieved is to have a marker in /proc/$pid/smaps
> saying 'hey this region has guard regions somewhere'.

And then simply expose it in /proc/$pid/pagemap, which is a better 
interface for this pte-level information inside of VMAs. We should still 
have a spare bit for that purpose in the pagemap entries.
Lorenzo Stoakes Feb. 19, 2025, 9:17 a.m. UTC | #17
On Wed, Feb 19, 2025 at 10:15:47AM +0100, David Hildenbrand wrote:
> On 19.02.25 10:03, Lorenzo Stoakes wrote:
> > On Wed, Feb 19, 2025 at 12:25:51AM -0800, Kalesh Singh wrote:
> > > On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > >
> > > > The guard regions feature was initially implemented to support anonymous
> > > > mappings only, excluding shmem.
> > > >
> > > > This was done such as to introduce the feature carefully and incrementally
> > > > and to be conservative when considering the various caveats and corner
> > > > cases that are applicable to file-backed mappings but not to anonymous
> > > > ones.
> > > >
> > > > Now this feature has landed in 6.13, it is time to revisit this and to
> > > > extend this functionality to file-backed and shmem mappings.
> > > >
> > > > In order to make this maximally useful, and since one may map file-backed
> > > > mappings read-only (for instance ELF images), we also remove the
> > > > restriction on read-only mappings and permit the establishment of guard
> > > > regions in any non-hugetlb, non-mlock()'d mapping.
> > >
> > > Hi Lorenzo,
> > >
> > > Thank you for your work on this.
> >
> > You're welcome.
> >
> > >
> > > Have we thought about how guard regions are represented in /proc/*/[s]maps?
> >
> > This is off-topic here but... Yes, extensively. No they do not appear
> > there.
> >
> > I thought you had attended LPC and my talk where I mentioned this
> > purposefully as a drawback?
> >
> > I went out of my way to advertise this limitation at the LPC talk, in the
> > original series, etc. so it's a little disappointing that this is being
> > brought up so late, but nobody else has raised objections to this issue so
> > I think in general it's not a limitation that matters in practice.
> >
> > >
> > > In the field, I've found that many applications read the ranges from
> > > /proc/self/[s]maps to determine what they can access (usually related
> > > to obfuscation techniques). If they don't know of the guard regions it
> > > would cause them to crash; I think that we'll need similar entries to
> > > PROT_NONE (---p) for these, and generally to maintain consistency
> > > between the behavior and what is being said from /proc/*/[s]maps.
> >
> > No, we cannot have these, sorry.
> >
> > Firstly /proc/$pid/[s]maps describes VMAs. The entire purpose of this
> > feature is to avoid having to accumulate VMAs for regions which are not
> > intended to be accessible.
> >
> > Secondly, there is no practical means for this to be accomplished in
> > /proc/$pid/maps in _any_ way - as no metadata relating to a VMA indicates
> > they have guard regions.
> >
> > This is intentional, because setting such metadata is simply not practical
> > - why? Because when you try to split the VMA, how do you know which bit
> > gets the metadata and which doesn't? You can't without _reading page
> > tables_.
> >
> > /proc/$pid/smaps _does_ read page tables, but we can't start pretending
> > VMAs exist when they don't, this would be completely inaccurate, would
> > break assumptions for things like mremap (which require a single VMA) and
> > would be unworkable.
> >
> > The best that _could_ be achieved is to have a marker in /proc/$pid/smaps
> > saying 'hey this region has guard regions somewhere'.
>
> And then simply expose it in /proc/$pid/pagemap, which is a better interface
> for this pte-level information inside of VMAs. We should still have a spare
> bit for that purpose in the pagemap entries.

Ah yeah thanks David forgot about that!

This is also a possibility if that'd solve your problems Kalesh?

This bit will be fought over haha

>
> --
> Cheers,
>
> David / dhildenb
>
Liam R. Howlett Feb. 19, 2025, 5:32 p.m. UTC | #18
* Kalesh Singh <kaleshsingh@google.com> [250219 03:35]:
> On Wed, Feb 19, 2025 at 12:25 AM Kalesh Singh <kaleshsingh@google.com> wrote:
> >
> > On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > The guard regions feature was initially implemented to support anonymous
> > > mappings only, excluding shmem.
> > >
> > > This was done such as to introduce the feature carefully and incrementally
> > > and to be conservative when considering the various caveats and corner
> > > cases that are applicable to file-backed mappings but not to anonymous
> > > ones.
> > >
> > > Now this feature has landed in 6.13, it is time to revisit this and to
> > > extend this functionality to file-backed and shmem mappings.
> > >
> > > In order to make this maximally useful, and since one may map file-backed
> > > mappings read-only (for instance ELF images), we also remove the
> > > restriction on read-only mappings and permit the establishment of guard
> > > regions in any non-hugetlb, non-mlock()'d mapping.
> >
> > Hi Lorenzo,
> >
> > Thank you for your work on this.
> >
> > Have we thought about how guard regions are represented in /proc/*/[s]maps?
> >
> > In the field, I've found that many applications read the ranges from
> > /proc/self/[s]maps to determine what they can access (usually related
> > to obfuscation techniques). If they don't know of the guard regions it
> > would cause them to crash; I think that we'll need similar entries to
> > PROT_NONE (---p) for these, and generally to maintain consistency
> > between the behavior and what is being said from /proc/*/[s]maps.
> 
> To clarify why the applications may not be aware of their guard
> regions -- in the case of the ELF mappings these PROT_NONE (guard
> regions) would be installed by the dynamic loader; or may be inherited
> from the parent (zygote in Android's case).

The guard regions are a method to reduce vma count and speed up guard
installation and removal.  This very much will change the proc interface
by its design, since it's removing information from the maps for the
advantage of speed and less resources.  I thought that was pretty clear
from the start.

The proc interface is also not a very good way to programmatically get
this information, as I'm sure you are aware.  I'm also sure you know
that reading that file takes the mmap read lock, and tearing can occur.

I think this implies you are taking a lot of time to get the information
you want in the way you are getting it (parsing a text file, and not
doing any mmap write work)?

If this is a common pattern, I think we need a better interface to query
this type of information.  We have an ioctl going in for getting vma
information, but that will lack these new guard regions as well.

David mentioned /proc/self/pagemap - that's certainly worth exploring,
and doesn't involve text parsing.

If it's not that common, then maybe your zygote/loader can communicate
the information in a way that does not involve read locking the programs
vm area?

Are you really parsing the same library information for each application
launched to find the guards instead of asking or being told what they
are?  I must be missing something..

Thanks,
Liam
Kalesh Singh Feb. 19, 2025, 6:52 p.m. UTC | #19
On Wed, Feb 19, 2025 at 1:17 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Wed, Feb 19, 2025 at 10:15:47AM +0100, David Hildenbrand wrote:
> > On 19.02.25 10:03, Lorenzo Stoakes wrote:
> > > On Wed, Feb 19, 2025 at 12:25:51AM -0800, Kalesh Singh wrote:
> > > > On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes
> > > > <lorenzo.stoakes@oracle.com> wrote:
> > > > >
> > > > > The guard regions feature was initially implemented to support anonymous
> > > > > mappings only, excluding shmem.
> > > > >
> > > > > This was done such as to introduce the feature carefully and incrementally
> > > > > and to be conservative when considering the various caveats and corner
> > > > > cases that are applicable to file-backed mappings but not to anonymous
> > > > > ones.
> > > > >
> > > > > Now this feature has landed in 6.13, it is time to revisit this and to
> > > > > extend this functionality to file-backed and shmem mappings.
> > > > >
> > > > > In order to make this maximally useful, and since one may map file-backed
> > > > > mappings read-only (for instance ELF images), we also remove the
> > > > > restriction on read-only mappings and permit the establishment of guard
> > > > > regions in any non-hugetlb, non-mlock()'d mapping.
> > > >
> > > > Hi Lorenzo,
> > > >
> > > > Thank you for your work on this.
> > >
> > > You're welcome.
> > >
> > > >
> > > > Have we thought about how guard regions are represented in /proc/*/[s]maps?
> > >
> > > This is off-topic here but... Yes, extensively. No they do not appear
> > > there.
> > >
> > > I thought you had attended LPC and my talk where I mentioned this
> > > purposefully as a drawback?
> > >
> > > I went out of my way to advertise this limitation at the LPC talk, in the
> > > original series, etc. so it's a little disappointing that this is being
> > > brought up so late, but nobody else has raised objections to this issue so
> > > I think in general it's not a limitation that matters in practice.
> > >

Sorry for raising this now, yes at the time I believe we discussed
reducing the vma slab memory usage for the PROT_NONE mappings. I
didn't imagine that apps could have dependencies on the mapped ELF
ranges in /proc/self/[s]maps until recent breakages from a similar
feature. Android itself doesn't depend on this but what I've seen is
banking apps and apps that have obfuscation to prevent reverse
engineering (the particulars of such obfuscation are a black box).

> > > >
> > > > In the field, I've found that many applications read the ranges from
> > > > /proc/self/[s]maps to determine what they can access (usually related
> > > > to obfuscation techniques). If they don't know of the guard regions it
> > > > would cause them to crash; I think that we'll need similar entries to
> > > > PROT_NONE (---p) for these, and generally to maintain consistency
> > > > between the behavior and what is being said from /proc/*/[s]maps.
> > >
> > > No, we cannot have these, sorry.
> > >
> > > Firstly /proc/$pid/[s]maps describes VMAs. The entire purpose of this
> > > feature is to avoid having to accumulate VMAs for regions which are not
> > > intended to be accessible.
> > >
> > > Secondly, there is no practical means for this to be accomplished in
> > > /proc/$pid/maps in _any_ way - as no metadata relating to a VMA indicates
> > > they have guard regions.
> > >
> > > This is intentional, because setting such metadata is simply not practical
> > > - why? Because when you try to split the VMA, how do you know which bit
> > > gets the metadata and which doesn't? You can't without _reading page
> > > tables_.

Yeah the splitting becomes complicated with any vm flags for this...
meaning any attempt to expose this in /proc/*/maps have to
unconditionally walk the page tables :(

> > >
> > > /proc/$pid/smaps _does_ read page tables, but we can't start pretending
> > > VMAs exist when they don't, this would be completely inaccurate, would
> > > break assumptions for things like mremap (which require a single VMA) and
> > > would be unworkable.
> > >
> > > The best that _could_ be achieved is to have a marker in /proc/$pid/smaps
> > > saying 'hey this region has guard regions somewhere'.
> >
> > And then simply expose it in /proc/$pid/pagemap, which is a better interface
> > for this pte-level information inside of VMAs. We should still have a spare
> > bit for that purpose in the pagemap entries.
>
> Ah yeah thanks David forgot about that!
>
> This is also a possibility if that'd solve your problems Kalesh?

I'm not sure what is the correct interface to advertise these. Maybe
smaps as you suggested since we already walk the page tables there?
and pagemap bit for the exact pages as well? It won't solve this
particular issue, as 1000s of in field apps do look at this through
/proc/*/maps. But maybe we have to live with that...

We can argue that such apps are broken since they may trip on the
SIGBUS off the end of the file -- usually this isn't the case for the
ELF segment mappings.

This is still useful for other cases, I just wanted to get some ideas
if this can be extended to further use cases.

Thanks,
Kalesh


>
> This bit will be fought over haha
>
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >
Lorenzo Stoakes Feb. 19, 2025, 7:20 p.m. UTC | #20
On Wed, Feb 19, 2025 at 10:52:04AM -0800, Kalesh Singh wrote:
> On Wed, Feb 19, 2025 at 1:17 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Wed, Feb 19, 2025 at 10:15:47AM +0100, David Hildenbrand wrote:
> > > On 19.02.25 10:03, Lorenzo Stoakes wrote:
> > > > On Wed, Feb 19, 2025 at 12:25:51AM -0800, Kalesh Singh wrote:
> > > > > On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes
> > > > > <lorenzo.stoakes@oracle.com> wrote:
> > > > > >
> > > > > > The guard regions feature was initially implemented to support anonymous
> > > > > > mappings only, excluding shmem.
> > > > > >
> > > > > > This was done such as to introduce the feature carefully and incrementally
> > > > > > and to be conservative when considering the various caveats and corner
> > > > > > cases that are applicable to file-backed mappings but not to anonymous
> > > > > > ones.
> > > > > >
> > > > > > Now this feature has landed in 6.13, it is time to revisit this and to
> > > > > > extend this functionality to file-backed and shmem mappings.
> > > > > >
> > > > > > In order to make this maximally useful, and since one may map file-backed
> > > > > > mappings read-only (for instance ELF images), we also remove the
> > > > > > restriction on read-only mappings and permit the establishment of guard
> > > > > > regions in any non-hugetlb, non-mlock()'d mapping.
> > > > >
> > > > > Hi Lorenzo,
> > > > >
> > > > > Thank you for your work on this.
> > > >
> > > > You're welcome.
> > > >
> > > > >
> > > > > Have we thought about how guard regions are represented in /proc/*/[s]maps?
> > > >
> > > > This is off-topic here but... Yes, extensively. No they do not appear
> > > > there.
> > > >
> > > > I thought you had attended LPC and my talk where I mentioned this
> > > > purposefully as a drawback?
> > > >
> > > > I went out of my way to advertise this limitation at the LPC talk, in the
> > > > original series, etc. so it's a little disappointing that this is being
> > > > brought up so late, but nobody else has raised objections to this issue so
> > > > I think in general it's not a limitation that matters in practice.
> > > >
>
> Sorry for raising this now, yes at the time I believe we discussed
> reducing the vma slab memory usage for the PROT_NONE mappings. I
> didn't imagine that apps could have dependencies on the mapped ELF
> ranges in /proc/self/[s]maps until recent breakages from a similar
> feature. Android itself doesn't depend on this but what I've seen is
> banking apps and apps that have obfuscation to prevent reverse
> engineering (the particulars of such obfuscation are a black box).

Ack ok fair enough, sorry, but obviously you can understand it's
frustrating when I went to great lengths to advertise this not only at the
talk but in the original series.

Really important to have these discussions early. Not that really we can do
much about this, as inherently this feature cannot give you what you need.

Is it _only_ banking apps that do this? And do they exclusively read
/proc/$pid/maps? I mean there's nothing we can do about that, sorry. If
that's immutable, then unless you do your own very, very, very slow custom
android maps implementation (that will absolutely break the /proc/$pid/maps
scalability efforts atm) this is just a no-go.

>
> > > > >
> > > > > In the field, I've found that many applications read the ranges from
> > > > > /proc/self/[s]maps to determine what they can access (usually related
> > > > > to obfuscation techniques). If they don't know of the guard regions it
> > > > > would cause them to crash; I think that we'll need similar entries to
> > > > > PROT_NONE (---p) for these, and generally to maintain consistency
> > > > > between the behavior and what is being said from /proc/*/[s]maps.
> > > >
> > > > No, we cannot have these, sorry.
> > > >
> > > > Firstly /proc/$pid/[s]maps describes VMAs. The entire purpose of this
> > > > feature is to avoid having to accumulate VMAs for regions which are not
> > > > intended to be accessible.
> > > >
> > > > Secondly, there is no practical means for this to be accomplished in
> > > > /proc/$pid/maps in _any_ way - as no metadata relating to a VMA indicates
> > > > they have guard regions.
> > > >
> > > > This is intentional, because setting such metadata is simply not practical
> > > > - why? Because when you try to split the VMA, how do you know which bit
> > > > gets the metadata and which doesn't? You can't without _reading page
> > > > tables_.
>
> Yeah the splitting becomes complicated with any vm flags for this...
> meaning any attempt to expose this in /proc/*/maps have to
> unconditionally walk the page tables :(

It's not really complicated, it's _impossible_ unless you made literally
all VMA code walk page tables for every single operation. Which we are
emphatically not going to do :)

And no, /proc/$pid/maps is _never_ going to walk page tables. For obvious
performance reasons.

>
> > > >
> > > > /proc/$pid/smaps _does_ read page tables, but we can't start pretending
> > > > VMAs exist when they don't, this would be completely inaccurate, would
> > > > break assumptions for things like mremap (which require a single VMA) and
> > > > would be unworkable.
> > > >
> > > > The best that _could_ be achieved is to have a marker in /proc/$pid/smaps
> > > > saying 'hey this region has guard regions somewhere'.
> > >
> > > And then simply expose it in /proc/$pid/pagemap, which is a better interface
> > > for this pte-level information inside of VMAs. We should still have a spare
> > > bit for that purpose in the pagemap entries.
> >
> > Ah yeah thanks David forgot about that!
> >
> > This is also a possibility if that'd solve your problems Kalesh?
>
> I'm not sure what is the correct interface to advertise these. Maybe
> smaps as you suggested since we already walk the page tables there?
> and pagemap bit for the exact pages as well? It won't solve this
> particular issue, as 1000s of in field apps do look at this through
> /proc/*/maps. But maybe we have to live with that...

I mean why are we even considering this if you can't change this anywhere?
Confused by that.

I'm afraid upstream can't radically change interfaces to suit this
scenario.

We also can't change smaps in the way you want, it _has_ to still give
output per VMA information.

The proposed change that would be there would be a flag or something
indicating that the VMA has guard regions _SOMEWHERE_ in it.

Since this doesn't solve your problem, adds complexity, and nobody else
seems to need it, I would suggest this is not worthwhile and I'd rather not
do this.

Therefore for your needs there are literally only two choices here:

1. Add a bit to /proc/$pid/pagemap OR
2. a new interface.

I am not in favour of a new interface here, if we can just extend pagemap.

What you'd have to do is:

1. Find virtual ranges via /proc/$pid/maps
2. iterate through /proc/$pid/pagemaps to retrieve state for all ranges.

Since anything that would retrieve guard region state would need to walk
page tables, any approach would be slow and I don't think this would be any
less slow than any other interface.

This way you'd be able to find all guard regions all the time.

This is just the trade-off for this feature unfortunately - its whole
design ethos is to allow modification of -faulting- behaviour without
having to modify -VMA- behaviour.

But if it's banking apps whose code you can't control (surprised you don't
lock down these interfaces), I mean is this even useful to you?

If your requirement is 'you have to change /proc/$pid/maps to show guard
regions' I mean the answer is that we can't.

>
> We can argue that such apps are broken since they may trip on the
> SIGBUS off the end of the file -- usually this isn't the case for the
> ELF segment mappings.

Or tearing of the maps interface, or things getting unmapped or or
or... It's really not a sane thing to do.

>
> This is still useful for other cases, I just wanted to get some ideas
> if this can be extended to further use cases.

Well I'm glad that you guys find it useful for _something_ ;)

Again this wasn't written only for you (it is broadly a good feature for
upstream), but I did have your use case in mind, so I'm a little
disappointed that it doesn't help, as I like to solve problems.

But I'm glad it solves at least some for you...

>
> Thanks,
> Kalesh
>
>
> >
> > This bit will be fought over haha
> >
> > >
> > > --
> > > Cheers,
> > >
> > > David / dhildenb
> > >
Kalesh Singh Feb. 19, 2025, 8:56 p.m. UTC | #21
On Wed, Feb 19, 2025 at 11:20 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Wed, Feb 19, 2025 at 10:52:04AM -0800, Kalesh Singh wrote:
> > On Wed, Feb 19, 2025 at 1:17 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > On Wed, Feb 19, 2025 at 10:15:47AM +0100, David Hildenbrand wrote:
> > > > On 19.02.25 10:03, Lorenzo Stoakes wrote:
> > > > > On Wed, Feb 19, 2025 at 12:25:51AM -0800, Kalesh Singh wrote:
> > > > > > On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes
> > > > > > <lorenzo.stoakes@oracle.com> wrote:
> > > > > > >
> > > > > > > The guard regions feature was initially implemented to support anonymous
> > > > > > > mappings only, excluding shmem.
> > > > > > >
> > > > > > > This was done such as to introduce the feature carefully and incrementally
> > > > > > > and to be conservative when considering the various caveats and corner
> > > > > > > cases that are applicable to file-backed mappings but not to anonymous
> > > > > > > ones.
> > > > > > >
> > > > > > > Now this feature has landed in 6.13, it is time to revisit this and to
> > > > > > > extend this functionality to file-backed and shmem mappings.
> > > > > > >
> > > > > > > In order to make this maximally useful, and since one may map file-backed
> > > > > > > mappings read-only (for instance ELF images), we also remove the
> > > > > > > restriction on read-only mappings and permit the establishment of guard
> > > > > > > regions in any non-hugetlb, non-mlock()'d mapping.
> > > > > >
> > > > > > Hi Lorenzo,
> > > > > >
> > > > > > Thank you for your work on this.
> > > > >
> > > > > You're welcome.
> > > > >
> > > > > >
> > > > > > Have we thought about how guard regions are represented in /proc/*/[s]maps?
> > > > >
> > > > > This is off-topic here but... Yes, extensively. No they do not appear
> > > > > there.
> > > > >
> > > > > I thought you had attended LPC and my talk where I mentioned this
> > > > > purposefully as a drawback?
> > > > >
> > > > > I went out of my way to advertise this limitation at the LPC talk, in the
> > > > > original series, etc. so it's a little disappointing that this is being
> > > > > brought up so late, but nobody else has raised objections to this issue so
> > > > > I think in general it's not a limitation that matters in practice.
> > > > >
> >
> > Sorry for raising this now, yes at the time I believe we discussed
> > reducing the vma slab memory usage for the PROT_NONE mappings. I
> > didn't imagine that apps could have dependencies on the mapped ELF
> > ranges in /proc/self/[s]maps until recent breakages from a similar
> > feature. Android itself doesn't depend on this but what I've seen is
> > banking apps and apps that have obfuscation to prevent reverse
> > engineering (the particulars of such obfuscation are a black box).
>
> Ack ok fair enough, sorry, but obviously you can understand it's
> frustrating when I went to great lengths to advertise this not only at the
> talk but in the original series.
>
> Really important to have these discussions early. Not that really we can do
> much about this, as inherently this feature cannot give you what you need.
>
> Is it _only_ banking apps that do this? And do they exclusively read
> /proc/$pid/maps? I mean there's nothing we can do about that, sorry.

Not only banking apps but that's a common category.

> If that's immutable, then unless you do your own very, very, very slow custom
> android maps implementation (that will absolutely break the /proc/$pid/maps
> scalability efforts atm) this is just a no-go.
>

Yeah unfortunately that's immutable as app versions are mostly
independent from the OS version.

We do have something that handles this by encoding the guard regions
in the vm_flags, but as you can imagine it's not generic enough for
upstream.

> >
> > > > > >
> > > > > > In the field, I've found that many applications read the ranges from
> > > > > > /proc/self/[s]maps to determine what they can access (usually related
> > > > > > to obfuscation techniques). If they don't know of the guard regions it
> > > > > > would cause them to crash; I think that we'll need similar entries to
> > > > > > PROT_NONE (---p) for these, and generally to maintain consistency
> > > > > > between the behavior and what is being said from /proc/*/[s]maps.
> > > > >
> > > > > No, we cannot have these, sorry.
> > > > >
> > > > > Firstly /proc/$pid/[s]maps describes VMAs. The entire purpose of this
> > > > > feature is to avoid having to accumulate VMAs for regions which are not
> > > > > intended to be accessible.
> > > > >
> > > > > Secondly, there is no practical means for this to be accomplished in
> > > > > /proc/$pid/maps in _any_ way - as no metadata relating to a VMA indicates
> > > > > they have guard regions.
> > > > >
> > > > > This is intentional, because setting such metadata is simply not practical
> > > > > - why? Because when you try to split the VMA, how do you know which bit
> > > > > gets the metadata and which doesn't? You can't without _reading page
> > > > > tables_.
> >
> > Yeah the splitting becomes complicated with any vm flags for this...
> > meaning any attempt to expose this in /proc/*/maps have to
> > unconditionally walk the page tables :(
>
> It's not really complicated, it's _impossible_ unless you made literally
> all VMA code walk page tables for every single operation. Which we are
> emphatically not going to do :)
>
> And no, /proc/$pid/maps is _never_ going to walk page tables. For obvious
> performance reasons.
>
> >
> > > > >
> > > > > /proc/$pid/smaps _does_ read page tables, but we can't start pretending
> > > > > VMAs exist when they don't, this would be completely inaccurate, would
> > > > > break assumptions for things like mremap (which require a single VMA) and
> > > > > would be unworkable.
> > > > >
> > > > > The best that _could_ be achieved is to have a marker in /proc/$pid/smaps
> > > > > saying 'hey this region has guard regions somewhere'.
> > > >
> > > > And then simply expose it in /proc/$pid/pagemap, which is a better interface
> > > > for this pte-level information inside of VMAs. We should still have a spare
> > > > bit for that purpose in the pagemap entries.
> > >
> > > Ah yeah thanks David forgot about that!
> > >
> > > This is also a possibility if that'd solve your problems Kalesh?
> >
> > I'm not sure what is the correct interface to advertise these. Maybe
> > smaps as you suggested since we already walk the page tables there?
> > and pagemap bit for the exact pages as well? It won't solve this
> > particular issue, as 1000s of in field apps do look at this through
> > /proc/*/maps. But maybe we have to live with that...
>
> I mean why are we even considering this if you can't change this anywhere?
> Confused by that.
>
> I'm afraid upstream can't radically change interfaces to suit this
> scenario.
>
> We also can't change smaps in the way you want, it _has_ to still give
> output per VMA information.

Sorry I wasn't suggesting to change the entries in smaps, rather
agreeing to your marker suggestion. Maybe a set of ranges for each
smaps entry that has guards? It doesn't solve the use case, but does
make these regions visible to userspace.

>
> The proposed change that would be there would be a flag or something
> indicating that the VMA has guard regions _SOMEWHERE_ in it.
>
> Since this doesn't solve your problem, adds complexity, and nobody else
> seems to need it, I would suggest this is not worthwhile and I'd rather not
> do this.
>
> Therefore for your needs there are literally only two choices here:
>
> 1. Add a bit to /proc/$pid/pagemap OR
> 2. a new interface.
>
> I am not in favour of a new interface here, if we can just extend pagemap.
>
> What you'd have to do is:
>
> 1. Find virtual ranges via /proc/$pid/maps
> 2. iterate through /proc/$pid/pagemaps to retrieve state for all ranges.
>

Could we also consider an smaps field like:

VmGuards: [AAA, BBB), [CCC, DDD), ...

or something of that sort?


> Since anything that would retrieve guard region state would need to walk
> page tables, any approach would be slow and I don't think this would be any
> less slow than any other interface.
>
> This way you'd be able to find all guard regions all the time.
>
> This is just the trade-off for this feature unfortunately - its whole
> design ethos is to allow modification of -faulting- behaviour without
> having to modify -VMA- behaviour.
>
> But if it's banking apps whose code you can't control (surprised you don't
> lock down these interfaces), I mean is this even useful to you?
>
> If your requirement is 'you have to change /proc/$pid/maps to show guard
> regions' I mean the answer is that we can't.
>
> >
> > We can argue that such apps are broken since they may trip on the
> > SIGBUS off the end of the file -- usually this isn't the case for the
> > ELF segment mappings.
>
> Or tearing of the maps interface, or things getting unmapped or or
> or... It's really not a sane thing to do.
>
> >
> > This is still useful for other cases, I just wanted to get some ideas
> > if this can be extended to further use cases.
>
> Well I'm glad that you guys find it useful for _something_ ;)
>
> Again this wasn't written only for you (it is broadly a good feature for
> upstream), but I did have your use case in mind, so I'm a little
> disappointed that it doesn't help, as I like to solve problems.
>
> But I'm glad it solves at least some for you...

I recall Liam had a proposal to store the guard ranges in the maple tree?

I wonder if that can be used in combination with this approach to have
a better representation of this?

>
> >
> > Thanks,
> > Kalesh
> >
> >
> > >
> > > This bit will be fought over haha
> > >
> > > >
> > > > --
> > > > Cheers,
> > > >
> > > > David / dhildenb
> > > >
Lorenzo Stoakes Feb. 20, 2025, 8:51 a.m. UTC | #22
On Wed, Feb 19, 2025 at 12:56:31PM -0800, Kalesh Singh wrote:
> > We also can't change smaps in the way you want, it _has_ to still give
> > output per VMA information.
>
> Sorry I wasn't suggesting to change the entries in smaps, rather
> agreeing to your marker suggestion. Maybe a set of ranges for each
> smaps entry that has guards? It doesn't solve the use case, but does
> make these regions visible to userspace.

No, you are not providing a usecase for this. /proc/$pid/pagemaps does not
contaminate the smaps output, mess with efforts to make it RCU readable,
require updating the ioctl interface, etc. so it is clearly the better
choice.

>
> >
> > The proposed change that would be there would be a flag or something
> > indicating that the VMA has guard regions _SOMEWHERE_ in it.
> >
> > Since this doesn't solve your problem, adds complexity, and nobody else
> > seems to need it, I would suggest this is not worthwhile and I'd rather not
> > do this.
> >
> > Therefore for your needs there are literally only two choices here:
> >
> > 1. Add a bit to /proc/$pid/pagemap OR
> > 2. a new interface.
> >
> > I am not in favour of a new interface here, if we can just extend pagemap.
> >
> > What you'd have to do is:
> >
> > 1. Find virtual ranges via /proc/$pid/maps
> > 2. iterate through /proc/$pid/pagemaps to retrieve state for all ranges.
> >
>
> Could we also consider an smaps field like:
>
> VmGuards: [AAA, BBB), [CCC, DDD), ...
>
> or something of that sort?

No, absolutely, categorically not. You realise these could be thousands of
characters long right?

/proc/$pid/pagemaps resolves this without contaminating this output.

> > Well I'm glad that you guys find it useful for _something_ ;)
> >
> > Again this wasn't written only for you (it is broadly a good feature for
> > upstream), but I did have your use case in mind, so I'm a little
> > disappointed that it doesn't help, as I like to solve problems.
> >
> > But I'm glad it solves at least some for you...
>
> I recall Liam had a proposal to store the guard ranges in the maple tree?
>
> I wonder if that can be used in combination with this approach to have
> a better representation of this?

This was an alternative proposal made prior to the feature being
implemented (and you and others at Google were welcome to comment and many
were cc'd, etc.).

There is no 'in combination with'. This feature would take weeks/months to
implement, fundamentally impact the maple tree VMA implementation
and... not actually achieve anything + immediately be redundant.

Plus it'd likely be slower, have locking implications, would have kernel
memory allocation implications, a lot more complexity and probably other
problems besides (we discussed this at length at the time and a number of
issues came up, I can't recall all of them).

To be crystal clear - we are empathically NOT changing /proc/$pid/maps to
lie about VMAs regardless of underlying implementation, nor adding
thousands of characters to /proc/$pid/smaps entries.

This is independent of implementation and would have been the case had we
gone with a maple node version.

So in no world is your problem solved here, unfortunately you have
inextricably tied yourself to a VMA representation here.

I still wonder if you could find some means of abstracting this, but
/proc/$pid/pagemaps is where I am likely to expose this information for
anybody who needs it, and will likely send a series for this relatively
soon.

If you _can_ abstract this in some way, then if we provide this information
_anywhere_ you can get it.

As I said to you earlier, the _best_ we could do in smaps would be to add a
flag like 'Grd' or something to indicate some part of the VMA is
guarded. But I won't do that unless somebody has an -actual use case- for
it.

David's /proc/$pid/pagemaps suggestion is excellent, avoids all the
pitfalls, exposes guard regions to anybody who really really wants to know
and doesn't interfere with anything else, so this is what we'll go with.

Regards, Lorenzo
David Hildenbrand Feb. 20, 2025, 8:57 a.m. UTC | #23
On 20.02.25 09:51, Lorenzo Stoakes wrote:
> On Wed, Feb 19, 2025 at 12:56:31PM -0800, Kalesh Singh wrote:
>>> We also can't change smaps in the way you want, it _has_ to still give
>>> output per VMA information.
>>
>> Sorry I wasn't suggesting to change the entries in smaps, rather
>> agreeing to your marker suggestion. Maybe a set of ranges for each
>> smaps entry that has guards? It doesn't solve the use case, but does
>> make these regions visible to userspace.
> 
> No, you are not providing a usecase for this. /proc/$pid/pagemaps does not
> contaminate the smaps output, mess with efforts to make it RCU readable,
> require updating the ioctl interface, etc. so it is clearly the better
> choice.
> 
>>
>>>
>>> The proposed change that would be there would be a flag or something
>>> indicating that the VMA has guard regions _SOMEWHERE_ in it.
>>>
>>> Since this doesn't solve your problem, adds complexity, and nobody else
>>> seems to need it, I would suggest this is not worthwhile and I'd rather not
>>> do this.
>>>
>>> Therefore for your needs there are literally only two choices here:
>>>
>>> 1. Add a bit to /proc/$pid/pagemap OR
>>> 2. a new interface.
>>>
>>> I am not in favour of a new interface here, if we can just extend pagemap.
>>>
>>> What you'd have to do is:
>>>
>>> 1. Find virtual ranges via /proc/$pid/maps
>>> 2. iterate through /proc/$pid/pagemaps to retrieve state for all ranges.
>>>
>>
>> Could we also consider an smaps field like:
>>
>> VmGuards: [AAA, BBB), [CCC, DDD), ...
>>
>> or something of that sort?
> 
> No, absolutely, categorically not. You realise these could be thousands of
> characters long right?
> 
> /proc/$pid/pagemaps resolves this without contaminating this output.
> 
>>> Well I'm glad that you guys find it useful for _something_ ;)
>>>
>>> Again this wasn't written only for you (it is broadly a good feature for
>>> upstream), but I did have your use case in mind, so I'm a little
>>> disappointed that it doesn't help, as I like to solve problems.
>>>
>>> But I'm glad it solves at least some for you...
>>
>> I recall Liam had a proposal to store the guard ranges in the maple tree?
>>
>> I wonder if that can be used in combination with this approach to have
>> a better representation of this?
> 
> This was an alternative proposal made prior to the feature being
> implemented (and you and others at Google were welcome to comment and many
> were cc'd, etc.).
> 
> There is no 'in combination with'. This feature would take weeks/months to
> implement, fundamentally impact the maple tree VMA implementation
> and... not actually achieve anything + immediately be redundant.
> 
> Plus it'd likely be slower, have locking implications, would have kernel
> memory allocation implications, a lot more complexity and probably other
> problems besides (we discussed this at length at the time and a number of
> issues came up, I can't recall all of them).
> 
> To be crystal clear - we are empathically NOT changing /proc/$pid/maps to
> lie about VMAs regardless of underlying implementation, nor adding
> thousands of characters to /proc/$pid/smaps entries.

Yes. Calling it a "guard region" might be part of the problem 
(/"misunderstanding"), because it reminds people of "virtual memory 
regions".

"Guard markers" or similar might have been clearer that these operate on 
individual PTEs, require page table scanning etc ... which makes them a 
lot more scalable and fine-grained and provides all these benfits, with 
the downside being that we don't end up with that many "virtual memory 
regions" that maps/smaps operate on.

[...]

> 
> As I said to you earlier, the _best_ we could do in smaps would be to add a
> flag like 'Grd' or something to indicate some part of the VMA is
> guarded. But I won't do that unless somebody has an -actual use case- for
> it.

Right, and that would limit where you have to manually scan. Something 
similar is being done with uffd-wp markers IIRC.
Lorenzo Stoakes Feb. 20, 2025, 9:04 a.m. UTC | #24
On Thu, Feb 20, 2025 at 09:57:37AM +0100, David Hildenbrand wrote:
> On 20.02.25 09:51, Lorenzo Stoakes wrote:
> > On Wed, Feb 19, 2025 at 12:56:31PM -0800, Kalesh Singh wrote:
> > > > We also can't change smaps in the way you want, it _has_ to still give
> > > > output per VMA information.
> > >
> > > Sorry I wasn't suggesting to change the entries in smaps, rather
> > > agreeing to your marker suggestion. Maybe a set of ranges for each
> > > smaps entry that has guards? It doesn't solve the use case, but does
> > > make these regions visible to userspace.
> >
> > No, you are not providing a usecase for this. /proc/$pid/pagemaps does not
> > contaminate the smaps output, mess with efforts to make it RCU readable,
> > require updating the ioctl interface, etc. so it is clearly the better
> > choice.
> >
> > >
> > > >
> > > > The proposed change that would be there would be a flag or something
> > > > indicating that the VMA has guard regions _SOMEWHERE_ in it.
> > > >
> > > > Since this doesn't solve your problem, adds complexity, and nobody else
> > > > seems to need it, I would suggest this is not worthwhile and I'd rather not
> > > > do this.
> > > >
> > > > Therefore for your needs there are literally only two choices here:
> > > >
> > > > 1. Add a bit to /proc/$pid/pagemap OR
> > > > 2. a new interface.
> > > >
> > > > I am not in favour of a new interface here, if we can just extend pagemap.
> > > >
> > > > What you'd have to do is:
> > > >
> > > > 1. Find virtual ranges via /proc/$pid/maps
> > > > 2. iterate through /proc/$pid/pagemaps to retrieve state for all ranges.
> > > >
> > >
> > > Could we also consider an smaps field like:
> > >
> > > VmGuards: [AAA, BBB), [CCC, DDD), ...
> > >
> > > or something of that sort?
> >
> > No, absolutely, categorically not. You realise these could be thousands of
> > characters long right?
> >
> > /proc/$pid/pagemaps resolves this without contaminating this output.
> >
> > > > Well I'm glad that you guys find it useful for _something_ ;)
> > > >
> > > > Again this wasn't written only for you (it is broadly a good feature for
> > > > upstream), but I did have your use case in mind, so I'm a little
> > > > disappointed that it doesn't help, as I like to solve problems.
> > > >
> > > > But I'm glad it solves at least some for you...
> > >
> > > I recall Liam had a proposal to store the guard ranges in the maple tree?
> > >
> > > I wonder if that can be used in combination with this approach to have
> > > a better representation of this?
> >
> > This was an alternative proposal made prior to the feature being
> > implemented (and you and others at Google were welcome to comment and many
> > were cc'd, etc.).
> >
> > There is no 'in combination with'. This feature would take weeks/months to
> > implement, fundamentally impact the maple tree VMA implementation
> > and... not actually achieve anything + immediately be redundant.
> >
> > Plus it'd likely be slower, have locking implications, would have kernel
> > memory allocation implications, a lot more complexity and probably other
> > problems besides (we discussed this at length at the time and a number of
> > issues came up, I can't recall all of them).
> >
> > To be crystal clear - we are empathically NOT changing /proc/$pid/maps to
> > lie about VMAs regardless of underlying implementation, nor adding
> > thousands of characters to /proc/$pid/smaps entries.
>
> Yes. Calling it a "guard region" might be part of the problem
> (/"misunderstanding"), because it reminds people of "virtual memory
> regions".
>
> "Guard markers" or similar might have been clearer that these operate on
> individual PTEs, require page table scanning etc ... which makes them a lot
> more scalable and fine-grained and provides all these benfits, with the
> downside being that we don't end up with that many "virtual memory regions"
> that maps/smaps operate on.

Honestly David you and the naming... :P

I disagree, sorry. Saying 'guard' anything might make people think one
thing or another. We can't account for that. I mean don't get me started on
'pinning' or any of the million other overloaded terms we use...

I _hugely_ publicly went out of my way to express the limitations, I gave a
talk, we had meetings, I mentioned it in the series.

Honestly if at that point you still don't realise, that's not a naming
problem. It's a 'did not participate with upstream' problem.

I like guard regions, as they're not pages as we previously referred to
them. People have no idea what a marker is, it doesn't sound like it spans
ranges, no don't like it sorry.

And sorry but this naming topic is closed :) I already let you change the
naming of the MADV_'s, which broke my heart, there will not be a second
heart breaking...

>
> [...]
>
> >
> > As I said to you earlier, the _best_ we could do in smaps would be to add a
> > flag like 'Grd' or something to indicate some part of the VMA is
> > guarded. But I won't do that unless somebody has an -actual use case- for
> > it.
>
> Right, and that would limit where you have to manually scan. Something
> similar is being done with uffd-wp markers IIRC.

Yeah that's a good point, but honestly if you're reading smaps that reads
the page tables, then reading /proc/$pid/pagemaps and reading page tables
TWICE that seems inefficient vs. just reading /proc/$pid/maps, then reading
/proc/$pid/pagemaps and reading page tables once.

>
> --
> Cheers,
>
> David / dhildenb
>
Vlastimil Babka Feb. 20, 2025, 9:22 a.m. UTC | #25
On 2/20/25 09:51, Lorenzo Stoakes wrote:
> On Wed, Feb 19, 2025 at 12:56:31PM -0800, Kalesh Singh wrote:
> 
> As I said to you earlier, the _best_ we could do in smaps would be to add a
> flag like 'Grd' or something to indicate some part of the VMA is

In smaps we could say how many kB is covered by guard ptes and it would be
in line with the current output, and the fact it's already scanning page
tables, so virtually no new overhead. But it wouldn't tell you the address
ranges, of course.

> guarded. But I won't do that unless somebody has an -actual use case- for
> it.

Right, we need to hear that first. Also I'm a bit confused about what's the
issue with the existing Android apps and the proposals. Do the existing apps
expect to see particular PROT_NONE regions, which would be gone (become part
of the surrounding vma's) when using guards instead? Then clearly there's no
way to use guards for them without breaking their assumptions.

But assuming future versions of these apps for future android versions would
have to adapt to guards instead of PROT_NONE, why do we have to promise them
to represent the guards, if the point is just so they adapt to the new state
of smaps in their dubious security checking code, and not break anymore?

(but geez, if android apps are to use the android apis, which is java based
(IIRC?) why were there ever allowed to read /proc in the first place? such a
mistake, sigh)

> David's /proc/$pid/pagemaps suggestion is excellent, avoids all the
> pitfalls, exposes guard regions to anybody who really really wants to know
> and doesn't interfere with anything else, so this is what we'll go with.
> 
> Regards, Lorenzo
David Hildenbrand Feb. 20, 2025, 9:23 a.m. UTC | #26
On 20.02.25 10:04, Lorenzo Stoakes wrote:
> On Thu, Feb 20, 2025 at 09:57:37AM +0100, David Hildenbrand wrote:
>> On 20.02.25 09:51, Lorenzo Stoakes wrote:
>>> On Wed, Feb 19, 2025 at 12:56:31PM -0800, Kalesh Singh wrote:
>>>>> We also can't change smaps in the way you want, it _has_ to still give
>>>>> output per VMA information.
>>>>
>>>> Sorry I wasn't suggesting to change the entries in smaps, rather
>>>> agreeing to your marker suggestion. Maybe a set of ranges for each
>>>> smaps entry that has guards? It doesn't solve the use case, but does
>>>> make these regions visible to userspace.
>>>
>>> No, you are not providing a usecase for this. /proc/$pid/pagemaps does not
>>> contaminate the smaps output, mess with efforts to make it RCU readable,
>>> require updating the ioctl interface, etc. so it is clearly the better
>>> choice.
>>>
>>>>
>>>>>
>>>>> The proposed change that would be there would be a flag or something
>>>>> indicating that the VMA has guard regions _SOMEWHERE_ in it.
>>>>>
>>>>> Since this doesn't solve your problem, adds complexity, and nobody else
>>>>> seems to need it, I would suggest this is not worthwhile and I'd rather not
>>>>> do this.
>>>>>
>>>>> Therefore for your needs there are literally only two choices here:
>>>>>
>>>>> 1. Add a bit to /proc/$pid/pagemap OR
>>>>> 2. a new interface.
>>>>>
>>>>> I am not in favour of a new interface here, if we can just extend pagemap.
>>>>>
>>>>> What you'd have to do is:
>>>>>
>>>>> 1. Find virtual ranges via /proc/$pid/maps
>>>>> 2. iterate through /proc/$pid/pagemaps to retrieve state for all ranges.
>>>>>
>>>>
>>>> Could we also consider an smaps field like:
>>>>
>>>> VmGuards: [AAA, BBB), [CCC, DDD), ...
>>>>
>>>> or something of that sort?
>>>
>>> No, absolutely, categorically not. You realise these could be thousands of
>>> characters long right?
>>>
>>> /proc/$pid/pagemaps resolves this without contaminating this output.
>>>
>>>>> Well I'm glad that you guys find it useful for _something_ ;)
>>>>>
>>>>> Again this wasn't written only for you (it is broadly a good feature for
>>>>> upstream), but I did have your use case in mind, so I'm a little
>>>>> disappointed that it doesn't help, as I like to solve problems.
>>>>>
>>>>> But I'm glad it solves at least some for you...
>>>>
>>>> I recall Liam had a proposal to store the guard ranges in the maple tree?
>>>>
>>>> I wonder if that can be used in combination with this approach to have
>>>> a better representation of this?
>>>
>>> This was an alternative proposal made prior to the feature being
>>> implemented (and you and others at Google were welcome to comment and many
>>> were cc'd, etc.).
>>>
>>> There is no 'in combination with'. This feature would take weeks/months to
>>> implement, fundamentally impact the maple tree VMA implementation
>>> and... not actually achieve anything + immediately be redundant.
>>>
>>> Plus it'd likely be slower, have locking implications, would have kernel
>>> memory allocation implications, a lot more complexity and probably other
>>> problems besides (we discussed this at length at the time and a number of
>>> issues came up, I can't recall all of them).
>>>
>>> To be crystal clear - we are empathically NOT changing /proc/$pid/maps to
>>> lie about VMAs regardless of underlying implementation, nor adding
>>> thousands of characters to /proc/$pid/smaps entries.
>>
>> Yes. Calling it a "guard region" might be part of the problem
>> (/"misunderstanding"), because it reminds people of "virtual memory
>> regions".
>>
>> "Guard markers" or similar might have been clearer that these operate on
>> individual PTEs, require page table scanning etc ... which makes them a lot
>> more scalable and fine-grained and provides all these benfits, with the
>> downside being that we don't end up with that many "virtual memory regions"
>> that maps/smaps operate on.
> 
> Honestly David you and the naming... :P
> 
 > I disagree, sorry. Saying 'guard' anything might make people think 
one> thing or another. We can't account for that. I mean don't get me 
started on
> 'pinning' or any of the million other overloaded terms we use...
> 
 > I _hugely_ publicly went out of my way to express the limitations, I 
gave a> talk, we had meetings, I mentioned it in the series.
> 
> Honestly if at that point you still don't realise, that's not a naming
> problem. It's a 'did not participate with upstream' problem.
> 
> I like guard regions, as they're not pages as we previously referred to
> them. People have no idea what a marker is, it doesn't sound like it spans
> ranges, no don't like it sorry.
> 
> And sorry but this naming topic is closed :) I already let you change the
> naming of the MADV_'s, which broke my heart, there will not be a second
> heart breaking...

Lorenzo, I was not pushing for it to be changed *now*, that ship has 
sailed, and personally, I *don't* find it confusing because I know how 
it works under the hood.

I was trying to find a reason *why* people would thing that it would 
show up in smaps in the first place. For example, just when reading the 
MAN page *today*.

Doesn't really matter now, it is named the way it is, and all we can do 
is try making documentation clearer if it keeps confusing people.

Your conclusion is 'did not participate with upstream'; I don't agree 
with that. But maybe you and Kalesh have a history on that that let's 
you react on his questions IMHO more emotionally than it should have been.


> 
>>
>> [...]
>>
>>>
>>> As I said to you earlier, the _best_ we could do in smaps would be to add a
>>> flag like 'Grd' or something to indicate some part of the VMA is
>>> guarded. But I won't do that unless somebody has an -actual use case- for
>>> it.
>>
>> Right, and that would limit where you have to manually scan. Something
>> similar is being done with uffd-wp markers IIRC.
> 
> Yeah that's a good point, but honestly if you're reading smaps that reads
> the page tables, then reading /proc/$pid/pagemaps and reading page tables
> TWICE that seems inefficient vs. just reading /proc/$pid/maps, then reading
> /proc/$pid/pagemaps and reading page tables once.

Right; I recently wished that we would have an interface to obtain more 
VMA flags without having to go through smaps
Lorenzo Stoakes Feb. 20, 2025, 9:47 a.m. UTC | #27
On Thu, Feb 20, 2025 at 10:23:57AM +0100, David Hildenbrand wrote:
> On 20.02.25 10:04, Lorenzo Stoakes wrote:
> > On Thu, Feb 20, 2025 at 09:57:37AM +0100, David Hildenbrand wrote:
> > > On 20.02.25 09:51, Lorenzo Stoakes wrote:
> > > > On Wed, Feb 19, 2025 at 12:56:31PM -0800, Kalesh Singh wrote:
> > > > > > We also can't change smaps in the way you want, it _has_ to still give
> > > > > > output per VMA information.
> > > > >
> > > > > Sorry I wasn't suggesting to change the entries in smaps, rather
> > > > > agreeing to your marker suggestion. Maybe a set of ranges for each
> > > > > smaps entry that has guards? It doesn't solve the use case, but does
> > > > > make these regions visible to userspace.
> > > >
> > > > No, you are not providing a usecase for this. /proc/$pid/pagemaps does not
> > > > contaminate the smaps output, mess with efforts to make it RCU readable,
> > > > require updating the ioctl interface, etc. so it is clearly the better
> > > > choice.
> > > >
> > > > >
> > > > > >
> > > > > > The proposed change that would be there would be a flag or something
> > > > > > indicating that the VMA has guard regions _SOMEWHERE_ in it.
> > > > > >
> > > > > > Since this doesn't solve your problem, adds complexity, and nobody else
> > > > > > seems to need it, I would suggest this is not worthwhile and I'd rather not
> > > > > > do this.
> > > > > >
> > > > > > Therefore for your needs there are literally only two choices here:
> > > > > >
> > > > > > 1. Add a bit to /proc/$pid/pagemap OR
> > > > > > 2. a new interface.
> > > > > >
> > > > > > I am not in favour of a new interface here, if we can just extend pagemap.
> > > > > >
> > > > > > What you'd have to do is:
> > > > > >
> > > > > > 1. Find virtual ranges via /proc/$pid/maps
> > > > > > 2. iterate through /proc/$pid/pagemaps to retrieve state for all ranges.
> > > > > >
> > > > >
> > > > > Could we also consider an smaps field like:
> > > > >
> > > > > VmGuards: [AAA, BBB), [CCC, DDD), ...
> > > > >
> > > > > or something of that sort?
> > > >
> > > > No, absolutely, categorically not. You realise these could be thousands of
> > > > characters long right?
> > > >
> > > > /proc/$pid/pagemaps resolves this without contaminating this output.
> > > >
> > > > > > Well I'm glad that you guys find it useful for _something_ ;)
> > > > > >
> > > > > > Again this wasn't written only for you (it is broadly a good feature for
> > > > > > upstream), but I did have your use case in mind, so I'm a little
> > > > > > disappointed that it doesn't help, as I like to solve problems.
> > > > > >
> > > > > > But I'm glad it solves at least some for you...
> > > > >
> > > > > I recall Liam had a proposal to store the guard ranges in the maple tree?
> > > > >
> > > > > I wonder if that can be used in combination with this approach to have
> > > > > a better representation of this?
> > > >
> > > > This was an alternative proposal made prior to the feature being
> > > > implemented (and you and others at Google were welcome to comment and many
> > > > were cc'd, etc.).
> > > >
> > > > There is no 'in combination with'. This feature would take weeks/months to
> > > > implement, fundamentally impact the maple tree VMA implementation
> > > > and... not actually achieve anything + immediately be redundant.
> > > >
> > > > Plus it'd likely be slower, have locking implications, would have kernel
> > > > memory allocation implications, a lot more complexity and probably other
> > > > problems besides (we discussed this at length at the time and a number of
> > > > issues came up, I can't recall all of them).
> > > >
> > > > To be crystal clear - we are empathically NOT changing /proc/$pid/maps to
> > > > lie about VMAs regardless of underlying implementation, nor adding
> > > > thousands of characters to /proc/$pid/smaps entries.
> > >
> > > Yes. Calling it a "guard region" might be part of the problem
> > > (/"misunderstanding"), because it reminds people of "virtual memory
> > > regions".
> > >
> > > "Guard markers" or similar might have been clearer that these operate on
> > > individual PTEs, require page table scanning etc ... which makes them a lot
> > > more scalable and fine-grained and provides all these benfits, with the
> > > downside being that we don't end up with that many "virtual memory regions"
> > > that maps/smaps operate on.
> >
> > Honestly David you and the naming... :P
> >
> > I disagree, sorry. Saying 'guard' anything might make people think one>
> thing or another. We can't account for that. I mean don't get me started on
> > 'pinning' or any of the million other overloaded terms we use...
> >
> > I _hugely_ publicly went out of my way to express the limitations, I gave
> a> talk, we had meetings, I mentioned it in the series.
> >
> > Honestly if at that point you still don't realise, that's not a naming
> > problem. It's a 'did not participate with upstream' problem.
> >
> > I like guard regions, as they're not pages as we previously referred to
> > them. People have no idea what a marker is, it doesn't sound like it spans
> > ranges, no don't like it sorry.
> >
> > And sorry but this naming topic is closed :) I already let you change the
> > naming of the MADV_'s, which broke my heart, there will not be a second
> > heart breaking...
>
> Lorenzo, I was not pushing for it to be changed *now*, that ship has sailed,
> and personally, I *don't* find it confusing because I know how it works
> under the hood.
>
> I was trying to find a reason *why* people would thing that it would show up
> in smaps in the first place. For example, just when reading the MAN page
> *today*.
>
> Doesn't really matter now, it is named the way it is, and all we can do is
> try making documentation clearer if it keeps confusing people.

Right, but I disagree with your analysis. In case as you say, it doesn't
really matter at this stage.

>
> Your conclusion is 'did not participate with upstream'; I don't agree with
> that. But maybe you and Kalesh have a history on that that let's you react
> on his questions IMHO more emotionally than it should have been.

This is wholly unfair, I have been very reasonable in response to this
thread. I have offered to find solutions, I have tried to understand the
problem in spite of having gone to great lengths to try to discuss the
limitations of the proposed approach in every venue I possibly could.

I go out of my way to deal professionally and objectively with what is
presented. Nothing here is emotional. So I'd ask that you please abstain
from making commentary like this which has no basis.

My point about not participating with upstream is that Kalesh is now asking
for an -entirely different- approach to guard regions than was implemented,
months after it was merged.

This is after many discussion were had including with people at Google
among other organisations, an RFC which clearly delineated the limitations,
a talk at LPC also.

I feel I communite these things better than many people actually, I go out
of my way to document, add extensive self-documenting tests, etc. etc. and
to participate, engage and take onboard feedback from others.

So I'm suggesting that clearly - something broke down here. There was a
miscommunication, or there was a lack of awareness of a key requirement.

I mean a large motivator for file-backed support here came from the LPC
talk give by Kalesh and Juan re: ELF guard regions. So obviously that this
breakdown in communication with upstream occurred is very unfortunate.

I am not blaming anybody or being 'emotional'. I am simply stating what
seems to me to be a clear fact.

I genuinely don't understand how it could be seen any other way? How can
requesting that an entirely different alternative approach months after the
fact be anything other than some failure to engage with upstream?

I am emphatically _not_ blaming Kalesh by the way, whom I respect and with
whom I have no problem, in any way whatsoever. I apologised when I realised
that he simply was not aware of this limitation at LPC if you look through
the thread, having politely suggested disappointment at this not having
been brought up then.

I am suggesting that within Google clearly there has been some form of
miscommunication or failure for one aspect of things (this limitation) to
be expressed at the right time to engage with upstream.

Sorry if you misinterpreted that as something else, this is the _only_
point I am making here.

And again, I am going out of my way to find practical and helpful ways
forward - though I cannot see how we can possibly fulfil Kalesh's needs
here.

>
>
> >
> > >
> > > [...]
> > >
> > > >
> > > > As I said to you earlier, the _best_ we could do in smaps would be to add a
> > > > flag like 'Grd' or something to indicate some part of the VMA is
> > > > guarded. But I won't do that unless somebody has an -actual use case- for
> > > > it.
> > >
> > > Right, and that would limit where you have to manually scan. Something
> > > similar is being done with uffd-wp markers IIRC.
> >
> > Yeah that's a good point, but honestly if you're reading smaps that reads
> > the page tables, then reading /proc/$pid/pagemaps and reading page tables
> > TWICE that seems inefficient vs. just reading /proc/$pid/maps, then reading
> > /proc/$pid/pagemaps and reading page tables once.
>
> Right; I recently wished that we would have an interface to obtain more VMA
> flags without having to go through smaps

Well maybe that lends itself to the idea of adding a whole new interface in
general...

>
> --
> Cheers,
>
> David / dhildenb
>
Lorenzo Stoakes Feb. 20, 2025, 9:53 a.m. UTC | #28
On Thu, Feb 20, 2025 at 10:22:29AM +0100, Vlastimil Babka wrote:
> On 2/20/25 09:51, Lorenzo Stoakes wrote:
> > On Wed, Feb 19, 2025 at 12:56:31PM -0800, Kalesh Singh wrote:
> >
> > As I said to you earlier, the _best_ we could do in smaps would be to add a
> > flag like 'Grd' or something to indicate some part of the VMA is
>
> In smaps we could say how many kB is covered by guard ptes and it would be
> in line with the current output, and the fact it's already scanning page
> tables, so virtually no new overhead. But it wouldn't tell you the address
> ranges, of course.

Yes right. Sorry, I didn't think to suggest this, the fundamental point
being that there is essentially a 'flag' or 'entry' saying 'there are
guard regions here'. And indeed that could include ranges.

But it doesn't tell you the actual address ranges which is what Kalesh
appears to require.

You could equally well find these, and the sizes (albeit less conveniently)
with a combination of /proc/$pid/maps and adding a guard bit to
/proc/$pid/pagemap.

>
> > guarded. But I won't do that unless somebody has an -actual use case- for
> > it.
>
> Right, we need to hear that first. Also I'm a bit confused about what's the
> issue with the existing Android apps and the proposals. Do the existing apps
> expect to see particular PROT_NONE regions, which would be gone (become part
> of the surrounding vma's) when using guards instead? Then clearly there's no
> way to use guards for them without breaking their assumptions.
>
> But assuming future versions of these apps for future android versions would
> have to adapt to guards instead of PROT_NONE, why do we have to promise them
> to represent the guards, if the point is just so they adapt to the new state
> of smaps in their dubious security checking code, and not break anymore?
>
> (but geez, if android apps are to use the android apis, which is java based
> (IIRC?) why were there ever allowed to read /proc in the first place? such a
> mistake, sigh)
>
> > David's /proc/$pid/pagemaps suggestion is excellent, avoids all the
> > pitfalls, exposes guard regions to anybody who really really wants to know
> > and doesn't interfere with anything else, so this is what we'll go with.
> >
> > Regards, Lorenzo
>
David Hildenbrand Feb. 20, 2025, 10:03 a.m. UTC | #29
>> Your conclusion is 'did not participate with upstream'; I don't agree with
>> that. But maybe you and Kalesh have a history on that that let's you react
>> on his questions IMHO more emotionally than it should have been.
> 
> This is wholly unfair, I have been very reasonable in response to this
> thread. I have offered to find solutions, I have tried to understand the
> problem in spite of having gone to great lengths to try to discuss the
> limitations of the proposed approach in every venue I possibly could.
> 
> I go out of my way to deal professionally and objectively with what is
> presented. Nothing here is emotional. So I'd ask that you please abstain
> from making commentary like this which has no basis.

I appreciate everything you write below. But this request is just 
impossible. I will keep raising my opinion and misunderstandings will 
happen.

Note that the whole "Honestly David you and the naming. .." thing could 
have been written as "I don't think it's a naming problem."

>>
>>
>>>
>>>>
>>>> [...]
>>>>
>>>>>
>>>>> As I said to you earlier, the _best_ we could do in smaps would be to add a
>>>>> flag like 'Grd' or something to indicate some part of the VMA is
>>>>> guarded. But I won't do that unless somebody has an -actual use case- for
>>>>> it.
>>>>
>>>> Right, and that would limit where you have to manually scan. Something
>>>> similar is being done with uffd-wp markers IIRC.
>>>
>>> Yeah that's a good point, but honestly if you're reading smaps that reads
>>> the page tables, then reading /proc/$pid/pagemaps and reading page tables
>>> TWICE that seems inefficient vs. just reading /proc/$pid/maps, then reading
>>> /proc/$pid/pagemaps and reading page tables once.
>>
>> Right; I recently wished that we would have an interface to obtain more VMA
>> flags without having to go through smaps
> 
> Well maybe that lends itself to the idea of adding a whole new interface in
> general...

An extended "maps" interface might be reasonable, that allows for 
exposing more things without walking the page tables. (e.g., flags)

Maybe one could have an indicator that says "ever had guard regions in 
this mapping" without actually walking the page tables.
Lorenzo Stoakes Feb. 20, 2025, 10:15 a.m. UTC | #30
On Thu, Feb 20, 2025 at 11:03:02AM +0100, David Hildenbrand wrote:
> > > Your conclusion is 'did not participate with upstream'; I don't agree with
> > > that. But maybe you and Kalesh have a history on that that let's you react
> > > on his questions IMHO more emotionally than it should have been.
> >
> > This is wholly unfair, I have been very reasonable in response to this
> > thread. I have offered to find solutions, I have tried to understand the
> > problem in spite of having gone to great lengths to try to discuss the
> > limitations of the proposed approach in every venue I possibly could.
> >
> > I go out of my way to deal professionally and objectively with what is
> > presented. Nothing here is emotional. So I'd ask that you please abstain
> > from making commentary like this which has no basis.
>
> I appreciate everything you write below. But this request is just
> impossible. I will keep raising my opinion and misunderstandings will
> happen.

Well I wouldn't ask you not to express your opinion David, you know I respect
and like you, and by all means push back hard or call out what you think is bad
behaviour :)

I just meant to say, in my view, that there was no basis, but I appreciate
miscommunications happen.

So apologies if I came off as being difficult or rude, it actually wasn't
intended. And to re-emphasise - I have zero personal issue with anybody in this
thread whatsoever!

I just want to find the best way forward, technically and am willing to do
whatever work is required to make the guard region implementation as good as it
possibly can be.

>
> Note that the whole "Honestly David you and the naming. .." thing could have
> been written as "I don't think it's a naming problem."

I feel like I _always_ get in trouble when I try to write in a 'tongue-in-cheek'
style, which is what this was meant to be... so I think herein lies the basis of
the miscommunication :)

I apologise, the household is ill, which maybe affects my judgment in how I
write these, but in general text is a very poor medium. It was meant to be said
in a jolly tone with a wink...

I think maybe I should learn my lesson with these things, I thought the ':p'
would make this clear but yeah, text, poor medium.

Anyway apologies if this seemed disrespectful.

>
> > >
> > >
> > > >
> > > > >
> > > > > [...]
> > > > >
> > > > > >
> > > > > > As I said to you earlier, the _best_ we could do in smaps would be to add a
> > > > > > flag like 'Grd' or something to indicate some part of the VMA is
> > > > > > guarded. But I won't do that unless somebody has an -actual use case- for
> > > > > > it.
> > > > >
> > > > > Right, and that would limit where you have to manually scan. Something
> > > > > similar is being done with uffd-wp markers IIRC.
> > > >
> > > > Yeah that's a good point, but honestly if you're reading smaps that reads
> > > > the page tables, then reading /proc/$pid/pagemaps and reading page tables
> > > > TWICE that seems inefficient vs. just reading /proc/$pid/maps, then reading
> > > > /proc/$pid/pagemaps and reading page tables once.
> > >
> > > Right; I recently wished that we would have an interface to obtain more VMA
> > > flags without having to go through smaps
> >
> > Well maybe that lends itself to the idea of adding a whole new interface in
> > general...
>
> An extended "maps" interface might be reasonable, that allows for exposing
> more things without walking the page tables. (e.g., flags)
>
> Maybe one could have an indicator that says "ever had guard regions in this
> mapping" without actually walking the page tables.

Yeah this is something we've discussed before, but it's a little fraught. Let's
say it was a VMA flag, in this case we'd have to make this flag 'sticky' and not
impact merging (easy enough) to account for splits/merges.

The problem comes in that we would then need to acquire the VMA write lock to do
it, something we don't currently require on application of guard regions.

We'd also have to make sure nothing else makes any assumptions about VMA flags
implying differences in VMAs in this one instance (though we do already do this
for VM_SOFTDIRTY).

I saw this as possibly something like VM_MAYBE_GUARD_REGIONS or something.

>
> --
> Cheers,
>
> David / dhildenb
>
>
David Hildenbrand Feb. 20, 2025, 12:44 p.m. UTC | #31
On 20.02.25 11:15, Lorenzo Stoakes wrote:
> On Thu, Feb 20, 2025 at 11:03:02AM +0100, David Hildenbrand wrote:
>>>> Your conclusion is 'did not participate with upstream'; I don't agree with
>>>> that. But maybe you and Kalesh have a history on that that let's you react
>>>> on his questions IMHO more emotionally than it should have been.
>>>
>>> This is wholly unfair, I have been very reasonable in response to this
>>> thread. I have offered to find solutions, I have tried to understand the
>>> problem in spite of having gone to great lengths to try to discuss the
>>> limitations of the proposed approach in every venue I possibly could.
>>>
>>> I go out of my way to deal professionally and objectively with what is
>>> presented. Nothing here is emotional. So I'd ask that you please abstain
>>> from making commentary like this which has no basis.
>>
>> I appreciate everything you write below. But this request is just
>> impossible. I will keep raising my opinion and misunderstandings will
>> happen.
> 
> Well I wouldn't ask you not to express your opinion David, you know I respect
> and like you, and by all means push back hard or call out what you think is bad
> behaviour :)
> 
> I just meant to say, in my view, that there was no basis, but I appreciate
> miscommunications happen.
 > > So apologies if I came off as being difficult or rude, it actually 
wasn't
> intended. And to re-emphasise - I have zero personal issue with anybody in this
> thread whatsoever!

It sounded to me like you were trying to defend your work (again, IMHO 
too emotionally, and I might have completely misinterpreted that) and 
slowly switching to "friendly fire" (towards me). Apologies from my side 
if I completely misunderstood/misinterpreted that.

To recap: what we have upstream is great; you did a great job. Yes, the 
mechanism has its drawbacks, but that's just part of the design.

Some people maybe have wrong expectations, maybe there were 
misunderstandings, or maybe there are requirements that only now pop up; 
it's sometimes unavoidable, and that's ok.

We can try to document it better (and I was trying to find clues why 
people might be mislead), and see if/how we could sort out these 
requirements. But we can likely not make it perfect in any possible way 
(I'm sure there are plenty of use cases where what we currently have is 
more than sufficient).

 > > I just want to find the best way forward, technically and am 
willing to do
> whatever work is required to make the guard region implementation as good as it
> possibly can be.
> 
>>
>> Note that the whole "Honestly David you and the naming. .." thing could have
>> been written as "I don't think it's a naming problem."
> 
> I feel like I _always_ get in trouble when I try to write in a 'tongue-in-cheek'
> style, which is what this was meant to be... so I think herein lies the basis of
> the miscommunication :)
> 
> I apologise, the household is ill, which maybe affects my judgment in how I
> write these, but in general text is a very poor medium. It was meant to be said
> in a jolly tone with a wink...
> 
> I think maybe I should learn my lesson with these things, I thought the ':p'
> would make this clear but yeah, text, poor medium.
> 
> Anyway apologies if this seemed disrespectful.

No worries, it's hard to really make me angry, and I appreciate your 
openness and your apology (well, and you and your work, obviously).

I'll note, though, if my memory serves me right, that nobody so far ever 
  criticized the way I communicate upstream, or even told me to abstain 
from certain communication.

That probably hurt most, now that a couple of hours passed. Nothing that 
a couple of beers and a bit of self-reflection on my communication style 
can't fix ;)

[...]

>>>>> Yeah that's a good point, but honestly if you're reading smaps that reads
>>>>> the page tables, then reading /proc/$pid/pagemaps and reading page tables
>>>>> TWICE that seems inefficient vs. just reading /proc/$pid/maps, then reading
>>>>> /proc/$pid/pagemaps and reading page tables once.
>>>>
>>>> Right; I recently wished that we would have an interface to obtain more VMA
>>>> flags without having to go through smaps
>>>
>>> Well maybe that lends itself to the idea of adding a whole new interface in
>>> general...
>>
>> An extended "maps" interface might be reasonable, that allows for exposing
>> more things without walking the page tables. (e.g., flags)
>>
>> Maybe one could have an indicator that says "ever had guard regions in this
>> mapping" without actually walking the page tables.
> 
> Yeah this is something we've discussed before, but it's a little fraught. Let's
> say it was a VMA flag, in this case we'd have to make this flag 'sticky' and not
> impact merging (easy enough) to account for splits/merges.
 > > The problem comes in that we would then need to acquire the VMA 
write lock to do
> it, something we don't currently require on application of guard regions.

Right, and we shouldn't write-lock the mmap. We'd need some way to just 
atomically set such an indicator on a VMA.

I'll also note that it might be helpful for smallish region, but 
especially for large ones (including when they are split and the 
indicator is wrong), it's less helpful. I don't have to tell you about 
the VMA merging implications, probably it would be like VM_SOFTDIRTY 
handling :)

> 
> We'd also have to make sure nothing else makes any assumptions about VMA flags
> implying differences in VMAs in this one instance (though we do already do this
> for VM_SOFTDIRTY).
> 
> I saw this as possibly something like VM_MAYBE_GUARD_REGIONS or something.

Yes.
Lorenzo Stoakes Feb. 20, 2025, 1:18 p.m. UTC | #32
On Thu, Feb 20, 2025 at 01:44:20PM +0100, David Hildenbrand wrote:
> On 20.02.25 11:15, Lorenzo Stoakes wrote:
> > On Thu, Feb 20, 2025 at 11:03:02AM +0100, David Hildenbrand wrote:
> > > > > Your conclusion is 'did not participate with upstream'; I don't agree with
> > > > > that. But maybe you and Kalesh have a history on that that let's you react
> > > > > on his questions IMHO more emotionally than it should have been.
> > > >
> > > > This is wholly unfair, I have been very reasonable in response to this
> > > > thread. I have offered to find solutions, I have tried to understand the
> > > > problem in spite of having gone to great lengths to try to discuss the
> > > > limitations of the proposed approach in every venue I possibly could.
> > > >
> > > > I go out of my way to deal professionally and objectively with what is
> > > > presented. Nothing here is emotional. So I'd ask that you please abstain
> > > > from making commentary like this which has no basis.
> > >
> > > I appreciate everything you write below. But this request is just
> > > impossible. I will keep raising my opinion and misunderstandings will
> > > happen.
> >
> > Well I wouldn't ask you not to express your opinion David, you know I respect
> > and like you, and by all means push back hard or call out what you think is bad
> > behaviour :)
> >
> > I just meant to say, in my view, that there was no basis, but I appreciate
> > miscommunications happen.
> > > So apologies if I came off as being difficult or rude, it actually
> wasn't
> > intended. And to re-emphasise - I have zero personal issue with anybody in this
> > thread whatsoever!
>
> It sounded to me like you were trying to defend your work (again, IMHO too
> emotionally, and I might have completely misinterpreted that) and slowly
> switching to "friendly fire" (towards me). Apologies from my side if I
> completely misunderstood/misinterpreted that.

Right this was not at all my intent, sorry if it seemed that way. I may well
have communicated terribly, so apologies on my side too.

>
> To recap: what we have upstream is great; you did a great job. Yes, the
> mechanism has its drawbacks, but that's just part of the design.

Thanks :)

>
> Some people maybe have wrong expectations, maybe there were
> misunderstandings, or maybe there are requirements that only now pop up;
> it's sometimes unavoidable, and that's ok.
>
> We can try to document it better (and I was trying to find clues why people
> might be mislead), and see if/how we could sort out these requirements. But
> we can likely not make it perfect in any possible way (I'm sure there are
> plenty of use cases where what we currently have is more than sufficient).

Sure and I"m very open to adding a documentation page for guard regions, in
fact was considering this very thing recently. I already added man pages
but be good to be able to go into more depth.

>
> > > I just want to find the best way forward, technically and am willing to
> do
> > whatever work is required to make the guard region implementation as good as it
> > possibly can be.
> >
> > >
> > > Note that the whole "Honestly David you and the naming. .." thing could have
> > > been written as "I don't think it's a naming problem."
> >
> > I feel like I _always_ get in trouble when I try to write in a 'tongue-in-cheek'
> > style, which is what this was meant to be... so I think herein lies the basis of
> > the miscommunication :)
> >
> > I apologise, the household is ill, which maybe affects my judgment in how I
> > write these, but in general text is a very poor medium. It was meant to be said
> > in a jolly tone with a wink...
> >
> > I think maybe I should learn my lesson with these things, I thought the ':p'
> > would make this clear but yeah, text, poor medium.
> >
> > Anyway apologies if this seemed disrespectful.
>
> No worries, it's hard to really make me angry, and I appreciate your
> openness and your apology (well, and you and your work, obviously).
>
> I'll note, though, if my memory serves me right, that nobody so far ever
> criticized the way I communicate upstream, or even told me to abstain from
> certain communication.

I wish I could say the same haha, so perhaps this was a problem on my side
honestly. I do have a habit of being 'tongue in cheek' and failing to
communicate that which I did say the last time I wouldn't repeat. It is not
intended, I promise.

As the abstain, was more a British turn of phrase, meaning to say - I
dispute the claim that this is an emotional thing and please don't say this
if it isn't so.

But I understand that of course, you may have interpreted it as so, due to
my having failed to communicate it well.

Again, I must say, text remains replete with possibilities for
miscommunication, misunderstanding and it can so often be difficult to
communicate one's intent.

But again of course, I apologise if I overstepped the line in any way!

>
> That probably hurt most, now that a couple of hours passed. Nothing that a
> couple of beers and a bit of self-reflection on my communication style can't
> fix ;)

Ugh sorry, man. Not my intent. And it seems - I literally OWE YOU pints
now. :) we will fix this at lsf...

Perhaps owe Kalesh some too should he be there... will budget
accordingly... :P

>
> [...]
>
> > > > > > Yeah that's a good point, but honestly if you're reading smaps that reads
> > > > > > the page tables, then reading /proc/$pid/pagemaps and reading page tables
> > > > > > TWICE that seems inefficient vs. just reading /proc/$pid/maps, then reading
> > > > > > /proc/$pid/pagemaps and reading page tables once.
> > > > >
> > > > > Right; I recently wished that we would have an interface to obtain more VMA
> > > > > flags without having to go through smaps
> > > >
> > > > Well maybe that lends itself to the idea of adding a whole new interface in
> > > > general...
> > >
> > > An extended "maps" interface might be reasonable, that allows for exposing
> > > more things without walking the page tables. (e.g., flags)
> > >
> > > Maybe one could have an indicator that says "ever had guard regions in this
> > > mapping" without actually walking the page tables.
> >
> > Yeah this is something we've discussed before, but it's a little fraught. Let's
> > say it was a VMA flag, in this case we'd have to make this flag 'sticky' and not
> > impact merging (easy enough) to account for splits/merges.
> > > The problem comes in that we would then need to acquire the VMA write
> lock to do
> > it, something we don't currently require on application of guard regions.
>
> Right, and we shouldn't write-lock the mmap. We'd need some way to just
> atomically set such an indicator on a VMA.

Hm yeah, could be tricky, we definitely can't manage a new field in
vm_area_struct, this is a very sensitive subject at the moment really with
Suren's work with VMAs allocated via SLAB_TYPESAFE_BY_RCU, putting the lock
into the VMA and the alignment requirements.

Not sure what precedent we'd have with atomic setting of a VMA flag for
this... could be tricky.

>
> I'll also note that it might be helpful for smallish region, but especially
> for large ones (including when they are split and the indicator is wrong),
> it's less helpful. I don't have to tell you about the VMA merging
> implications, probably it would be like VM_SOFTDIRTY handling :)

Yeah indeed now we've simplified merging a lot of possibilities emerge,
this is one!

>
> >
> > We'd also have to make sure nothing else makes any assumptions about VMA flags
> > implying differences in VMAs in this one instance (though we do already do this
> > for VM_SOFTDIRTY).
> >
> > I saw this as possibly something like VM_MAYBE_GUARD_REGIONS or something.
>
> Yes.
>
> --
> Cheers,
>
> David / dhildenb
>

Best, Lorenzo
Suren Baghdasaryan Feb. 20, 2025, 4:21 p.m. UTC | #33
On Thu, Feb 20, 2025 at 5:18 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Thu, Feb 20, 2025 at 01:44:20PM +0100, David Hildenbrand wrote:
> > On 20.02.25 11:15, Lorenzo Stoakes wrote:
> > > On Thu, Feb 20, 2025 at 11:03:02AM +0100, David Hildenbrand wrote:
> > > > > > Your conclusion is 'did not participate with upstream'; I don't agree with
> > > > > > that. But maybe you and Kalesh have a history on that that let's you react
> > > > > > on his questions IMHO more emotionally than it should have been.
> > > > >
> > > > > This is wholly unfair, I have been very reasonable in response to this
> > > > > thread. I have offered to find solutions, I have tried to understand the
> > > > > problem in spite of having gone to great lengths to try to discuss the
> > > > > limitations of the proposed approach in every venue I possibly could.
> > > > >
> > > > > I go out of my way to deal professionally and objectively with what is
> > > > > presented. Nothing here is emotional. So I'd ask that you please abstain
> > > > > from making commentary like this which has no basis.
> > > >
> > > > I appreciate everything you write below. But this request is just
> > > > impossible. I will keep raising my opinion and misunderstandings will
> > > > happen.
> > >
> > > Well I wouldn't ask you not to express your opinion David, you know I respect
> > > and like you, and by all means push back hard or call out what you think is bad
> > > behaviour :)
> > >
> > > I just meant to say, in my view, that there was no basis, but I appreciate
> > > miscommunications happen.
> > > > So apologies if I came off as being difficult or rude, it actually
> > wasn't
> > > intended. And to re-emphasise - I have zero personal issue with anybody in this
> > > thread whatsoever!
> >
> > It sounded to me like you were trying to defend your work (again, IMHO too
> > emotionally, and I might have completely misinterpreted that) and slowly
> > switching to "friendly fire" (towards me). Apologies from my side if I
> > completely misunderstood/misinterpreted that.
>
> Right this was not at all my intent, sorry if it seemed that way. I may well
> have communicated terribly, so apologies on my side too.

Sorry for being late to the party. Was sick for a couple of days.
Lorenzo is right, there was a breakdown in communication at Google and
he has all the rights to be upset. The issue with obfuscators should
have been communicated once it was discovered. I was in regular
discussions with Lorenzo but wasn't directly involved with this
particular project and wasn't aware or did not realize that the
obfuscator issue renders guards unusable for this usecase. My
apologies, I should have asked more questions about it. I suspect
Lorenzo would have implemented this anyway...

To make guard regions work for this usecase, first we (Android) need
to abstract /proc/pid/maps accesses. Only then we can use additional
interfaces like /proc/pid/pagemaps to obtain guard region information.
I'll start figuring out what it takes to insert such an abstraction.
Thanks,
Suren.


>
> >
> > To recap: what we have upstream is great; you did a great job. Yes, the
> > mechanism has its drawbacks, but that's just part of the design.
>
> Thanks :)
>
> >
> > Some people maybe have wrong expectations, maybe there were
> > misunderstandings, or maybe there are requirements that only now pop up;
> > it's sometimes unavoidable, and that's ok.
> >
> > We can try to document it better (and I was trying to find clues why people
> > might be mislead), and see if/how we could sort out these requirements. But
> > we can likely not make it perfect in any possible way (I'm sure there are
> > plenty of use cases where what we currently have is more than sufficient).
>
> Sure and I"m very open to adding a documentation page for guard regions, in
> fact was considering this very thing recently. I already added man pages
> but be good to be able to go into more depth.
>
> >
> > > > I just want to find the best way forward, technically and am willing to
> > do
> > > whatever work is required to make the guard region implementation as good as it
> > > possibly can be.
> > >
> > > >
> > > > Note that the whole "Honestly David you and the naming. .." thing could have
> > > > been written as "I don't think it's a naming problem."
> > >
> > > I feel like I _always_ get in trouble when I try to write in a 'tongue-in-cheek'
> > > style, which is what this was meant to be... so I think herein lies the basis of
> > > the miscommunication :)
> > >
> > > I apologise, the household is ill, which maybe affects my judgment in how I
> > > write these, but in general text is a very poor medium. It was meant to be said
> > > in a jolly tone with a wink...
> > >
> > > I think maybe I should learn my lesson with these things, I thought the ':p'
> > > would make this clear but yeah, text, poor medium.
> > >
> > > Anyway apologies if this seemed disrespectful.
> >
> > No worries, it's hard to really make me angry, and I appreciate your
> > openness and your apology (well, and you and your work, obviously).
> >
> > I'll note, though, if my memory serves me right, that nobody so far ever
> > criticized the way I communicate upstream, or even told me to abstain from
> > certain communication.
>
> I wish I could say the same haha, so perhaps this was a problem on my side
> honestly. I do have a habit of being 'tongue in cheek' and failing to
> communicate that which I did say the last time I wouldn't repeat. It is not
> intended, I promise.
>
> As the abstain, was more a British turn of phrase, meaning to say - I
> dispute the claim that this is an emotional thing and please don't say this
> if it isn't so.
>
> But I understand that of course, you may have interpreted it as so, due to
> my having failed to communicate it well.
>
> Again, I must say, text remains replete with possibilities for
> miscommunication, misunderstanding and it can so often be difficult to
> communicate one's intent.
>
> But again of course, I apologise if I overstepped the line in any way!
>
> >
> > That probably hurt most, now that a couple of hours passed. Nothing that a
> > couple of beers and a bit of self-reflection on my communication style can't
> > fix ;)
>
> Ugh sorry, man. Not my intent. And it seems - I literally OWE YOU pints
> now. :) we will fix this at lsf...
>
> Perhaps owe Kalesh some too should he be there... will budget
> accordingly... :P
>
> >
> > [...]
> >
> > > > > > > Yeah that's a good point, but honestly if you're reading smaps that reads
> > > > > > > the page tables, then reading /proc/$pid/pagemaps and reading page tables
> > > > > > > TWICE that seems inefficient vs. just reading /proc/$pid/maps, then reading
> > > > > > > /proc/$pid/pagemaps and reading page tables once.
> > > > > >
> > > > > > Right; I recently wished that we would have an interface to obtain more VMA
> > > > > > flags without having to go through smaps
> > > > >
> > > > > Well maybe that lends itself to the idea of adding a whole new interface in
> > > > > general...
> > > >
> > > > An extended "maps" interface might be reasonable, that allows for exposing
> > > > more things without walking the page tables. (e.g., flags)
> > > >
> > > > Maybe one could have an indicator that says "ever had guard regions in this
> > > > mapping" without actually walking the page tables.
> > >
> > > Yeah this is something we've discussed before, but it's a little fraught. Let's
> > > say it was a VMA flag, in this case we'd have to make this flag 'sticky' and not
> > > impact merging (easy enough) to account for splits/merges.
> > > > The problem comes in that we would then need to acquire the VMA write
> > lock to do
> > > it, something we don't currently require on application of guard regions.
> >
> > Right, and we shouldn't write-lock the mmap. We'd need some way to just
> > atomically set such an indicator on a VMA.
>
> Hm yeah, could be tricky, we definitely can't manage a new field in
> vm_area_struct, this is a very sensitive subject at the moment really with
> Suren's work with VMAs allocated via SLAB_TYPESAFE_BY_RCU, putting the lock
> into the VMA and the alignment requirements.
>
> Not sure what precedent we'd have with atomic setting of a VMA flag for
> this... could be tricky.
>
> >
> > I'll also note that it might be helpful for smallish region, but especially
> > for large ones (including when they are split and the indicator is wrong),
> > it's less helpful. I don't have to tell you about the VMA merging
> > implications, probably it would be like VM_SOFTDIRTY handling :)
>
> Yeah indeed now we've simplified merging a lot of possibilities emerge,
> this is one!
>
> >
> > >
> > > We'd also have to make sure nothing else makes any assumptions about VMA flags
> > > implying differences in VMAs in this one instance (though we do already do this
> > > for VM_SOFTDIRTY).
> > >
> > > I saw this as possibly something like VM_MAYBE_GUARD_REGIONS or something.
> >
> > Yes.
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >
>
> Best, Lorenzo
Kalesh Singh Feb. 20, 2025, 6:08 p.m. UTC | #34
On Thu, Feb 20, 2025 at 8:22 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Feb 20, 2025 at 5:18 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Thu, Feb 20, 2025 at 01:44:20PM +0100, David Hildenbrand wrote:
> > > On 20.02.25 11:15, Lorenzo Stoakes wrote:
> > > > On Thu, Feb 20, 2025 at 11:03:02AM +0100, David Hildenbrand wrote:
> > > > > > > Your conclusion is 'did not participate with upstream'; I don't agree with
> > > > > > > that. But maybe you and Kalesh have a history on that that let's you react
> > > > > > > on his questions IMHO more emotionally than it should have been.
> > > > > >
> > > > > > This is wholly unfair, I have been very reasonable in response to this
> > > > > > thread. I have offered to find solutions, I have tried to understand the
> > > > > > problem in spite of having gone to great lengths to try to discuss the
> > > > > > limitations of the proposed approach in every venue I possibly could.
> > > > > >
> > > > > > I go out of my way to deal professionally and objectively with what is
> > > > > > presented. Nothing here is emotional. So I'd ask that you please abstain
> > > > > > from making commentary like this which has no basis.
> > > > >
> > > > > I appreciate everything you write below. But this request is just
> > > > > impossible. I will keep raising my opinion and misunderstandings will
> > > > > happen.
> > > >
> > > > Well I wouldn't ask you not to express your opinion David, you know I respect
> > > > and like you, and by all means push back hard or call out what you think is bad
> > > > behaviour :)
> > > >
> > > > I just meant to say, in my view, that there was no basis, but I appreciate
> > > > miscommunications happen.
> > > > > So apologies if I came off as being difficult or rude, it actually
> > > wasn't
> > > > intended. And to re-emphasise - I have zero personal issue with anybody in this
> > > > thread whatsoever!
> > >
> > > It sounded to me like you were trying to defend your work (again, IMHO too
> > > emotionally, and I might have completely misinterpreted that) and slowly
> > > switching to "friendly fire" (towards me). Apologies from my side if I
> > > completely misunderstood/misinterpreted that.
> >
> > Right this was not at all my intent, sorry if it seemed that way. I may well
> > have communicated terribly, so apologies on my side too.

Hi everyone,

Thank you for all the discussion.

I don't find any personal issues with the communication in this
thread, but I appreciate David being the object voice of reason.

I understand it can be frustrating since you have made many efforts to
communicate these tradeoffs. Unfortunately these issues were not known
for the file-backed ELF guard regions for my particular use case.

>
> Sorry for being late to the party. Was sick for a couple of days.
> Lorenzo is right, there was a breakdown in communication at Google and
> he has all the rights to be upset. The issue with obfuscators should
> have been communicated once it was discovered. I was in regular
> discussions with Lorenzo but wasn't directly involved with this
> particular project and wasn't aware or did not realize that the
> obfuscator issue renders guards unusable for this usecase. My
> apologies, I should have asked more questions about it. I suspect
> Lorenzo would have implemented this anyway...
>

Suren's use case is different from mine and this design fits perfectly
for anon guard regions from the allocator. :)

So I think in conclusion, these aren't VMAs and shouldn't be treated
as such; we will advertise them from pagemap for those who need to
know.

-- Kalesh


> To make guard regions work for this usecase, first we (Android) need
> to abstract /proc/pid/maps accesses. Only then we can use additional
> interfaces like /proc/pid/pagemaps to obtain guard region information.
> I'll start figuring out what it takes to insert such an abstraction.
> Thanks,
> Suren.
>
>
> >
> > >
> > > To recap: what we have upstream is great; you did a great job. Yes, the
> > > mechanism has its drawbacks, but that's just part of the design.
> >
> > Thanks :)
> >
> > >
> > > Some people maybe have wrong expectations, maybe there were
> > > misunderstandings, or maybe there are requirements that only now pop up;
> > > it's sometimes unavoidable, and that's ok.
> > >
> > > We can try to document it better (and I was trying to find clues why people
> > > might be mislead), and see if/how we could sort out these requirements. But
> > > we can likely not make it perfect in any possible way (I'm sure there are
> > > plenty of use cases where what we currently have is more than sufficient).
> >
> > Sure and I"m very open to adding a documentation page for guard regions, in
> > fact was considering this very thing recently. I already added man pages
> > but be good to be able to go into more depth.
> >
> > >
> > > > > I just want to find the best way forward, technically and am willing to
> > > do
> > > > whatever work is required to make the guard region implementation as good as it
> > > > possibly can be.
> > > >
> > > > >
> > > > > Note that the whole "Honestly David you and the naming. .." thing could have
> > > > > been written as "I don't think it's a naming problem."
> > > >
> > > > I feel like I _always_ get in trouble when I try to write in a 'tongue-in-cheek'
> > > > style, which is what this was meant to be... so I think herein lies the basis of
> > > > the miscommunication :)
> > > >
> > > > I apologise, the household is ill, which maybe affects my judgment in how I
> > > > write these, but in general text is a very poor medium. It was meant to be said
> > > > in a jolly tone with a wink...
> > > >
> > > > I think maybe I should learn my lesson with these things, I thought the ':p'
> > > > would make this clear but yeah, text, poor medium.
> > > >
> > > > Anyway apologies if this seemed disrespectful.
> > >
> > > No worries, it's hard to really make me angry, and I appreciate your
> > > openness and your apology (well, and you and your work, obviously).
> > >
> > > I'll note, though, if my memory serves me right, that nobody so far ever
> > > criticized the way I communicate upstream, or even told me to abstain from
> > > certain communication.
> >
> > I wish I could say the same haha, so perhaps this was a problem on my side
> > honestly. I do have a habit of being 'tongue in cheek' and failing to
> > communicate that which I did say the last time I wouldn't repeat. It is not
> > intended, I promise.
> >
> > As the abstain, was more a British turn of phrase, meaning to say - I
> > dispute the claim that this is an emotional thing and please don't say this
> > if it isn't so.
> >
> > But I understand that of course, you may have interpreted it as so, due to
> > my having failed to communicate it well.
> >
> > Again, I must say, text remains replete with possibilities for
> > miscommunication, misunderstanding and it can so often be difficult to
> > communicate one's intent.
> >
> > But again of course, I apologise if I overstepped the line in any way!
> >
> > >
> > > That probably hurt most, now that a couple of hours passed. Nothing that a
> > > couple of beers and a bit of self-reflection on my communication style can't
> > > fix ;)
> >
> > Ugh sorry, man. Not my intent. And it seems - I literally OWE YOU pints
> > now. :) we will fix this at lsf...
> >
> > Perhaps owe Kalesh some too should he be there... will budget
> > accordingly... :P
> >
> > >
> > > [...]
> > >
> > > > > > > > Yeah that's a good point, but honestly if you're reading smaps that reads
> > > > > > > > the page tables, then reading /proc/$pid/pagemaps and reading page tables
> > > > > > > > TWICE that seems inefficient vs. just reading /proc/$pid/maps, then reading
> > > > > > > > /proc/$pid/pagemaps and reading page tables once.
> > > > > > >
> > > > > > > Right; I recently wished that we would have an interface to obtain more VMA
> > > > > > > flags without having to go through smaps
> > > > > >
> > > > > > Well maybe that lends itself to the idea of adding a whole new interface in
> > > > > > general...
> > > > >
> > > > > An extended "maps" interface might be reasonable, that allows for exposing
> > > > > more things without walking the page tables. (e.g., flags)
> > > > >
> > > > > Maybe one could have an indicator that says "ever had guard regions in this
> > > > > mapping" without actually walking the page tables.
> > > >
> > > > Yeah this is something we've discussed before, but it's a little fraught. Let's
> > > > say it was a VMA flag, in this case we'd have to make this flag 'sticky' and not
> > > > impact merging (easy enough) to account for splits/merges.
> > > > > The problem comes in that we would then need to acquire the VMA write
> > > lock to do
> > > > it, something we don't currently require on application of guard regions.
> > >
> > > Right, and we shouldn't write-lock the mmap. We'd need some way to just
> > > atomically set such an indicator on a VMA.
> >
> > Hm yeah, could be tricky, we definitely can't manage a new field in
> > vm_area_struct, this is a very sensitive subject at the moment really with
> > Suren's work with VMAs allocated via SLAB_TYPESAFE_BY_RCU, putting the lock
> > into the VMA and the alignment requirements.
> >
> > Not sure what precedent we'd have with atomic setting of a VMA flag for
> > this... could be tricky.
> >
> > >
> > > I'll also note that it might be helpful for smallish region, but especially
> > > for large ones (including when they are split and the indicator is wrong),
> > > it's less helpful. I don't have to tell you about the VMA merging
> > > implications, probably it would be like VM_SOFTDIRTY handling :)
> >
> > Yeah indeed now we've simplified merging a lot of possibilities emerge,
> > this is one!
> >
> > >
> > > >
> > > > We'd also have to make sure nothing else makes any assumptions about VMA flags
> > > > implying differences in VMAs in this one instance (though we do already do this
> > > > for VM_SOFTDIRTY).
> > > >
> > > > I saw this as possibly something like VM_MAYBE_GUARD_REGIONS or something.
> > >
> > > Yes.
> > >
> > > --
> > > Cheers,
> > >
> > > David / dhildenb
> > >
> >
> > Best, Lorenzo