
[v22,1/4] mm: add MAP_DROPPABLE for designating always lazily freeable mappings

Message ID: 20240709130513.98102-2-Jason@zx2c4.com
State: Not Applicable
Delegated to: Herbert Xu
Series: implement getrandom() in vDSO

Commit Message

Jason A. Donenfeld July 9, 2024, 1:05 p.m. UTC
The vDSO getrandom() implementation works with a buffer allocated with a
new system call that has certain requirements:

- It shouldn't be written to core dumps.
  * Easy: VM_DONTDUMP.
- It should be zeroed on fork.
  * Easy: VM_WIPEONFORK.

- It shouldn't be written to swap.
  * Uh-oh: mlock is rlimited.
  * Uh-oh: mlock isn't inherited by forks.

It turns out that the vDSO getrandom() function has three really nice
characteristics that we can exploit to solve this problem:

1) Due to being wiped during fork(), the vDSO code is already robust to
   having the contents of the pages it reads zeroed out midway through
   the function's execution.

2) In the absolute worst case of whatever contingency we're coding for,
   we have the option to fallback to the getrandom() syscall, and
   everything is fine.

3) The buffers the function uses are only ever useful for a maximum of
   60 seconds -- a sort of cache, rather than a long term allocation.

These characteristics mean that we can introduce VM_DROPPABLE, which
has the following semantics:

a) It never is written out to swap.
b) Under memory pressure, mm can just drop the pages (so that they're
   zero when read back again).
c) It is inherited by fork.
d) It doesn't count against the mlock budget, since nothing is locked.

This is fairly simple to implement, with the one snag that we have to
use 64-bit VM_* flags, but this shouldn't be a problem, since the only
consumers will probably be 64-bit anyway.

This way, allocations used by vDSO getrandom() can use:

    VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE

And there will be no problem with using memory when not in use, not
wiping on fork(), coredumps, or writing out to swap.

In order to let vDSO getrandom() use this, expose these via mmap(2) as
MAP_DROPPABLE.

Finally, the provided self test ensures that this is working as desired.

Cc: linux-mm@kvack.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 fs/proc/task_mmu.c                     |  1 +
 include/linux/mm.h                     |  7 ++++
 include/trace/events/mmflags.h         |  7 ++++
 include/uapi/linux/mman.h              |  1 +
 mm/madvise.c                           |  5 ++-
 mm/mlock.c                             |  2 +-
 mm/mmap.c                              | 30 +++++++++++++++
 mm/rmap.c                              | 22 +++++++++--
 tools/include/uapi/linux/mman.h        |  1 +
 tools/testing/selftests/mm/.gitignore  |  1 +
 tools/testing/selftests/mm/Makefile    |  1 +
 tools/testing/selftests/mm/droppable.c | 53 ++++++++++++++++++++++++++
 12 files changed, 126 insertions(+), 5 deletions(-)
 create mode 100644 tools/testing/selftests/mm/droppable.c
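
For reference, a minimal userspace sketch of the allocation pattern this enables, along the lines of the droppable.c selftest added by the patch. MAP_DROPPABLE is assumed to come from the updated UAPI headers, and the exact flag combination is an assumption based on this series rather than the final ABI:

#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <linux/mman.h>	/* assumed to provide MAP_DROPPABLE on a patched kernel */

int main(void)
{
	size_t size = 1 << 20;
	/* Contents may legally read back as zero at any time under memory pressure. */
	char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_ANONYMOUS | MAP_DROPPABLE, -1, 0);
	assert(buf != MAP_FAILED);
	memset(buf, 'A', size);	/* cache-like data only; must tolerate being wiped */
	munmap(buf, size);
	return 0;
}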

Comments

David Hildenbrand July 10, 2024, 3:27 a.m. UTC | #1
On 09.07.24 15:05, Jason A. Donenfeld wrote:
> The vDSO getrandom() implementation works with a buffer allocated with a
> new system call that has certain requirements:
> 
> - It shouldn't be written to core dumps.
>    * Easy: VM_DONTDUMP.
> - It should be zeroed on fork.
>    * Easy: VM_WIPEONFORK.
> 
> - It shouldn't be written to swap.
>    * Uh-oh: mlock is rlimited.
>    * Uh-oh: mlock isn't inherited by forks.
> 
> It turns out that the vDSO getrandom() function has three really nice
> characteristics that we can exploit to solve this problem:
> 
> 1) Due to being wiped during fork(), the vDSO code is already robust to
>     having the contents of the pages it reads zeroed out midway through
>     the function's execution.
> 
> 2) In the absolute worst case of whatever contingency we're coding for,
>     we have the option to fallback to the getrandom() syscall, and
>     everything is fine.
> 
> 3) The buffers the function uses are only ever useful for a maximum of
>     60 seconds -- a sort of cache, rather than a long term allocation.
> 
> These characteristics mean that we can introduce VM_DROPPABLE, which
> has the following semantics:
> 
> a) It never is written out to swap.
> b) Under memory pressure, mm can just drop the pages (so that they're
>     zero when read back again).
> c) It is inherited by fork.
> d) It doesn't count against the mlock budget, since nothing is locked.
> 
> This is fairly simple to implement, with the one snag that we have to
> use 64-bit VM_* flags, but this shouldn't be a problem, since the only
> consumers will probably be 64-bit anyway.
> 
> This way, allocations used by vDSO getrandom() can use:
> 
>      VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE
> 
> And there will be no problem with using memory when not in use, not
> wiping on fork(), coredumps, or writing out to swap.
> 
> In order to let vDSO getrandom() use this, expose these via mmap(2) as
> MAP_DROPPABLE.
> 
> Finally, the provided self test ensures that this is working as desired.

Acked-by: David Hildenbrand <david@redhat.com>


I'll try to think of some corner cases we might be missing.

As raised, I think we could do better at naming, such as "MAP_FREEABLE" 
to match MADV_FREE, MAP_VOLATILE, ... but if nobody else cares, I shall 
not care :)
David Hildenbrand July 10, 2024, 4:05 a.m. UTC | #2
On 10.07.24 05:27, David Hildenbrand wrote:
> On 09.07.24 15:05, Jason A. Donenfeld wrote:
>> The vDSO getrandom() implementation works with a buffer allocated with a
>> new system call that has certain requirements:
>>
>> - It shouldn't be written to core dumps.
>>     * Easy: VM_DONTDUMP.
>> - It should be zeroed on fork.
>>     * Easy: VM_WIPEONFORK.
>>
>> - It shouldn't be written to swap.
>>     * Uh-oh: mlock is rlimited.
>>     * Uh-oh: mlock isn't inherited by forks.
>>
>> It turns out that the vDSO getrandom() function has three really nice
>> characteristics that we can exploit to solve this problem:
>>
>> 1) Due to being wiped during fork(), the vDSO code is already robust to
>>      having the contents of the pages it reads zeroed out midway through
>>      the function's execution.
>>
>> 2) In the absolute worst case of whatever contingency we're coding for,
>>      we have the option to fallback to the getrandom() syscall, and
>>      everything is fine.
>>
>> 3) The buffers the function uses are only ever useful for a maximum of
>>      60 seconds -- a sort of cache, rather than a long term allocation.
>>
>> These characteristics mean that we can introduce VM_DROPPABLE, which
>> has the following semantics:
>>
>> a) It never is written out to swap.
>> b) Under memory pressure, mm can just drop the pages (so that they're
>>      zero when read back again).
>> c) It is inherited by fork.
>> d) It doesn't count against the mlock budget, since nothing is locked.
>>
>> This is fairly simple to implement, with the one snag that we have to
>> use 64-bit VM_* flags, but this shouldn't be a problem, since the only
>> consumers will probably be 64-bit anyway.
>>
>> This way, allocations used by vDSO getrandom() can use:
>>
>>       VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE
>>
>> And there will be no problem with using memory when not in use, not
>> wiping on fork(), coredumps, or writing out to swap.
>>
>> In order to let vDSO getrandom() use this, expose these via mmap(2) as
>> MAP_DROPPABLE.
>>
>> Finally, the provided self test ensures that this is working as desired.
> 
> Acked-by: David Hildenbrand <david@redhat.com>
> 
> 
> I'll try to think of some corner cases we might be missing.

BTW, do we have to handle the folio_set_swapbacked() in sort_folio() as well?


	/* dirty lazyfree */
	if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio)) {
		success = lru_gen_del_folio(lruvec, folio, true);
		VM_WARN_ON_ONCE_FOLIO(!success, folio);
		folio_set_swapbacked(folio);
		lruvec_add_folio_tail(lruvec, folio);
		return true;
	}

Maybe more difficult because we don't have a VMA here ... hmm

IIUC, we have to make sure that no folio_set_swapbacked() would ever get
performed on these folios, correct?
Jason A. Donenfeld July 11, 2024, 12:44 a.m. UTC | #3
Hi David,

On Wed, Jul 10, 2024 at 06:05:34AM +0200, David Hildenbrand wrote:
> BTW, do we have to handle the folio_set_swapbacked() in sort_folio() as well?
> 
> 
> 	/* dirty lazyfree */
> 	if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio)) {
> 		success = lru_gen_del_folio(lruvec, folio, true);
> 		VM_WARN_ON_ONCE_FOLIO(!success, folio);
> 		folio_set_swapbacked(folio);
> 		lruvec_add_folio_tail(lruvec, folio);
> 		return true;
> 	}
> 
> Maybe more difficult because we don't have a VMA here ... hmm
> 
> IIUC, we have to make sure that no folio_set_swapbacked() would ever get
> performed on these folios, correct?

Hmmm, I'm trying to figure out what to do here, and if we have to do
something. All three conditions in that if statement will be true for a
folio in a droppable mapping. That's supposed to match MADV_FREE
mappings.

What is the context of this, though? It's scanning pages for good ones
to evict into swap, right? So if it encounters one that's an MADV_FREE
page, it actually just wants to delete it, rather than sending it to
swap. So it looks like it does just that, and then sets the swapbacked
bit back to true, in case the folio is used for something different
later?

If that's correct, then I don't think we need to do anything for this
one.

If that's not correct, then we'll need to propagate the droppableness
to the folio level. But hopefully we don't need to do that.

What's your analysis of this like?

Jason
Jason A. Donenfeld July 11, 2024, 4:32 a.m. UTC | #4
On Thu, Jul 11, 2024 at 02:44:29AM +0200, Jason A. Donenfeld wrote:
> Hi David,
> 
> On Wed, Jul 10, 2024 at 06:05:34AM +0200, David Hildenbrand wrote:
> > BTW, do we have to handle the folio_set_swapbacked() in sort_folio() as well?
> > 
> > 
> > 	/* dirty lazyfree */
> > 	if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio)) {
> > 		success = lru_gen_del_folio(lruvec, folio, true);
> > 		VM_WARN_ON_ONCE_FOLIO(!success, folio);
> > 		folio_set_swapbacked(folio);
> > 		lruvec_add_folio_tail(lruvec, folio);
> > 		return true;
> > 	}
> > 
> > Maybe more difficult because we don't have a VMA here ... hmm
> > 
> > IIUC, we have to make sure that no folio_set_swapbacked() would ever get
> > performed on these folios, correct?
> 
> Hmmm, I'm trying to figure out what to do here, and if we have to do
> something. All three conditions in that if statement will be true for a
> folio in a droppable mapping. That's supposed to match MADV_FREE
> mappings.
> 
> What is the context of this, though? It's scanning pages for good ones
> to evict into swap, right? So if it encounters one that's an MADV_FREE
> page, it actually just wants to delete it, rather than sending it to
> swap. So it looks like it does just that, and then sets the swapbacked
> bit back to true, in case the folio is used for something different
> later?
> 
> If that's correct, then I don't think we need to do anything for this
> one.
> 
> If that's not correct, then we'll need to propagate the droppableness
> to the folio level. But hopefully we don't need to do that.

Looks like that's not correct. This is for pages that have been dirtied
since calling MADV_FREE. So, hm.
David Hildenbrand July 11, 2024, 4:46 a.m. UTC | #5
On 11.07.24 06:32, Jason A. Donenfeld wrote:
> On Thu, Jul 11, 2024 at 02:44:29AM +0200, Jason A. Donenfeld wrote:
>> Hi David,
>>
>> On Wed, Jul 10, 2024 at 06:05:34AM +0200, David Hildenbrand wrote:
>>> BTW, do we have to handle the folio_set_swapbacked() in sort_folio() as well?
>>>
>>>
>>> 	/* dirty lazyfree */
>>> 	if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio)) {
>>> 		success = lru_gen_del_folio(lruvec, folio, true);
>>> 		VM_WARN_ON_ONCE_FOLIO(!success, folio);
>>> 		folio_set_swapbacked(folio);
>>> 		lruvec_add_folio_tail(lruvec, folio);
>>> 		return true;
>>> 	}
>>>
>>> Maybe more difficult because we don't have a VMA here ... hmm
>>>
>>> IIUC, we have to make sure that no folio_set_swapbacked() would ever get
>>> performed on these folios, correct?
>>
>> Hmmm, I'm trying to figure out what to do here, and if we have to do
>> something. All three conditions in that if statement will be true for a
>> folio in a droppable mapping. That's supposed to match MADV_FREE
>> mappings.
>>
>> What is the context of this, though? It's scanning pages for good ones
>> to evict into swap, right? So if it encounters one that's an MADV_FREE
>> page, it actually just wants to delete it, rather than sending it to
>> swap. So it looks like it does just that, and then sets the swapbacked
>> bit back to true, in case the folio is used for something different
>> later?
>>
>> If that's correct, then I don't think we need to do anything for this
>> one.
>>
>> If that's not correct, then we'll need to propagate the droppableness
>> to the folio level. But hopefully we don't need to do that.
> 
> Looks like that's not correct. This is for pages that have been dirtied
> since calling MADV_FREE. So, hm.
> 

Maybe we can find ways of simply never marking these pages dirty, so we 
don't have to special-case that code where we don't really have a VMA at 
hand?
Linus Torvalds July 11, 2024, 5:07 a.m. UTC | #6
On Wed, 10 Jul 2024 at 21:46, David Hildenbrand <david@redhat.com> wrote:
>
> Maybe we can find ways of simply never marking these pages dirty, so we
> don't have to special-case that code where we don't really have a VMA at
> hand?

That's one option. Jason's patch basically goes "ignore folio dirty
bit for these pages".

Your suggestion basically says "don't turn folios dirty in the first place".

It's mainly the pte_dirty games in mm/vmscan.c that does it
(walk_pte_range), but also the tear-down in mm/memory.c
(zap_present_folio_ptes). Possibly others that I didn't think of.

Both do have access to the vma, although in the case of
walk_pte_range() we don't actually pass it down because we haven't
needed it.

There's also page_vma_mkclean_one(), try_to_unmap_one() and
try_to_migrate_one().  And possibly many others I haven't even thought
about.

So quite a few places that do that "transfer dirty bit from pte to folio".

The other approach might be to just let all the dirty handling happen
- make droppable pages have a "page->mapping" (and not be anonymous),
and have the mapping->a_ops->writepage() just always return success
immediately.

That might actually be a conceptually simpler model. MAP_DROPPABLE
becomes a shared mapping that just has a really cheap writeback that
throws the data away. No need to worry about swap cache or anything
like that, because that's just for anonymous pages.

I say "conceptually simpler", because right now the patch does depend
on just using the regular anon page faulting etc code.

                 Linus
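
As a rough illustration of the alternative Linus outlines (hypothetical names, not code from this patch), the mapping's address_space_operations could be as simple as:

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>

/* "Writeback" for droppable pages: the data is simply discarded. */
static int droppable_writepage(struct page *page, struct writeback_control *wbc)
{
	/*
	 * Report success without writing anything.  A real implementation
	 * would also need to clear the dirty state and handle accounting.
	 */
	unlock_page(page);
	return 0;
}

static const struct address_space_operations droppable_aops = {
	.dirty_folio	= filemap_dirty_folio,	/* normal dirty tracking */
	.writepage	= droppable_writepage,	/* always "succeeds" immediately */
};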
Jason A. Donenfeld July 11, 2024, 5:09 p.m. UTC | #7
Hi Linus, David,

On Wed, Jul 10, 2024 at 10:07:03PM -0700, Linus Torvalds wrote:
> The other approach might be to just let all the dirty handling happen
> - make droppable pages have a "page->mapping" (and not be anonymous),
> and have the mapping->a_ops->writepage() just always return success
> immediately.

When I was working on this patchset this year with the syscall, this is
similar somewhat to the initial approach I was taking with setting up a
special mapping. It turned into kind of a mess and I couldn't get it
working. There's a lot of functionality built around anonymous pages
that would need to be duplicated (I think?). I'll revisit it if need be,
but let's see if I can make avoiding the dirty bit propagation work.

> It's mainly the pte_dirty games in mm/vmscan.c that does it
> (walk_pte_range), but also the tear-down in mm/memory.c
> (zap_present_folio_ptes). Possibly others that I didn't think of.
> 
> Both do have access to the vma, although in the case of
> walk_pte_range() we don't actually pass it down because we haven't
> needed it).

Actually, it's there hanging out in args->vma, and the function makes
use of that member already. So not so bad.

> 
> There's also page_vma_mkclean_one(), try_to_unmap_one() and
> try_to_migrate_one().  And possibly many others I haven't even thought
> about.
> 
> So quite a few places that do that "transfer dirty bit from pte to folio".

Alright, an hour later of fiddling, and it doesn't actually work (yet?)
-- the selftest fails. A diff follows below.

So, hmm... The swapbacked thing really seemed so simple... I wonder if
there's a way of recovering that.

Jason


diff --git a/mm/gup.c b/mm/gup.c
index ca0f5cedce9b..38745cc4fa06 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -990,7 +990,8 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	}
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
-		    !pte_dirty(pte) && !PageDirty(page))
+		    !pte_dirty(pte) && !PageDirty(page) &&
+		    !(vma->vm_flags & VM_DROPPABLE))
 			set_page_dirty(page);
 		/*
 		 * pte_mkyoung() would be more correct here, but atomic care
diff --git a/mm/ksm.c b/mm/ksm.c
index 34c4820e0d3d..2401fc4203ba 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1339,7 +1339,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct folio *folio,
 			goto out_unlock;
 		}

-		if (pte_dirty(entry))
+		if (pte_dirty(entry) && !(vma->vm_flags & VM_DROPPABLE))
 			folio_mark_dirty(folio);
 		entry = pte_mkclean(entry);

@@ -1518,7 +1518,7 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
 			 * Page reclaim just frees a clean page with no dirty
 			 * ptes: make sure that the ksm page would be swapped.
 			 */
-			if (!PageDirty(page))
+			if (!PageDirty(page) && !(vma->vm_flags & VM_DROPPABLE))
 				SetPageDirty(page);
 			err = 0;
 		} else if (pages_identical(page, kpage))
diff --git a/mm/memory.c b/mm/memory.c
index d10e616d7389..6a02d16309be 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1479,7 +1479,7 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,

 	if (!folio_test_anon(folio)) {
 		ptent = get_and_clear_full_ptes(mm, addr, pte, nr, tlb->fullmm);
-		if (pte_dirty(ptent)) {
+		if (pte_dirty(ptent) && !(vma->vm_flags & VM_DROPPABLE)) {
 			folio_mark_dirty(folio);
 			if (tlb_delay_rmap(tlb)) {
 				delay_rmap = true;
@@ -6140,7 +6140,8 @@ static int __access_remote_vm(struct mm_struct *mm, unsigned long addr,
 			if (write) {
 				copy_to_user_page(vma, page, addr,
 						  maddr + offset, buf, bytes);
-				set_page_dirty_lock(page);
+				if (!(vma->vm_flags & VM_DROPPABLE))
+					set_page_dirty_lock(page);
 			} else {
 				copy_from_user_page(vma, page, addr,
 						    buf, maddr + offset, bytes);
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index aecc71972a87..72d3f8eaae6e 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -216,7 +216,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			migrate->cpages++;

 			/* Set the dirty flag on the folio now the pte is gone. */
-			if (pte_dirty(pte))
+			if (pte_dirty(pte) && !(vma->vm_flags & VM_DROPPABLE))
 				folio_mark_dirty(folio);

 			/* Setup special migration page table entry */
diff --git a/mm/rmap.c b/mm/rmap.c
index 1f9b5a9cb121..1688d06bb617 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1397,12 +1397,7 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 	VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
 	VM_BUG_ON_VMA(address < vma->vm_start ||
 			address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
-	/*
-	 * VM_DROPPABLE mappings don't swap; instead they're just dropped when
-	 * under memory pressure.
-	 */
-	if (!(vma->vm_flags & VM_DROPPABLE))
-		__folio_set_swapbacked(folio);
+	__folio_set_swapbacked(folio);
 	__folio_set_anon(folio, vma, address, true);

 	if (likely(!folio_test_large(folio))) {
@@ -1777,7 +1772,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);

 		/* Set the dirty flag on the folio now the pte is gone. */
-		if (pte_dirty(pteval))
+		if (pte_dirty(pteval) && !(vma->vm_flags & VM_DROPPABLE))
 			folio_mark_dirty(folio);

 		/* Update high watermark before we lower rss */
@@ -1822,7 +1817,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			}

 			/* MADV_FREE page check */
-			if (!folio_test_swapbacked(folio)) {
+			if (!folio_test_swapbacked(folio) || (vma->vm_flags & VM_DROPPABLE)) {
 				int ref_count, map_count;

 				/*
@@ -1846,13 +1841,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				 * plus the rmap(s) (dropped by discard:).
 				 */
 				if (ref_count == 1 + map_count &&
-				    (!folio_test_dirty(folio) ||
-				     /*
-				      * Unlike MADV_FREE mappings, VM_DROPPABLE
-				      * ones can be dropped even if they've
-				      * been dirtied.
-				      */
-				     (vma->vm_flags & VM_DROPPABLE))) {
+				    !folio_test_dirty(folio)) {
 					dec_mm_counter(mm, MM_ANONPAGES);
 					goto discard;
 				}
@@ -1862,12 +1851,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				 * discarded. Remap the page to page table.
 				 */
 				set_pte_at(mm, address, pvmw.pte, pteval);
-				/*
-				 * Unlike MADV_FREE mappings, VM_DROPPABLE ones
-				 * never get swap backed on failure to drop.
-				 */
-				if (!(vma->vm_flags & VM_DROPPABLE))
-					folio_set_swapbacked(folio);
+				folio_set_swapbacked(folio);
 				ret = false;
 				page_vma_mapped_walk_done(&pvmw);
 				break;
@@ -2151,7 +2135,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 		}

 		/* Set the dirty flag on the folio now the pte is gone. */
-		if (pte_dirty(pteval))
+		if (pte_dirty(pteval) && !(vma->vm_flags & VM_DROPPABLE))
 			folio_mark_dirty(folio);

 		/* Update high watermark before we lower rss */
@@ -2397,7 +2381,7 @@ static bool page_make_device_exclusive_one(struct folio *folio,
 		pteval = ptep_clear_flush(vma, address, pvmw.pte);

 		/* Set the dirty flag on the folio now the pte is gone. */
-		if (pte_dirty(pteval))
+		if (pte_dirty(pteval) && !(vma->vm_flags & VM_DROPPABLE))
 			folio_mark_dirty(folio);

 		/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2e34de9cd0d4..cf5b26bd067a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3396,6 +3396,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 		walk->mm_stats[MM_LEAF_YOUNG]++;

 		if (pte_dirty(ptent) && !folio_test_dirty(folio) &&
+		    !(args->vma->vm_flags & VM_DROPPABLE) &&
 		    !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
 		      !folio_test_swapcache(folio)))
 			folio_mark_dirty(folio);
@@ -3476,6 +3477,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
 		walk->mm_stats[MM_LEAF_YOUNG]++;

 		if (pmd_dirty(pmd[i]) && !folio_test_dirty(folio) &&
+		    !(vma->vm_flags & VM_DROPPABLE) &&
 		    !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
 		      !folio_test_swapcache(folio)))
 			folio_mark_dirty(folio);
@@ -4076,6 +4078,7 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 		young++;

 		if (pte_dirty(ptent) && !folio_test_dirty(folio) &&
+		    !(vma->vm_flags & VM_DROPPABLE) &&
 		    !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
 		      !folio_test_swapcache(folio)))
 			folio_mark_dirty(folio);
Jason A. Donenfeld July 11, 2024, 5:17 p.m. UTC | #8
On Thu, Jul 11, 2024 at 07:09:36PM +0200, Jason A. Donenfeld wrote:
> So, hmm... The swapbacked thing really seemed so simple... I wonder if
> there's a way of recovering that.

Not wanting to introduce a new bitflag, I went looking and noticed this:

/*
 * Private page markings that may be used by the filesystem that owns the page
 * for its own purposes.
 * - PG_private and PG_private_2 cause release_folio() and co to be invoked
 */
PAGEFLAG(Private, private, PF_ANY)
PAGEFLAG(Private2, private_2, PF_ANY) TESTSCFLAG(Private2, private_2, PF_ANY)
PAGEFLAG(OwnerPriv1, owner_priv_1, PF_ANY)
        TESTCLEARFLAG(OwnerPriv1, owner_priv_1, PF_ANY)

The below +4/-1 diff is pretty hacky and might be illegal in the state
of California, but I think it does work. The idea is that if that bit is
normally only used for filesystems, then in the anonymous case, it's
free to be used for this.

Any opinions about this, or a suggestion on how to do that in a less
ugly way?

Jason


diff --git a/mm/rmap.c b/mm/rmap.c
index 1f9b5a9cb121..090554277e4a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1403,6 +1403,8 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 	 */
 	if (!(vma->vm_flags & VM_DROPPABLE))
 		__folio_set_swapbacked(folio);
+	else
+		folio_set_owner_priv_1(folio);
 	__folio_set_anon(folio, vma, address, true);

 	if (likely(!folio_test_large(folio))) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2e34de9cd0d4..398b46027e8f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4266,7 +4266,8 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 	}

 	/* dirty lazyfree */
-	if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio)) {
+	if (type == LRU_GEN_FILE && folio_test_anon(folio) &&
+	    folio_test_dirty(folio) && !folio_test_owner_priv_1(folio)) {
 		success = lru_gen_del_folio(lruvec, folio, true);
 		VM_WARN_ON_ONCE_FOLIO(!success, folio);
 		folio_set_swapbacked(folio);
David Hildenbrand July 11, 2024, 5:24 p.m. UTC | #9
On 11.07.24 19:17, Jason A. Donenfeld wrote:
> On Thu, Jul 11, 2024 at 07:09:36PM +0200, Jason A. Donenfeld wrote:
>> So, hmm... The swapbacked thing really seemed so simple... I wonder if
>> there's a way of recovering that.
> 
> Not wanting to introduce a new bitflag, I went looking and noticed this:
> 
> /*
>   * Private page markings that may be used by the filesystem that owns the page
>   * for its own purposes.
>   * - PG_private and PG_private_2 cause release_folio() and co to be invoked
>   */
> PAGEFLAG(Private, private, PF_ANY)
> PAGEFLAG(Private2, private_2, PF_ANY) TESTSCFLAG(Private2, private_2, PF_ANY)
> PAGEFLAG(OwnerPriv1, owner_priv_1, PF_ANY)
>          TESTCLEARFLAG(OwnerPriv1, owner_priv_1, PF_ANY)
> 
> The below +4/-1 diff is pretty hacky and might be illegal in the state
> of California, but I think it does work. The idea is that if that bit is
> normally only used for filesystems, then in the anonymous case, it's
> free to be used for this.
> 
> Any opinions about this, or a suggestion on how to do that in a less
> ugly way?
> 
> Jason
> 
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 1f9b5a9cb121..090554277e4a 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1403,6 +1403,8 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>   	 */
>   	if (!(vma->vm_flags & VM_DROPPABLE))
>   		__folio_set_swapbacked(folio);
> +	else
> +		folio_set_owner_priv_1(folio);


PG_owner_priv_1 maps to PG_swapcache? :)
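
For context, the aliasing being referred to in include/linux/page-flags.h reads roughly as follows, which is why an anonymous folio with PG_owner_priv_1 set would appear to be in the swapcache:

	/* Filesystems */
	PG_checked = PG_owner_priv_1,

	/* SwapBacked */
	PG_swapcache = PG_owner_priv_1,	/* Swap page: swp_entry_t in private */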
David Hildenbrand July 11, 2024, 5:27 p.m. UTC | #10
On 11.07.24 19:24, David Hildenbrand wrote:
> On 11.07.24 19:17, Jason A. Donenfeld wrote:
>> On Thu, Jul 11, 2024 at 07:09:36PM +0200, Jason A. Donenfeld wrote:
>>> So, hmm... The swapbacked thing really seemed so simple... I wonder if
>>> there's a way of recovering that.
>>
>> Not wanting to introduce a new bitflag, I went looking and noticed this:
>>
>> /*
>>    * Private page markings that may be used by the filesystem that owns the page
>>    * for its own purposes.
>>    * - PG_private and PG_private_2 cause release_folio() and co to be invoked
>>    */
>> PAGEFLAG(Private, private, PF_ANY)
>> PAGEFLAG(Private2, private_2, PF_ANY) TESTSCFLAG(Private2, private_2, PF_ANY)
>> PAGEFLAG(OwnerPriv1, owner_priv_1, PF_ANY)
>>           TESTCLEARFLAG(OwnerPriv1, owner_priv_1, PF_ANY)
>>
>> The below +4/-1 diff is pretty hacky and might be illegal in the state
>> of California, but I think it does work. The idea is that if that bit is
>> normally only used for filesystems, then in the anonymous case, it's
>> free to be used for this.
>>
>> Any opinions about this, or a suggestion on how to do that in a less
>> ugly way?
>>
>> Jason
>>
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 1f9b5a9cb121..090554277e4a 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1403,6 +1403,8 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>>    	 */
>>    	if (!(vma->vm_flags & VM_DROPPABLE))
>>    		__folio_set_swapbacked(folio);
>> +	else
>> +		folio_set_owner_priv_1(folio);
> 
> 
> PG_owner_priv_1 maps to PG_swapcache? :)

Maybe the combination !swapbacked && swapcache could be used to indicate 
such folios. (we will never set swapbacked)

But likely we have to be a bit careful here. We don't want 
folio_test_swapcache() to return for folios that ... are not in the 
swapcache.
Jason A. Donenfeld July 11, 2024, 5:49 p.m. UTC | #11
On Thu, Jul 11, 2024 at 07:24:33PM +0200, David Hildenbrand wrote:
> On 11.07.24 19:17, Jason A. Donenfeld wrote:
> > On Thu, Jul 11, 2024 at 07:09:36PM +0200, Jason A. Donenfeld wrote:
> >> So, hmm... The swapbacked thing really seemed so simple... I wonder if
> >> there's a way of recovering that.
> > 
> > Not wanting to introduce a new bitflag, I went looking and noticed this:
> > 
> > /*
> >   * Private page markings that may be used by the filesystem that owns the page
> >   * for its own purposes.
> >   * - PG_private and PG_private_2 cause release_folio() and co to be invoked
> >   */
> > PAGEFLAG(Private, private, PF_ANY)
> > PAGEFLAG(Private2, private_2, PF_ANY) TESTSCFLAG(Private2, private_2, PF_ANY)
> > PAGEFLAG(OwnerPriv1, owner_priv_1, PF_ANY)
> >          TESTCLEARFLAG(OwnerPriv1, owner_priv_1, PF_ANY)
> > 
> > The below +4/-1 diff is pretty hacky and might be illegal in the state
> > of California, but I think it does work. The idea is that if that bit is
> > normally only used for filesystems, then in the anonymous case, it's
> > free to be used for this.
> > 
> > Any opinions about this, or a suggestion on how to do that in a less
> > ugly way?
> > 
> > Jason
> > 
> > 
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 1f9b5a9cb121..090554277e4a 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1403,6 +1403,8 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
> >   	 */
> >   	if (!(vma->vm_flags & VM_DROPPABLE))
> >   		__folio_set_swapbacked(folio);
> > +	else
> > +		folio_set_owner_priv_1(folio);
> 
> 
> PG_owner_priv_1 maps to PG_swapcache? :)

Oh, drat, it looks like this overloading is nothing new then.
Jason A. Donenfeld July 11, 2024, 5:54 p.m. UTC | #12
On Thu, Jul 11, 2024 at 07:27:27PM +0200, David Hildenbrand wrote:
> > PG_owner_priv_1 maps to PG_swapcache? :)
> 
> Maybe the combination !swapbacked && swapcache could be used to indicate 
> such folios. (we will never set swapbacked)
> 
> But likely we have to be a bit careful here. We don't want 
> folio_test_swapcache() to return for folios that ... are not in the 
> swapcache.

I was thinking that too, but I'm afraid it's going to be another
whack-a-mole nightmare. Even for things like task_mmu in procfs that
show stats, that's going to be wonky.

Any other flags we can overload that aren't going to be already used in
our case?

Jason
Jason A. Donenfeld July 11, 2024, 5:56 p.m. UTC | #13
On Thu, Jul 11, 2024 at 07:54:34PM +0200, Jason A. Donenfeld wrote:
> On Thu, Jul 11, 2024 at 07:27:27PM +0200, David Hildenbrand wrote:
> > > PG_owner_priv_1 maps to PG_swapcache? :)
> > 
> > Maybe the combination !swapbacked && swapcache could be used to indicate 
> > such folios. (we will never set swapbacked)
> > 
> > But likely we have to be a bit careful here. We don't want 
> > folio_test_swapcache() to return for folios that ... are not in the 
> > swapcache.
> 
> I was thinking that too, but I'm afraid it's going to be another
> whack-a-mole nightmare. Even for things like task_mmu in procfs that
> show stats, that's going to be wonky.
> 
> Any other flags we can overload that aren't going to be already used in
> our case?

PG_error / folio_set_error seems unused in the non-IO case.
Linus Torvalds July 11, 2024, 5:57 p.m. UTC | #14
On Thu, 11 Jul 2024 at 10:09, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> When I was working on this patchset this year with the syscall, this is
> similar somewhat to the initial approach I was taking with setting up a
> special mapping. It turned into kind of a mess and I couldn't get it
> working. There's a lot of functionality built around anonymous pages
> that would need to be duplicated (I think?).

Yeah, I was kind of assuming that. You'd need to handle VM_DROPPABLE
in the fault path specially, the way we currently split up based on
vma_is_anonymous(), eg

        if (vma_is_anonymous(vmf->vma))
                return do_anonymous_page(vmf);
        else
                return do_fault(vmf);

in do_pte_missing() etc.

I don't actually think it would be too hard, but it's a more
"conceptual" change, and it's probably not worth it.

> Alright, an hour later of fiddling, and it doesn't actually work (yet?)
> -- the selftest fails. A diff follows below.

May I suggest a slightly different approach: do what we did for "pte_mkwrite()".

It needed the vma too, for not too dissimilar reasons: special dirty
bit handling for the shadow stack. See

  bb3aadf7d446 ("x86/mm: Start actually marking _PAGE_SAVED_DIRTY")
  b497e52ddb2a ("x86/mm: Teach pte_mkwrite() about stack memory")

and now we have "pte_mkwrite_novma()" with the old semantics for the
legacy cases that didn't get converted - whether it's because the
architecture doesn't have the issue, or because it's a kernel pte.

And the conversion was actually quite pain-free, because we have

  #ifndef pte_mkwrite
  static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
  {
        return pte_mkwrite_novma(pte);
  }
  #endif

so all any architecture that didn't want this needed to do was to
rename their pte_mkwrite() to pte_mkwrite_novma() and they were done.
In fact, that was done first as basically semantically no-op patches:

   2f0584f3f4bd ("mm: Rename arch pte_mkwrite()'s to pte_mkwrite_novma()")
   6ecc21bb432d ("mm: Move pte/pmd_mkwrite() callers with no VMA to _novma()")
   161e393c0f63 ("mm: Make pte_mkwrite() take a VMA")

which made this all very pain-free (and was largely a sed script, I think).

> -                   !pte_dirty(pte) && !PageDirty(page))
> +                   !pte_dirty(pte) && !PageDirty(page) &&
> +                   !(vma->vm_flags & VM_DROPPABLE))

So instead of this kind of thing, we'd have

> -                   !pte_dirty(pte) && !PageDirty(page))
> +                   !pte_dirty(pte, vma) && !PageDirty(page) &&

and the advantage here is that you can't miss anybody by mistake. The
compiler will be very unhappy if you don't pass in the vma, and then
any places that don't have a vma get converted to "pte_dirty_novma()".

We don't actually have all that many users of pte_dirty(), so it
doesn't look too nasty. And if we make the pte_dirty() semantics
depend on the vma, I really think we should do it the same way we did
pte_mkwrite().

Long-term, maybe we should just aim to always pass in the vma to the
pte_xyz() functions, but...

          Linus
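
A sketch of what that conversion could look like, mirroring the pte_mkwrite() pattern quoted above; pte_dirty_novma() is the hypothetically renamed architecture helper, not an existing function:

#ifndef pte_dirty
static inline bool pte_dirty(pte_t pte, struct vm_area_struct *vma)
{
	/* VM_DROPPABLE contents may vanish anyway, so never report them dirty. */
	if (vma->vm_flags & VM_DROPPABLE)
		return false;
	return pte_dirty_novma(pte);
}
#endif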
Jason A. Donenfeld July 11, 2024, 6:08 p.m. UTC | #15
On Thu, Jul 11, 2024 at 07:56:39PM +0200, Jason A. Donenfeld wrote:
> On Thu, Jul 11, 2024 at 07:54:34PM +0200, Jason A. Donenfeld wrote:
> > On Thu, Jul 11, 2024 at 07:27:27PM +0200, David Hildenbrand wrote:
> > > > PG_owner_priv_1 maps to PG_swapcache? :)
> > > 
> > > Maybe the combination !swapbacked && swapcache could be used to indicate 
> > > such folios. (we will never set swapbacked)
> > > 
> > > But likely we have to be a bit careful here. We don't want 
> > > folio_test_swapcache() to return for folios that ... are not in the 
> > > swapcache.
> > 
> > I was thinking that too, but I'm afraid it's going to be another
> > whack-a-mole nightmare. Even for things like task_mmu in procfs that
> > show stats, that's going to be wonky.
> > 
> > Any other flags we can overload that aren't going to be already used in
> > our case?
> 
> PG_error / folio_set_error seems unused in the non-IO case.

And PG_large_rmappable seems to only be used for hugetlb branches.

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index b9e914e1face..7fdc03197438 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -190,6 +190,7 @@ enum pageflags {
 	/* At least one page in this folio has the hwpoison flag set */
 	PG_has_hwpoisoned = PG_error,
 	PG_large_rmappable = PG_workingset, /* anon or file-backed */
+	PG_droppable = PG_error, /* anon droppable, not hugetlb */
 };

 #define PAGEFLAGS_MASK		((1UL << NR_PAGEFLAGS) - 1)
@@ -640,6 +641,8 @@ FOLIO_TEST_CLEAR_FLAG_FALSE(young)
 FOLIO_FLAG_FALSE(idle)
 #endif

+FOLIO_FLAG(droppable, FOLIO_SECOND_PAGE)
+
 /*
  * PageReported() is used to track reported free pages within the Buddy
  * allocator. We can use the non-atomic version of the test and set
diff --git a/mm/rmap.c b/mm/rmap.c
index 1f9b5a9cb121..73b4052b2f82 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1403,6 +1403,8 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 	 */
 	if (!(vma->vm_flags & VM_DROPPABLE))
 		__folio_set_swapbacked(folio);
+	else
+		folio_set_droppable(folio);
 	__folio_set_anon(folio, vma, address, true);

 	if (likely(!folio_test_large(folio))) {
@@ -1852,7 +1854,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				      * ones can be dropped even if they've
 				      * been dirtied.
 				      */
-				     (vma->vm_flags & VM_DROPPABLE))) {
+				      folio_test_droppable(folio))) {
 					dec_mm_counter(mm, MM_ANONPAGES);
 					goto discard;
 				}
@@ -1866,7 +1868,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				 * Unlike MADV_FREE mappings, VM_DROPPABLE ones
 				 * never get swap backed on failure to drop.
 				 */
-				if (!(vma->vm_flags & VM_DROPPABLE))
+				if (!folio_test_droppable(folio))
 					folio_set_swapbacked(folio);
 				ret = false;
 				page_vma_mapped_walk_done(&pvmw);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2e34de9cd0d4..41340f2a12c7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4266,7 +4266,8 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 	}

 	/* dirty lazyfree */
-	if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio)) {
+	if (type == LRU_GEN_FILE && folio_test_anon(folio) &&
+	    folio_test_dirty(folio) && !folio_test_droppable(folio)) {
 		success = lru_gen_del_folio(lruvec, folio, true);
 		VM_WARN_ON_ONCE_FOLIO(!success, folio);
 		folio_set_swapbacked(folio);
David Hildenbrand July 11, 2024, 6:24 p.m. UTC | #16
On 11.07.24 20:08, Jason A. Donenfeld wrote:
> On Thu, Jul 11, 2024 at 07:56:39PM +0200, Jason A. Donenfeld wrote:
>> On Thu, Jul 11, 2024 at 07:54:34PM +0200, Jason A. Donenfeld wrote:
>>> On Thu, Jul 11, 2024 at 07:27:27PM +0200, David Hildenbrand wrote:
>>>>> PG_owner_priv_1 maps to PG_swapcache? :)
>>>>
>>>> Maybe the combination !swapbacked && swapcache could be used to indicate
>>>> such folios. (we will never set swapbacked)
>>>>
>>>> But likely we have to be a bit careful here. We don't want
>>>> folio_test_swapcache() to return for folios that ... are not in the
>>>> swapcache.
>>>
>>> I was thinking that too, but I'm afraid it's going to be another
>>> whack-a-mole nightmare. Even for things like task_mmu in procfs that
>>> show stats, that's going to be wonky.
>>>
>>> Any other flags we can overload that aren't going to be already used in
>>> our case?
>>
>> PG_error / folio_set_error seems unused in the non-IO case.
> 

Note that Willy is about to remove PG_error IIRC.

> And PG_large_rmappable seems to only be used for hugetlb branches.

It should be set for THP/large folios.
Jason A. Donenfeld July 11, 2024, 6:54 p.m. UTC | #17
On Thu, Jul 11, 2024 at 08:24:07PM +0200, David Hildenbrand wrote:
> > And PG_large_rmappable seems to only be used for hugetlb branches.
> 
> It should be set for THP/large folios.

And it's tested too, apparently.

Okay, well, how disappointing is this below? Because I'm running out of
tricks for flag reuse.

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index b9e914e1face..c1ea49a7f198 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -110,6 +110,7 @@ enum pageflags {
 	PG_workingset,
 	PG_error,
 	PG_owner_priv_1,	/* Owner use. If pagecache, fs may use*/
+	PG_owner_priv_2,
 	PG_arch_1,
 	PG_reserved,
 	PG_private,		/* If pagecache, has fs-private data */
@@ -190,6 +191,9 @@ enum pageflags {
 	/* At least one page in this folio has the hwpoison flag set */
 	PG_has_hwpoisoned = PG_error,
 	PG_large_rmappable = PG_workingset, /* anon or file-backed */
+
+	/* Zero page under memory pressure. */
+	PG_droppable = PG_owner_priv_2,
 };
 
 #define PAGEFLAGS_MASK		((1UL << NR_PAGEFLAGS) - 1)
@@ -549,6 +553,8 @@ PAGEFLAG(Private, private, PF_ANY)
 PAGEFLAG(Private2, private_2, PF_ANY) TESTSCFLAG(Private2, private_2, PF_ANY)
 PAGEFLAG(OwnerPriv1, owner_priv_1, PF_ANY)
 	TESTCLEARFLAG(OwnerPriv1, owner_priv_1, PF_ANY)
+PAGEFLAG(OwnerPriv2, owner_priv_2, PF_ANY)
+	TESTCLEARFLAG(OwnerPriv2, owner_priv_2, PF_ANY)
 
 /*
  * Only test-and-set exist for PG_writeback.  The unconditional operators are
@@ -640,6 +646,8 @@ FOLIO_TEST_CLEAR_FLAG_FALSE(young)
 FOLIO_FLAG_FALSE(idle)
 #endif
 
+FOLIO_FLAG(droppable, FOLIO_SECOND_PAGE)
+
 /*
  * PageReported() is used to track reported free pages within the Buddy
  * allocator. We can use the non-atomic version of the test and set
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index b63d211bd141..986551588805 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -108,6 +108,7 @@
 	DEF_PAGEFLAG_NAME(active),					\
 	DEF_PAGEFLAG_NAME(workingset),					\
 	DEF_PAGEFLAG_NAME(owner_priv_1),				\
+	DEF_PAGEFLAG_NAME(owner_priv_2),				\
 	DEF_PAGEFLAG_NAME(arch_1),					\
 	DEF_PAGEFLAG_NAME(reserved),					\
 	DEF_PAGEFLAG_NAME(private),					\
diff --git a/mm/rmap.c b/mm/rmap.c
index 1f9b5a9cb121..73b4052b2f82 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1403,6 +1403,8 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 	 */
 	if (!(vma->vm_flags & VM_DROPPABLE))
 		__folio_set_swapbacked(folio);
+	else
+		folio_set_droppable(folio);
 	__folio_set_anon(folio, vma, address, true);
 
 	if (likely(!folio_test_large(folio))) {
@@ -1852,7 +1854,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				      * ones can be dropped even if they've
 				      * been dirtied.
 				      */
-				     (vma->vm_flags & VM_DROPPABLE))) {
+				      folio_test_droppable(folio))) {
 					dec_mm_counter(mm, MM_ANONPAGES);
 					goto discard;
 				}
@@ -1866,7 +1868,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				 * Unlike MADV_FREE mappings, VM_DROPPABLE ones
 				 * never get swap backed on failure to drop.
 				 */
-				if (!(vma->vm_flags & VM_DROPPABLE))
+				if (!folio_test_droppable(folio))
 					folio_set_swapbacked(folio);
 				ret = false;
 				page_vma_mapped_walk_done(&pvmw);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2e34de9cd0d4..41340f2a12c7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4266,7 +4266,8 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 	}
 
 	/* dirty lazyfree */
-	if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio)) {
+	if (type == LRU_GEN_FILE && folio_test_anon(folio) &&
+	    folio_test_dirty(folio) && !folio_test_droppable(folio)) {
 		success = lru_gen_del_folio(lruvec, folio, true);
 		VM_WARN_ON_ONCE_FOLIO(!success, folio);
 		folio_set_swapbacked(folio);
David Hildenbrand July 11, 2024, 6:56 p.m. UTC | #18
On 11.07.24 20:54, Jason A. Donenfeld wrote:
> On Thu, Jul 11, 2024 at 08:24:07PM +0200, David Hildenbrand wrote:
>>> And PG_large_rmappable seems to only be used for hugetlb branches.
>>
>> It should be set for THP/large folios.
> 
> And it's tested too, apparently.
> 
> Okay, well, how disappointing is this below? Because I'm running out of
> tricks for flag reuse.
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index b9e914e1face..c1ea49a7f198 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -110,6 +110,7 @@ enum pageflags {
>   	PG_workingset,
>   	PG_error,
>   	PG_owner_priv_1,	/* Owner use. If pagecache, fs may use*/
> +	PG_owner_priv_2,

Oh no, no new page flags please :)

Maybe just follow what Linus suggested: pass vma to pte_dirty() and 
always return false for these special VMAs.
David Hildenbrand July 11, 2024, 7:07 p.m. UTC | #19
On 11.07.24 19:57, Linus Torvalds wrote:
> On Thu, 11 Jul 2024 at 10:09, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>>
>> When I was working on this patchset this year with the syscall, this is
>> similar somewhat to the initial approach I was taking with setting up a
>> special mapping. It turned into kind of a mess and I couldn't get it
>> working. There's a lot of functionality built around anonymous pages
>> that would need to be duplicated (I think?).
> 
> Yeah, I was kind of assuming that. You'd need to handle VM_DROPPABLE
> in the fault path specially, the way we currently split up based on
> vma_is_anonymous(), eg
> 
>          if (vma_is_anonymous(vmf->vma))
>                  return do_anonymous_page(vmf);
>          else
>                  return do_fault(vmf);
> 
> in do_pte_missing() etc.
> 
> I don't actually think it would be too hard, but it's a more
> "conceptual" change, and it's probably not worth it.
> 
>> Alright, an hour later of fiddling, and it doesn't actually work (yet?)
>> -- the selftest fails. A diff follows below.
> 
> May I suggest a slightly different approach: do what we did for "pte_mkwrite()".
> 
> It needed the vma too, for not too dissimilar reasons: special dirty
> bit handling for the shadow stack. See
> 
>    bb3aadf7d446 ("x86/mm: Start actually marking _PAGE_SAVED_DIRTY")
>    b497e52ddb2a ("x86/mm: Teach pte_mkwrite() about stack memory")
> 
> and now we have "pte_mkwrite_novma()" with the old semantics for the
> legacy cases that didn't get converted - whether it's because the
> architecture doesn't have the issue, or because it's a kernel pte.
> 
> And the conversion was actually quite pain-free, because we have
> 
>    #ifndef pte_mkwrite
>    static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
>    {
>          return pte_mkwrite_novma(pte);
>    }
>    #endif
> 
> so all any architecture that didn't want this needed to do was to
> rename their pte_mkwrite() to pte_mkwrite_novma() and they were done.
> In fact, that was done first as basically semantically no-op patches:
> 
>     2f0584f3f4bd ("mm: Rename arch pte_mkwrite()'s to pte_mkwrite_novma()")
>     6ecc21bb432d ("mm: Move pte/pmd_mkwrite() callers with no VMA to _novma()")
>     161e393c0f63 ("mm: Make pte_mkwrite() take a VMA")
> 
> which made this all very pain-free (and was largely a sed script, I think).
> 
>> -                   !pte_dirty(pte) && !PageDirty(page))
>> +                   !pte_dirty(pte) && !PageDirty(page) &&
>> +                   !(vma->vm_flags & VM_DROPPABLE))
> 
> So instead of this kind of thing, we'd have
> 
>> -                   !pte_dirty(pte) && !PageDirty(page))
>> +                   !pte_dirty(pte, vma) && !PageDirty(page) &&
> 
> and the advantage here is that you can't miss anybody by mistake. The
> compiler will be very unhappy if you don't pass in the vma, and then
> any places that would be converted to "pte_dirty_novma()"
> 
> We don't actually have all that many users of pte_dirty(), so it
> doesn't look too nasty. And if we make the pte_dirty() semantics
> depend on the vma, I really think we should do it the same way we did
> pte_mkwrite().

We also have these folio_mark_dirty() calls, for example in 
unpin_user_pages_dirty_lock(). Hm ... so preventing the folio from 
getting dirtied is likely shaky.

I guess we need a way to just reliably identify these folios :/.
Linus Torvalds July 11, 2024, 7:17 p.m. UTC | #20
On Thu, 11 Jul 2024 at 12:08, David Hildenbrand <david@redhat.com> wrote:
>
> We also have these folio_mark_dirty() calls, for example in
> unpin_user_pages_dirty_lock(). Hm ... so preventing the folio from
> getting dirtied is likely shaky.

I do wonder if we should just disallow page pinning for these pages
entirely. When the page can get replaced by zeroes at any time,
pinning it doesn't make much sense.

Except we do have that whole "fast" case that intentionally doesn't
take locks and doesn't have a vma. Darn.

             Linus
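
A sketch of the slow-path half of that idea, as a fragment that might live in mm/gup.c's VMA checks (hypothetical helper name, and it deliberately ignores the gup-fast case noted above):

/* Slow-path GUP, where a VMA is available: refuse to pin VM_DROPPABLE
 * memory, since its contents may be replaced by zeroes at any time. */
static int gup_vma_forbids_pin(struct vm_area_struct *vma, unsigned int gup_flags)
{
	if ((gup_flags & FOLL_PIN) && (vma->vm_flags & VM_DROPPABLE))
		return -EFAULT;
	return 0;
}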
David Hildenbrand July 11, 2024, 7:18 p.m. UTC | #21
On 11.07.24 20:56, David Hildenbrand wrote:
> On 11.07.24 20:54, Jason A. Donenfeld wrote:
>> On Thu, Jul 11, 2024 at 08:24:07PM +0200, David Hildenbrand wrote:
>>>> And PG_large_rmappable seems to only be used for hugetlb branches.
>>>
>>> It should be set for THP/large folios.
>>
>> And it's tested too, apparently.
>>
>> Okay, well, how disappointing is this below? Because I'm running out of
>> tricks for flag reuse.
>>
>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>> index b9e914e1face..c1ea49a7f198 100644
>> --- a/include/linux/page-flags.h
>> +++ b/include/linux/page-flags.h
>> @@ -110,6 +110,7 @@ enum pageflags {
>>    	PG_workingset,
>>    	PG_error,
>>    	PG_owner_priv_1,	/* Owner use. If pagecache, fs may use*/
>> +	PG_owner_priv_2,
> 
> Oh no, no new page flags please :)
> 
> Maybe just follow what Linus suggested: pass vma to pte_dirty() and
> always return false for these special VMAs.

... or look into removing that one case that gives us headache.

No idea what would happen if we do the following:

CCing Yu Zhao.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0761f91b407f..d1dfbd4fd38d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4280,14 +4280,9 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
                 return true;
         }
  
-       /* dirty lazyfree */
-       if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio)) {
-               success = lru_gen_del_folio(lruvec, folio, true);
-               VM_WARN_ON_ONCE_FOLIO(!success, folio);
-               folio_set_swapbacked(folio);
-               lruvec_add_folio_tail(lruvec, folio);
-               return true;
-       }
+       /* lazyfree: we may not be allowed to set swapbacked: MAP_DROPPABLE */
+       if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio))
+               return false;
David Hildenbrand July 11, 2024, 7:20 p.m. UTC | #22
On 11.07.24 21:18, David Hildenbrand wrote:
> On 11.07.24 20:56, David Hildenbrand wrote:
>> On 11.07.24 20:54, Jason A. Donenfeld wrote:
>>> On Thu, Jul 11, 2024 at 08:24:07PM +0200, David Hildenbrand wrote:
>>>>> And PG_large_rmappable seems to only be used for hugetlb branches.
>>>>
>>>> It should be set for THP/large folios.
>>>
>>> And it's tested too, apparently.
>>>
>>> Okay, well, how disappointing is this below? Because I'm running out of
>>> tricks for flag reuse.
>>>
>>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>>> index b9e914e1face..c1ea49a7f198 100644
>>> --- a/include/linux/page-flags.h
>>> +++ b/include/linux/page-flags.h
>>> @@ -110,6 +110,7 @@ enum pageflags {
>>>     	PG_workingset,
>>>     	PG_error,
>>>     	PG_owner_priv_1,	/* Owner use. If pagecache, fs may use*/
>>> +	PG_owner_priv_2,
>>
>> Oh no, no new page flags please :)
>>
>> Maybe just follow what Linus suggested: pass vma to pte_dirty() and
>> always return false for these special VMAs.
> 
> ... or look into removing that one case that gives us headache.
> 
> No idea what would happen if we do the following:
> 
> CCing Yu Zhao.
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0761f91b407f..d1dfbd4fd38d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4280,14 +4280,9 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
>                   return true;
>           }
>    
> -       /* dirty lazyfree */
> -       if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio)) {
> -               success = lru_gen_del_folio(lruvec, folio, true);
> -               VM_WARN_ON_ONCE_FOLIO(!success, folio);
> -               folio_set_swapbacked(folio);
> -               lruvec_add_folio_tail(lruvec, folio);
> -               return true;
> -       }
> +       /* lazyfree: we may not be allowed to set swapbacked: MAP_DROPPABLE */
> +       if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio))
> +               return false;

Note that something is unclear to me: are we maybe running into that 
code also if folio_set_swapbacked() is already set and we are not in the 
lazyfree path (in contrast to what is documented)?
David Hildenbrand July 11, 2024, 7:22 p.m. UTC | #23
On 11.07.24 21:17, Linus Torvalds wrote:
> On Thu, 11 Jul 2024 at 12:08, David Hildenbrand <david@redhat.com> wrote:
>>
>> We also have these folio_mark_dirty() calls, for example in
>> unpin_user_pages_dirty_lock(). Hm ... so preventing the folio from
>> getting dirtied is likely shaky.
> 
> I do wonder if we should just disallow page pinning for these pages
> entirely. When the page can get replaced by zeroes at any time,
> pinning it doesn't make much sense.
> 
> Except we do have that whole "fast" case that intentionally doesn't
> take locks and doesn't have a vma. Darn.

Yeah, and I think it should all be simpler; we shouldn't have to 
special-case these cases everywhere.

Maybe we can just find a way to not do *folio_set_swapbacked() without a 
VMA.
Yu Zhao July 11, 2024, 7:49 p.m. UTC | #24
On Thu, Jul 11, 2024 at 1:20 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 11.07.24 21:18, David Hildenbrand wrote:
> > On 11.07.24 20:56, David Hildenbrand wrote:
> >> On 11.07.24 20:54, Jason A. Donenfeld wrote:
> >>> On Thu, Jul 11, 2024 at 08:24:07PM +0200, David Hildenbrand wrote:
> >>>>> And PG_large_rmappable seems to only be used for hugetlb branches.
> >>>>
> >>>> It should be set for THP/large folios.
> >>>
> >>> And it's tested too, apparently.
> >>>
> >>> Okay, well, how disappointing is this below? Because I'm running out of
> >>> tricks for flag reuse.
> >>>
> >>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> >>> index b9e914e1face..c1ea49a7f198 100644
> >>> --- a/include/linux/page-flags.h
> >>> +++ b/include/linux/page-flags.h
> >>> @@ -110,6 +110,7 @@ enum pageflags {
> >>>             PG_workingset,
> >>>             PG_error,
> >>>             PG_owner_priv_1,        /* Owner use. If pagecache, fs may use*/
> >>> +   PG_owner_priv_2,
> >>
> >> Oh no, no new page flags please :)
> >>
> >> Maybe just follow what Linus suggested: pass vma to pte_dirty() and
> >> always return false for these special VMAs.
> >
> > ... or look into removing that one case that gives us headache.
> >
> > No idea what would happen if we do the following:
> >
> > CCing Yu Zhao.
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 0761f91b407f..d1dfbd4fd38d 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4280,14 +4280,9 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> >                   return true;
> >           }
> >
> > -       /* dirty lazyfree */
> > -       if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio)) {
> > -               success = lru_gen_del_folio(lruvec, folio, true);
> > -               VM_WARN_ON_ONCE_FOLIO(!success, folio);
> > -               folio_set_swapbacked(folio);
> > -               lruvec_add_folio_tail(lruvec, folio);
> > -               return true;
> > -       }
> > +       /* lazyfree: we may not be allowed to set swapbacked: MAP_DROPPABLE */
> > +       if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio))
> > +               return false;

This is an optimization to avoid an unnecessary trip to
shrink_folio_list(), so it's safe to delete the entire 'if' block, and
that would be preferable to leaving a dangling 'if'.

> Note that something is unclear to me: are we maybe running into that
> code also if folio_set_swapbacked() is already set and we are not in the
> lazyfree path (in contrast to what is documented)?

Not sure what you mean: either rmap sees pte_dirty() and does
folio_mark_dirty() and then folio_set_swapbacked(); or MGLRU does the
same sequence, with the first two steps in walk_pte_range() and the
last one here.
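
As a condensed sketch of that sequence (illustrative only, not the
actual call chains; pte and folio come from the respective walkers):

/* dirtied lazyfree handling, step by step */
if (pte_dirty(pte)) {
	folio_mark_dirty(folio);	/* rmap, or MGLRU's walk_pte_range() */
	folio_set_swapbacked(folio);	/* rmap right after, or MGLRU in sort_folio() */
}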
Yu Zhao July 11, 2024, 7:52 p.m. UTC | #25
On Thu, Jul 11, 2024 at 1:49 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Thu, Jul 11, 2024 at 1:20 PM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 11.07.24 21:18, David Hildenbrand wrote:
> > > On 11.07.24 20:56, David Hildenbrand wrote:
> > >> On 11.07.24 20:54, Jason A. Donenfeld wrote:
> > >>> On Thu, Jul 11, 2024 at 08:24:07PM +0200, David Hildenbrand wrote:
> > >>>>> And PG_large_rmappable seems to only be used for hugetlb branches.
> > >>>>
> > >>>> It should be set for THP/large folios.
> > >>>
> > >>> And it's tested too, apparently.
> > >>>
> > >>> Okay, well, how disappointing is this below? Because I'm running out of
> > >>> tricks for flag reuse.
> > >>>
> > >>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > >>> index b9e914e1face..c1ea49a7f198 100644
> > >>> --- a/include/linux/page-flags.h
> > >>> +++ b/include/linux/page-flags.h
> > >>> @@ -110,6 +110,7 @@ enum pageflags {
> > >>>             PG_workingset,
> > >>>             PG_error,
> > >>>             PG_owner_priv_1,        /* Owner use. If pagecache, fs may use*/
> > >>> +   PG_owner_priv_2,
> > >>
> > >> Oh no, no new page flags please :)
> > >>
> > >> Maybe just follow what Linux suggested: pass vma to pte_dirty() and
> > >> always return false for these special VMAs.
> > >
> > > ... or look into removing that one case that gives us headake.
> > >
> > > No idea what would happen if we do the following:
> > >
> > > CCing Yu Zhao.
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 0761f91b407f..d1dfbd4fd38d 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -4280,14 +4280,9 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > >                   return true;
> > >           }
> > >
> > > -       /* dirty lazyfree */
> > > -       if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio)) {
> > > -               success = lru_gen_del_folio(lruvec, folio, true);
> > > -               VM_WARN_ON_ONCE_FOLIO(!success, folio);
> > > -               folio_set_swapbacked(folio);
> > > -               lruvec_add_folio_tail(lruvec, folio);
> > > -               return true;
> > > -       }
> > > +       /* lazyfree: we may not be allowed to set swapbacked: MAP_DROPPABLE */
> > > +       if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio))
> > > +               return false;
>
> This is an optimization to avoid an unnecessary trip to
> shrink_folio_list(), so it's safe to delete the entire 'if' block, and
> that would be preferable than leaving a dangling 'if'.
>
> > Note that something is unclear to me: are we maybe running into that
> > code also if folio_set_swapbacked() is already set and we are not in the
> > lazyfree path (in contrast to what is documented)?
>
> Not sure what you mean: either rmap sees pte_dirty() and does
> folio_mark_dirty() and then folio_set_swapbacked(); or MGLRU does the
> same sequence, with the first two steps in walk_pte_range() and the
> last one here.

Rationale: rmap is expensive (cache unfriendly) and MGLRU tries to
avoid using it.
David Hildenbrand July 11, 2024, 7:53 p.m. UTC | #26
On 11.07.24 21:49, Yu Zhao wrote:
> On Thu, Jul 11, 2024 at 1:20 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 11.07.24 21:18, David Hildenbrand wrote:
>>> On 11.07.24 20:56, David Hildenbrand wrote:
>>>> On 11.07.24 20:54, Jason A. Donenfeld wrote:
>>>>> On Thu, Jul 11, 2024 at 08:24:07PM +0200, David Hildenbrand wrote:
>>>>>>> And PG_large_rmappable seems to only be used for hugetlb branches.
>>>>>>
>>>>>> It should be set for THP/large folios.
>>>>>
>>>>> And it's tested too, apparently.
>>>>>
>>>>> Okay, well, how disappointing is this below? Because I'm running out of
>>>>> tricks for flag reuse.
>>>>>
>>>>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>>>>> index b9e914e1face..c1ea49a7f198 100644
>>>>> --- a/include/linux/page-flags.h
>>>>> +++ b/include/linux/page-flags.h
>>>>> @@ -110,6 +110,7 @@ enum pageflags {
>>>>>              PG_workingset,
>>>>>              PG_error,
>>>>>              PG_owner_priv_1,        /* Owner use. If pagecache, fs may use*/
>>>>> +   PG_owner_priv_2,
>>>>
>>>> Oh no, no new page flags please :)
>>>>
>>>> Maybe just follow what Linux suggested: pass vma to pte_dirty() and
>>>> always return false for these special VMAs.
>>>
>>> ... or look into removing that one case that gives us headake.
>>>
>>> No idea what would happen if we do the following:
>>>
>>> CCing Yu Zhao.
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 0761f91b407f..d1dfbd4fd38d 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -4280,14 +4280,9 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
>>>                    return true;
>>>            }
>>>
>>> -       /* dirty lazyfree */
>>> -       if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio)) {
>>> -               success = lru_gen_del_folio(lruvec, folio, true);
>>> -               VM_WARN_ON_ONCE_FOLIO(!success, folio);
>>> -               folio_set_swapbacked(folio);
>>> -               lruvec_add_folio_tail(lruvec, folio);
>>> -               return true;
>>> -       }
>>> +       /* lazyfree: we may not be allowed to set swapbacked: MAP_DROPPABLE */
>>> +       if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio))
>>> +               return false;
> 
> This is an optimization to avoid an unnecessary trip to
> shrink_folio_list(), so it's safe to delete the entire 'if' block, and
> that would be preferable than leaving a dangling 'if'.

Great, thanks.

> 
>> Note that something is unclear to me: are we maybe running into that
>> code also if folio_set_swapbacked() is already set and we are not in the
>> lazyfree path (in contrast to what is documented)?
> 
> Not sure what you mean: either rmap sees pte_dirty() and does
> folio_mark_dirty() and then folio_set_swapbacked(); or MGLRU does the
> same sequence, with the first two steps in walk_pte_range() and the
> last one here.

Let me rephrase:

Checking for lazyfree is

"folio_test_anon(folio) && !folio_test_swapbacked(folio)"

Testing for dirtied lazyfree is

"folio_test_anon(folio) && !folio_test_swapbacked(folio) &&
  folio_test)dirty(folio)"

So I'm wondering about the missing folio_test_swapbacked() test.
Yu Zhao July 11, 2024, 7:58 p.m. UTC | #27
On Thu, Jul 11, 2024 at 1:53 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 11.07.24 21:49, Yu Zhao wrote:
> > On Thu, Jul 11, 2024 at 1:20 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 11.07.24 21:18, David Hildenbrand wrote:
> >>> On 11.07.24 20:56, David Hildenbrand wrote:
> >>>> On 11.07.24 20:54, Jason A. Donenfeld wrote:
> >>>>> On Thu, Jul 11, 2024 at 08:24:07PM +0200, David Hildenbrand wrote:
> >>>>>>> And PG_large_rmappable seems to only be used for hugetlb branches.
> >>>>>>
> >>>>>> It should be set for THP/large folios.
> >>>>>
> >>>>> And it's tested too, apparently.
> >>>>>
> >>>>> Okay, well, how disappointing is this below? Because I'm running out of
> >>>>> tricks for flag reuse.
> >>>>>
> >>>>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> >>>>> index b9e914e1face..c1ea49a7f198 100644
> >>>>> --- a/include/linux/page-flags.h
> >>>>> +++ b/include/linux/page-flags.h
> >>>>> @@ -110,6 +110,7 @@ enum pageflags {
> >>>>>              PG_workingset,
> >>>>>              PG_error,
> >>>>>              PG_owner_priv_1,        /* Owner use. If pagecache, fs may use*/
> >>>>> +   PG_owner_priv_2,
> >>>>
> >>>> Oh no, no new page flags please :)
> >>>>
> >>>> Maybe just follow what Linux suggested: pass vma to pte_dirty() and
> >>>> always return false for these special VMAs.
> >>>
> >>> ... or look into removing that one case that gives us headake.
> >>>
> >>> No idea what would happen if we do the following:
> >>>
> >>> CCing Yu Zhao.
> >>>
> >>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>> index 0761f91b407f..d1dfbd4fd38d 100644
> >>> --- a/mm/vmscan.c
> >>> +++ b/mm/vmscan.c
> >>> @@ -4280,14 +4280,9 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> >>>                    return true;
> >>>            }
> >>>
> >>> -       /* dirty lazyfree */
> >>> -       if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio)) {
> >>> -               success = lru_gen_del_folio(lruvec, folio, true);
> >>> -               VM_WARN_ON_ONCE_FOLIO(!success, folio);
> >>> -               folio_set_swapbacked(folio);
> >>> -               lruvec_add_folio_tail(lruvec, folio);
> >>> -               return true;
> >>> -       }
> >>> +       /* lazyfree: we may not be allowed to set swapbacked: MAP_DROPPABLE */
> >>> +       if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio))
> >>> +               return false;
> >
> > This is an optimization to avoid an unnecessary trip to
> > shrink_folio_list(), so it's safe to delete the entire 'if' block, and
> > that would be preferable than leaving a dangling 'if'.
>
> Great, thanks.
>
> >
> >> Note that something is unclear to me: are we maybe running into that
> >> code also if folio_set_swapbacked() is already set and we are not in the
> >> lazyfree path (in contrast to what is documented)?
> >
> > Not sure what you mean: either rmap sees pte_dirty() and does
> > folio_mark_dirty() and then folio_set_swapbacked(); or MGLRU does the
> > same sequence, with the first two steps in walk_pte_range() and the
> > last one here.
>
> Let me rephrase:
>
> Checking for lazyfree is
>
> "folio_test_anon(folio) && !folio_test_swapbacked(folio)"
>
> Testing for dirtied lazyfree is
>
> "folio_test_anon(folio) && !folio_test_swapbacked(folio) &&
>   folio_test)dirty(folio)"
>
> So I'm wondering about the missing folio_test_swapbacked() test.

It's not missing: type == LRU_GEN_FILE means folio_is_file_lru(),
which in turn means !folio_test_swapbacked().
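
For reference, the relationship being relied on is roughly (a sketch,
not verbatim kernel code):

static inline bool folio_is_file_lru(struct folio *folio)
{
	return !folio_test_swapbacked(folio);
}

So a folio sorted with type == LRU_GEN_FILE already implies
!folio_test_swapbacked(), and an explicit check would be redundant.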
Jason A. Donenfeld July 11, 2024, 8:07 p.m. UTC | #28
Hi Linus,

On Thu, Jul 11, 2024 at 10:57:17AM -0700, Linus Torvalds wrote:
> May I suggest a slightly different approach: do what we did for "pte_mkwrite()".
> 
> It needed the vma too, for not too dissimilar reasons: special dirty
> bit handling for the shadow stack. See

Thanks for the suggestion. That seems pretty clean.

It still needs to avoid setting swapbacked in the first place, but
ensuring that it's never dirty means it won't get turned back on.

The first patch renames pte_dirty() to pte_dirty_novma(). The second
patch adds an inline function, pte_dirty(pte, vma), that just forwards
the pte to pte_dirty_novma(), and then converts callers that have a vma
available to call pte_dirty(). And then the VM_DROPPABLE patch simply
adds the `&& !(vma->vm_flags & VM_DROPPABLE)` condition to pte_dirty().
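
A minimal sketch of the combined end state being described (names taken
from the description above; the actual patches are in the tree linked
below):

/* each arch's old pte_dirty() becomes pte_dirty_novma() */
static inline bool pte_dirty(pte_t pte, struct vm_area_struct *vma)
{
	/* VM_DROPPABLE mappings must never be treated as dirty */
	return pte_dirty_novma(pte) && !(vma->vm_flags & VM_DROPPABLE);
}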

I put these in https://git.zx2c4.com/linux-rng/log/ per usual, and I'll
post a new version to the list before long (unless there are objections).

Jason
Jason A. Donenfeld July 11, 2024, 8:17 p.m. UTC | #29
On Thu, Jul 11, 2024 at 10:07:30PM +0200, Jason A. Donenfeld wrote:
> Hi Linus,
> 
> On Thu, Jul 11, 2024 at 10:57:17AM -0700, Linus Torvalds wrote:
> > May I suggest a slightly different approach: do what we did for "pte_mkwrite()".
> > 
> > It needed the vma too, for not too dissimilar reasons: special dirty
> > bit handling for the shadow stack. See
> 
> Thanks for the suggestion. That seems pretty clean.
> 
> It still needs to avoid setting swapbacked in the first place, but
> ensuring that it's never dirty means it won't get turned back on.
> 
> The first patch renames pte_dirty() to pte_dirty_novma(). The second
> patch adds an inline function, pte_dirty(pte, vma) that just forwards
> the pte to pte_dirty_novma(), and then converts callers that have a vma
> available to pass to call pte_dirty(). And then the VM_DROPPABLE patch
> simply adds the `&& !(vma->vm_flags & VM_DROPPABLE)` condition to
> pte_dirty().
> 
> I put these in https://git.zx2c4.com/linux-rng/log/ per usual, and I'll
> post a new version to the list not before long (unless objections).

Oh, I didn't catch the upthread discussion in time (my mail flow is
based on `lei up`, which I guess I should run more frequently). It seems
we might go in a different direction.

I'll move that to https://git.zx2c4.com/linux-rng/log/?h=jd/pte_dirty in
case it's useful later, though.

Jason
Jason A. Donenfeld July 11, 2024, 8:20 p.m. UTC | #30
Hi David,

On Thu, Jul 11, 2024 at 01:49:42PM -0600, Yu Zhao wrote:
> On Thu, Jul 11, 2024 at 1:20 PM David Hildenbrand <david@redhat.com> wrote:
> > > -       /* dirty lazyfree */
> > > -       if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio)) {
> > > -               success = lru_gen_del_folio(lruvec, folio, true);
> > > -               VM_WARN_ON_ONCE_FOLIO(!success, folio);
> > > -               folio_set_swapbacked(folio);
> > > -               lruvec_add_folio_tail(lruvec, folio);
> > > -               return true;
> > > -       }

> This is an optimization to avoid an unnecessary trip to
> shrink_folio_list(), so it's safe to delete the entire 'if' block, and
> that would be preferable than leaving a dangling 'if'.

Alright, I'll just remove that entire chunk then, for v+1 of this patch?
That sounds prettttty okay.

Jason
David Hildenbrand July 11, 2024, 8:59 p.m. UTC | #31
On 11.07.24 21:58, Yu Zhao wrote:
> On Thu, Jul 11, 2024 at 1:53 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 11.07.24 21:49, Yu Zhao wrote:
>>> On Thu, Jul 11, 2024 at 1:20 PM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 11.07.24 21:18, David Hildenbrand wrote:
>>>>> On 11.07.24 20:56, David Hildenbrand wrote:
>>>>>> On 11.07.24 20:54, Jason A. Donenfeld wrote:
>>>>>>> On Thu, Jul 11, 2024 at 08:24:07PM +0200, David Hildenbrand wrote:
>>>>>>>>> And PG_large_rmappable seems to only be used for hugetlb branches.
>>>>>>>>
>>>>>>>> It should be set for THP/large folios.
>>>>>>>
>>>>>>> And it's tested too, apparently.
>>>>>>>
>>>>>>> Okay, well, how disappointing is this below? Because I'm running out of
>>>>>>> tricks for flag reuse.
>>>>>>>
>>>>>>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>>>>>>> index b9e914e1face..c1ea49a7f198 100644
>>>>>>> --- a/include/linux/page-flags.h
>>>>>>> +++ b/include/linux/page-flags.h
>>>>>>> @@ -110,6 +110,7 @@ enum pageflags {
>>>>>>>               PG_workingset,
>>>>>>>               PG_error,
>>>>>>>               PG_owner_priv_1,        /* Owner use. If pagecache, fs may use*/
>>>>>>> +   PG_owner_priv_2,
>>>>>>
>>>>>> Oh no, no new page flags please :)
>>>>>>
>>>>>> Maybe just follow what Linux suggested: pass vma to pte_dirty() and
>>>>>> always return false for these special VMAs.
>>>>>
>>>>> ... or look into removing that one case that gives us headake.
>>>>>
>>>>> No idea what would happen if we do the following:
>>>>>
>>>>> CCing Yu Zhao.
>>>>>
>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>> index 0761f91b407f..d1dfbd4fd38d 100644
>>>>> --- a/mm/vmscan.c
>>>>> +++ b/mm/vmscan.c
>>>>> @@ -4280,14 +4280,9 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
>>>>>                     return true;
>>>>>             }
>>>>>
>>>>> -       /* dirty lazyfree */
>>>>> -       if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio)) {
>>>>> -               success = lru_gen_del_folio(lruvec, folio, true);
>>>>> -               VM_WARN_ON_ONCE_FOLIO(!success, folio);
>>>>> -               folio_set_swapbacked(folio);
>>>>> -               lruvec_add_folio_tail(lruvec, folio);
>>>>> -               return true;
>>>>> -       }
>>>>> +       /* lazyfree: we may not be allowed to set swapbacked: MAP_DROPPABLE */
>>>>> +       if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio))
>>>>> +               return false;
>>>
>>> This is an optimization to avoid an unnecessary trip to
>>> shrink_folio_list(), so it's safe to delete the entire 'if' block, and
>>> that would be preferable than leaving a dangling 'if'.
>>
>> Great, thanks.
>>
>>>
>>>> Note that something is unclear to me: are we maybe running into that
>>>> code also if folio_set_swapbacked() is already set and we are not in the
>>>> lazyfree path (in contrast to what is documented)?
>>>
>>> Not sure what you mean: either rmap sees pte_dirty() and does
>>> folio_mark_dirty() and then folio_set_swapbacked(); or MGLRU does the
>>> same sequence, with the first two steps in walk_pte_range() and the
>>> last one here.
>>
>> Let me rephrase:
>>
>> Checking for lazyfree is
>>
>> "folio_test_anon(folio) && !folio_test_swapbacked(folio)"
>>
>> Testing for dirtied lazyfree is
>>
>> "folio_test_anon(folio) && !folio_test_swapbacked(folio) &&
>>    folio_test)dirty(folio)"
>>
>> So I'm wondering about the missing folio_test_swapbacked() test.
> 
> It's not missing: type == LRU_GEN_FILE means folio_is_file_lru(),
> which in turn means !folio_test_swapbacked().
> 

Ahh, got it, thanks!
David Hildenbrand July 11, 2024, 8:59 p.m. UTC | #32
On 11.07.24 22:20, Jason A. Donenfeld wrote:
> Hi David,
> 
> On Thu, Jul 11, 2024 at 01:49:42PM -0600, Yu Zhao wrote:
>> On Thu, Jul 11, 2024 at 1:20 PM David Hildenbrand <david@redhat.com> wrote:
>>>> -       /* dirty lazyfree */
>>>> -       if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio)) {
>>>> -               success = lru_gen_del_folio(lruvec, folio, true);
>>>> -               VM_WARN_ON_ONCE_FOLIO(!success, folio);
>>>> -               folio_set_swapbacked(folio);
>>>> -               lruvec_add_folio_tail(lruvec, folio);
>>>> -               return true;
>>>> -       }
> 
>> This is an optimization to avoid an unnecessary trip to
>> shrink_folio_list(), so it's safe to delete the entire 'if' block, and
>> that would be preferable than leaving a dangling 'if'.
> 
> Alright, I'll just remove that entire chunk then, for v+1 of this patch?
> That sounds prettttty okay.

Yes!
David Hildenbrand July 11, 2024, 10:29 p.m. UTC | #33
On 10.07.24 05:27, David Hildenbrand wrote:
> On 09.07.24 15:05, Jason A. Donenfeld wrote:
>> The vDSO getrandom() implementation works with a buffer allocated with a
>> new system call that has certain requirements:
>>
>> - It shouldn't be written to core dumps.
>>     * Easy: VM_DONTDUMP.
>> - It should be zeroed on fork.
>>     * Easy: VM_WIPEONFORK.
>>
>> - It shouldn't be written to swap.
>>     * Uh-oh: mlock is rlimited.
>>     * Uh-oh: mlock isn't inherited by forks.
>>
>> It turns out that the vDSO getrandom() function has three really nice
>> characteristics that we can exploit to solve this problem:
>>
>> 1) Due to being wiped during fork(), the vDSO code is already robust to
>>      having the contents of the pages it reads zeroed out midway through
>>      the function's execution.
>>
>> 2) In the absolute worst case of whatever contingency we're coding for,
>>      we have the option to fallback to the getrandom() syscall, and
>>      everything is fine.
>>
>> 3) The buffers the function uses are only ever useful for a maximum of
>>      60 seconds -- a sort of cache, rather than a long term allocation.
>>
>> These characteristics mean that we can introduce VM_DROPPABLE, which
>> has the following semantics:
>>
>> a) It never is written out to swap.
>> b) Under memory pressure, mm can just drop the pages (so that they're
>>      zero when read back again).
>> c) It is inherited by fork.
>> d) It doesn't count against the mlock budget, since nothing is locked.
>>
>> This is fairly simple to implement, with the one snag that we have to
>> use 64-bit VM_* flags, but this shouldn't be a problem, since the only
>> consumers will probably be 64-bit anyway.
>>
>> This way, allocations used by vDSO getrandom() can use:
>>
>>       VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE
>>
>> And there will be no problem with using memory when not in use, not
>> wiping on fork(), coredumps, or writing out to swap.
>>
>> In order to let vDSO getrandom() use this, expose these via mmap(2) as
>> MAP_DROPPABLE.
>>
>> Finally, the provided self test ensures that this is working as desired.
> 
> Acked-by: David Hildenbrand <david@redhat.com>
> 
> 
> I'll try to think of some corner cases we might be missing.

Sorry that I keep coming up with corner cases :) But these should be easy to handle:

1) We should disallow KSM.

diff --git a/mm/ksm.c b/mm/ksm.c
index df6bae3a5a2c..d6744183ba41 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -713,7 +713,7 @@ static bool vma_ksm_compatible(struct vm_area_struct *vma)
  {
         if (vma->vm_flags & (VM_SHARED  | VM_MAYSHARE   | VM_PFNMAP  |
                              VM_IO      | VM_DONTEXPAND | VM_HUGETLB |
-                            VM_MIXEDMAP))
+                            VM_MIXEDMAP | VM_DROPPABLE))
                 return false;           /* just ignore the advice */
  
         if (vma_is_dax(vma))


We don't want to suddenly get pages that are swapbacked.


2) We should disable userfaultfd

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 05d59f74fc88..a12bcf042551 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -218,6 +218,9 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
  {
         vm_flags &= __VM_UFFD_FLAGS;
  
+       if (vm_flags & VM_DROPPABLE)
+               return false;
+
         if ((vm_flags & VM_UFFD_MINOR) &&
             (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
                 return false;


Otherwise someone could place swapbacked pages in there (using UFFDIO_MOVE),
I think. But conceptually, I don't think userfaultfd makes much sense with
MAP_DROPPABLE at all. And if there are good reasons for it in the future, we
could enable the parts that make sense.


I think other places like khugepaged should handle it correctly (not set
swapbacked) due to your changes to folio_add_new_anon_rmap().
Jason A. Donenfeld July 12, 2024, 1:21 a.m. UTC | #34
On Fri, Jul 12, 2024 at 12:29:17AM +0200, David Hildenbrand wrote:
> > I'll try to think of some corner cases we might be missing.
> 
> Sorry that I keep coming up with corner cases :) But these should be easy to handle:

Thank you for coming up with them!

> We don't want to suddenly get pages that are swapbacked.
 
> Otherwise someone could place swapbacked pages in there (using UFFDIO_MOVE)

Both seem like reasonable concerns. Added to v+1.

Jason
diff mbox series

Patch

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 71e5039d940d..46f0b0fe9ee3 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -708,6 +708,7 @@  static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 		[ilog2(VM_SHADOW_STACK)] = "ss",
 #endif
 #ifdef CONFIG_64BIT
+		[ilog2(VM_DROPPABLE)] = "dp",
 		[ilog2(VM_SEALED)] = "sl",
 #endif
 	};
diff --git a/include/linux/mm.h b/include/linux/mm.h
index eb7c96d24ac0..e078c2890bf8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -406,6 +406,13 @@  extern unsigned int kobjsize(const void *objp);
 #define VM_ALLOW_ANY_UNCACHED		VM_NONE
 #endif
 
+#ifdef CONFIG_64BIT
+#define VM_DROPPABLE_BIT	40
+#define VM_DROPPABLE		BIT(VM_DROPPABLE_BIT)
+#else
+#define VM_DROPPABLE		VM_NONE
+#endif
+
 #ifdef CONFIG_64BIT
 /* VM is sealed, in vm_flags */
 #define VM_SEALED	_BITUL(63)
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index e46d6e82765e..b63d211bd141 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -165,6 +165,12 @@  IF_HAVE_PG_ARCH_X(arch_3)
 # define IF_HAVE_UFFD_MINOR(flag, name)
 #endif
 
+#ifdef CONFIG_64BIT
+# define IF_HAVE_VM_DROPPABLE(flag, name) {flag, name},
+#else
+# define IF_HAVE_VM_DROPPABLE(flag, name)
+#endif
+
 #define __def_vmaflag_names						\
 	{VM_READ,			"read"		},		\
 	{VM_WRITE,			"write"		},		\
@@ -197,6 +203,7 @@  IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY,	"softdirty"	)		\
 	{VM_MIXEDMAP,			"mixedmap"	},		\
 	{VM_HUGEPAGE,			"hugepage"	},		\
 	{VM_NOHUGEPAGE,			"nohugepage"	},		\
+IF_HAVE_VM_DROPPABLE(VM_DROPPABLE,	"droppable"	)		\
 	{VM_MERGEABLE,			"mergeable"	}		\
 
 #define show_vma_flags(flags)						\
diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
index a246e11988d5..e89d00528f2f 100644
--- a/include/uapi/linux/mman.h
+++ b/include/uapi/linux/mman.h
@@ -17,6 +17,7 @@ 
 #define MAP_SHARED	0x01		/* Share changes */
 #define MAP_PRIVATE	0x02		/* Changes are private */
 #define MAP_SHARED_VALIDATE 0x03	/* share + validate extension flags */
+#define MAP_DROPPABLE	0x08		/* Zero memory under memory pressure. */
 
 /*
  * Huge page size encoding when MAP_HUGETLB is specified, and a huge page
diff --git a/mm/madvise.c b/mm/madvise.c
index a77893462b92..cba5bc652fc4 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1068,13 +1068,16 @@  static int madvise_vma_behavior(struct vm_area_struct *vma,
 		new_flags |= VM_WIPEONFORK;
 		break;
 	case MADV_KEEPONFORK:
+		if (vma->vm_flags & VM_DROPPABLE)
+			return -EINVAL;
 		new_flags &= ~VM_WIPEONFORK;
 		break;
 	case MADV_DONTDUMP:
 		new_flags |= VM_DONTDUMP;
 		break;
 	case MADV_DODUMP:
-		if (!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL)
+		if ((!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL) ||
+		    (vma->vm_flags & VM_DROPPABLE))
 			return -EINVAL;
 		new_flags &= ~VM_DONTDUMP;
 		break;
diff --git a/mm/mlock.c b/mm/mlock.c
index 30b51cdea89d..b87b3d8cc9cc 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -485,7 +485,7 @@  static int mlock_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
 
 	if (newflags == oldflags || (oldflags & VM_SPECIAL) ||
 	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) ||
-	    vma_is_dax(vma) || vma_is_secretmem(vma))
+	    vma_is_dax(vma) || vma_is_secretmem(vma) || (oldflags & VM_DROPPABLE))
 		/* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */
 		goto out;
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 83b4682ec85c..8aeedeb784c2 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1369,6 +1369,36 @@  unsigned long do_mmap(struct file *file, unsigned long addr,
 			pgoff = 0;
 			vm_flags |= VM_SHARED | VM_MAYSHARE;
 			break;
+		case MAP_DROPPABLE:
+			if (VM_DROPPABLE == VM_NONE)
+				return -ENOTSUPP;
+			/*
+			 * A locked or stack area makes no sense to be droppable.
+			 *
+			 * Also, since droppable pages can just go away at any time
+			 * it makes no sense to copy them on fork or dump them.
+			 *
+			 * And don't attempt to combine with hugetlb for now.
+			 */
+			if (flags & (MAP_LOCKED | MAP_HUGETLB))
+			        return -EINVAL;
+			if (vm_flags & (VM_GROWSDOWN | VM_GROWSUP))
+			        return -EINVAL;
+
+			vm_flags |= VM_DROPPABLE;
+
+			/*
+			 * If the pages can be dropped, then it doesn't make
+			 * sense to reserve them.
+			 */
+			vm_flags |= VM_NORESERVE;
+
+			/*
+			 * Likewise, they're volatile enough that they
+			 * shouldn't survive forks or coredumps.
+			 */
+			vm_flags |= VM_WIPEONFORK | VM_DONTDUMP;
+			fallthrough;
 		case MAP_PRIVATE:
 			/*
 			 * Set pgoff according to addr for anon_vma.
diff --git a/mm/rmap.c b/mm/rmap.c
index e8fc5ecb59b2..1f9b5a9cb121 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1397,7 +1397,12 @@  void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 	VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
 	VM_BUG_ON_VMA(address < vma->vm_start ||
 			address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
-	__folio_set_swapbacked(folio);
+	/*
+	 * VM_DROPPABLE mappings don't swap; instead they're just dropped when
+	 * under memory pressure.
+	 */
+	if (!(vma->vm_flags & VM_DROPPABLE))
+		__folio_set_swapbacked(folio);
 	__folio_set_anon(folio, vma, address, true);
 
 	if (likely(!folio_test_large(folio))) {
@@ -1841,7 +1846,13 @@  static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				 * plus the rmap(s) (dropped by discard:).
 				 */
 				if (ref_count == 1 + map_count &&
-				    !folio_test_dirty(folio)) {
+				    (!folio_test_dirty(folio) ||
+				     /*
+				      * Unlike MADV_FREE mappings, VM_DROPPABLE
+				      * ones can be dropped even if they've
+				      * been dirtied.
+				      */
+				     (vma->vm_flags & VM_DROPPABLE))) {
 					dec_mm_counter(mm, MM_ANONPAGES);
 					goto discard;
 				}
@@ -1851,7 +1862,12 @@  static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				 * discarded. Remap the page to page table.
 				 */
 				set_pte_at(mm, address, pvmw.pte, pteval);
-				folio_set_swapbacked(folio);
+				/*
+				 * Unlike MADV_FREE mappings, VM_DROPPABLE ones
+				 * never get swap backed on failure to drop.
+				 */
+				if (!(vma->vm_flags & VM_DROPPABLE))
+					folio_set_swapbacked(folio);
 				ret = false;
 				page_vma_mapped_walk_done(&pvmw);
 				break;
diff --git a/tools/include/uapi/linux/mman.h b/tools/include/uapi/linux/mman.h
index a246e11988d5..e89d00528f2f 100644
--- a/tools/include/uapi/linux/mman.h
+++ b/tools/include/uapi/linux/mman.h
@@ -17,6 +17,7 @@ 
 #define MAP_SHARED	0x01		/* Share changes */
 #define MAP_PRIVATE	0x02		/* Changes are private */
 #define MAP_SHARED_VALIDATE 0x03	/* share + validate extension flags */
+#define MAP_DROPPABLE	0x08		/* Zero memory under memory pressure. */
 
 /*
  * Huge page size encoding when MAP_HUGETLB is specified, and a huge page
diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
index 0b9ab987601c..a8beeb43c2b5 100644
--- a/tools/testing/selftests/mm/.gitignore
+++ b/tools/testing/selftests/mm/.gitignore
@@ -49,3 +49,4 @@  hugetlb_fault_after_madv
 hugetlb_madv_vs_map
 mseal_test
 seal_elf
+droppable
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index 3b49bc3d0a3b..e3e5740e13e1 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -73,6 +73,7 @@  TEST_GEN_FILES += ksm_functional_tests
 TEST_GEN_FILES += mdwe_test
 TEST_GEN_FILES += hugetlb_fault_after_madv
 TEST_GEN_FILES += hugetlb_madv_vs_map
+TEST_GEN_FILES += droppable
 
 ifneq ($(ARCH),arm64)
 TEST_GEN_FILES += soft-dirty
diff --git a/tools/testing/selftests/mm/droppable.c b/tools/testing/selftests/mm/droppable.c
new file mode 100644
index 000000000000..f3d9ecf96890
--- /dev/null
+++ b/tools/testing/selftests/mm/droppable.c
@@ -0,0 +1,53 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2024 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#include <assert.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <signal.h>
+#include <sys/mman.h>
+#include <linux/mman.h>
+
+#include "../kselftest.h"
+
+int main(int argc, char *argv[])
+{
+	size_t alloc_size = 134217728;
+	size_t page_size = getpagesize();
+	void *alloc;
+	pid_t child;
+
+	ksft_print_header();
+	ksft_set_plan(1);
+
+	alloc = mmap(0, alloc_size, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_DROPPABLE, -1, 0);
+	assert(alloc != MAP_FAILED);
+	memset(alloc, 'A', alloc_size);
+	for (size_t i = 0; i < alloc_size; i += page_size)
+		assert(*(uint8_t *)(alloc + i));
+
+	child = fork();
+	assert(child >= 0);
+	if (!child) {
+		for (;;)
+			*(char *)malloc(page_size) = 'B';
+	}
+
+	for (bool done = false; !done;) {
+		for (size_t i = 0; i < alloc_size; i += page_size) {
+			if (!*(uint8_t *)(alloc + i)) {
+				done = true;
+				break;
+			}
+		}
+	}
+	kill(child, SIGTERM);
+
+	ksft_test_result_pass("MAP_DROPPABLE: PASS\n");
+	exit(KSFT_PASS);
+}