
mm: Migrate high-order folios in swap cache correctly

Message ID 20231214045841.961776-1-willy@infradead.org (mailing list archive)
State New
Series mm: Migrate high-order folios in swap cache correctly

Commit Message

Matthew Wilcox (Oracle) Dec. 14, 2023, 4:58 a.m. UTC
From: Charan Teja Kalla <quic_charante@quicinc.com>

Large folios occupy N consecutive entries in the swap cache
instead of using multi-index entries like the page cache.
However, if a large folio is re-added to the LRU list, it can
be migrated.  The migration code was not aware of the difference
between the swap cache and the page cache and assumed that a single
xas_store() would be sufficient.

This leaves potentially many stale pointers to the now-migrated folio
in the swap cache, which can lead to almost arbitrary data corruption
in the future.  This can also manifest as infinite loops with the
RCU read lock held.
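
As an illustration, a simplified sketch of the layout difference
(xas/index/i are illustrative locals; modelled loosely on
__filemap_add_folio() and add_to_swap_cache(), not copied from them):

	/* Page cache: a single multi-index entry spans the whole folio,
	 * so one store covers every index the folio occupies. */
	xas_set_order(&xas, index, folio_order(folio));
	xas_store(&xas, folio);

	/* Swap cache: folio_nr_pages(folio) consecutive slots, one per
	 * subpage, so migration must rewrite every slot or stale
	 * pointers to the old folio are left behind. */
	for (i = 0; i < folio_nr_pages(folio); i++) {
		xas_store(&xas, folio);
		xas_next(&xas);
	}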

Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
[modifications to the changelog & tweaked the fix]
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/migrate.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Comments

Andrew Morton Dec. 14, 2023, 10:11 p.m. UTC | #1
On Thu, 14 Dec 2023 04:58:41 +0000 "Matthew Wilcox (Oracle)" <willy@infradead.org> wrote:

> From: Charan Teja Kalla <quic_charante@quicinc.com>
> 
> Large folios occupy N consecutive entries in the swap cache
> instead of using multi-index entries like the page cache.
> However, if a large folio is re-added to the LRU list, it can
> be migrated.  The migration code was not aware of the difference
> between the swap cache and the page cache and assumed that a single
> xas_store() would be sufficient.
> 
> This leaves potentially many stale pointers to the now-migrated folio
> in the swap cache, which can lead to almost arbitrary data corruption
> in the future.  This can also manifest as infinite loops with the
> RCU read lock held.
> 
> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
> [modifications to the changelog & tweaked the fix]
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

I'm thinking

Fixes: 3417013e0d183be ("mm/migrate: Add folio_migrate_mapping()")

hence

Cc: <stable@vger.kernel.org>

And also

Reported-by: Charan Teja Kalla <quic_charante@quicinc.com>
  Closes: https://lkml.kernel.org/r/1700569840-17327-1-git-send-email-quic_charante@quicinc.com
Charan Teja Kalla Feb. 6, 2024, 2:54 p.m. UTC | #2
Hi Matthew,

It seems that the issue is not completely fixed by the change below.
This time it is because ->private is not filled in for the subpages of
a THP, which can result in the same type of issue this patch tries to fix:

1) For reclaim of an anon THP, the page is first added to the swap
cache. As part of this, contiguous swap space of THP size is allocated,
folio_nr_pages swap cache indices are filled with the folio, and each
subpage's ->private is set to the respective swap entry.
add_to_swap_cache():
	for (i = 0; i < nr; i++) {
		set_page_private(folio_page(folio, i), entry.val + i);
		xas_store(&xas, folio);
		xas_next(&xas);
	}

2) Migrating a folio that is sitting in the swap cache sets only the
head page's newfolio->private to the swap entry. The tail pages are missed.
folio_migrate_mapping():
	newfolio->private = folio_get_private(folio);
	for (i = 0; i < entries; i++) {
		xas_store(&xas, newfolio);
		xas_next(&xas);
	}

3) Now, when this migrated page that is still sitting in the swap
cache is migrated again, the THP can be split (see
migrate_pages()->try_split_thp), which ends up storing the subpages in
the respective swap cache indices, but these subpages' ->private does
not contain a valid swap entry.
__split_huge_page():
	for (i = nr - 1; i >= 1; i--) {
		if {
		...
		} else if (swap_cache) {
			__xa_store(&swap_cache->i_pages, offset + i,
					head + i, 0);
		}
	}

This leads to a state where all the subpages are sitting in the swap
cache with ->private not holding a valid value. When
delete_from_swap_cache() is called on a subpage, it replaces the
__wrong swap cache index__ with a NULL/shadow value and subsequently
decreases the refcount of the subpage.
	for (i = 0; i < nr; i++) {
		if (subpage == page)
			continue;
			
		free_page_and_swap_cache(subpage):
			free_swap_cache(page);
			(Calls delete_from_swap_cache())
			put_page(page);
	}

Consider a folio just sitting in the swap cache: its subpages will now
have a refcount of 1 (decremented by the swap cache deletion +
put_page) instead of 3 (isolate + swap cache + lru_add_page_tail()).
Now migrate_pages() is tried again on these split THP pages, but the
refcount of '1' causes each page to be freed directly (unmap_and_move).

So this leads to a state of a "freed page entry in the swap cache",
which can cause various corruptions, loops under the RCU read lock
(observed in mapping_get_entry), etc.
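
For reference, the check that makes such a page look already freed is
roughly this (paraphrased from the 6.1-era unmap_and_move(), not an
exact quote):

	if (page_count(page) == 1) {
		/* Page was freed from under us. So we are done. */
		ClearPageActive(page);
		ClearPageUnevictable(page);
		goto out;
	}

The page is then released without ever being removed from the swap
cache, which is what leaves a freed page behind in the swap cache slots.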

It seems the change below is also required here.

diff --git a/mm/migrate.c b/mm/migrate.c
index 9f5f52d..8049f4e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -427,10 +427,8 @@ int folio_migrate_mapping(struct address_space *mapping,
        folio_ref_add(newfolio, nr); /* add cache reference */
        if (folio_test_swapbacked(folio)) {
                __folio_set_swapbacked(newfolio);
-               if (folio_test_swapcache(folio)) {
+               if (folio_test_swapcache(folio))
                        folio_set_swapcache(newfolio);
-                       newfolio->private = folio_get_private(folio);
-               }
                entries = nr;
        } else {
                VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
@@ -446,6 +444,8 @@ int folio_migrate_mapping(struct address_space *mapping,

        /* Swap cache still stores N entries instead of a high-order entry */
        for (i = 0; i < entries; i++) {
+               set_page_private(folio_page(newfolio, i),
+                               folio_page(folio, i)->private);
                xas_store(&xas, newfolio);
                xas_next(&xas);
        }



On 12/14/2023 10:28 AM, Matthew Wilcox (Oracle) wrote:
> From: Charan Teja Kalla <quic_charante@quicinc.com>
> 
> Large folios occupy N consecutive entries in the swap cache
> instead of using multi-index entries like the page cache.
> However, if a large folio is re-added to the LRU list, it can
> be migrated.  The migration code was not aware of the difference
> between the swap cache and the page cache and assumed that a single
> xas_store() would be sufficient.
> 
> This leaves potentially many stale pointers to the now-migrated folio
> in the swap cache, which can lead to almost arbitrary data corruption
> in the future.  This can also manifest as infinite loops with the
> RCU read lock held.
> 
> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
> [modifications to the changelog & tweaked the fix]
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  mm/migrate.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index d9d2b9432e81..2d67ca47d2e2 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -405,6 +405,7 @@ int folio_migrate_mapping(struct address_space *mapping,
>  	int dirty;
>  	int expected_count = folio_expected_refs(mapping, folio) + extra_count;
>  	long nr = folio_nr_pages(folio);
> +	long entries, i;
>  
>  	if (!mapping) {
>  		/* Anonymous page without mapping */
> @@ -442,8 +443,10 @@ int folio_migrate_mapping(struct address_space *mapping,
>  			folio_set_swapcache(newfolio);
>  			newfolio->private = folio_get_private(folio);
>  		}
> +		entries = nr;
>  	} else {
>  		VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
> +		entries = 1;
>  	}
>  
>  	/* Move dirty while page refs frozen and newpage not yet exposed */
> @@ -453,7 +456,11 @@ int folio_migrate_mapping(struct address_space *mapping,
>  		folio_set_dirty(newfolio);
>  	}
>  
> -	xas_store(&xas, newfolio);
> +	/* Swap cache still stores N entries instead of a high-order entry */
> +	for (i = 0; i < entries; i++) {
> +		xas_store(&xas, newfolio);
> +		xas_next(&xas);
> +	}
>  
>  	/*
>  	 * Drop cache reference from old page by unfreezing
Charan Teja Kalla Feb. 7, 2024, 2:38 p.m. UTC | #3
I am sorry that I didn't check the latest kernel before posting. We are
working on 6.1 LTS kernel where the mentioned issue is observed.  I
assume the analysis still holds true on 6.1 kernel.

I see that this is not a problem on the latest kernels because of [1],
by David.

[1]
https://lore.kernel.org/linux-mm/20230821160849.531668-2-david@redhat.com/#Z31mm:huge_memory.c

Thanks,
Charan
On 2/6/2024 8:24 PM, Charan Teja Kalla wrote:
> Hi Matthew,
> 
> It seems that the issue is not completely fixed by the change below.
> This time it is because ->private is not filled in for the subpages of
> a THP, which can result in the same type of issue this patch tries to fix:
> 
> 1) For reclaim of an anon THP, the page is first added to the swap
> cache. As part of this, contiguous swap space of THP size is allocated,
> folio_nr_pages swap cache indices are filled with the folio, and each
> subpage's ->private is set to the respective swap entry.
> add_to_swap_cache():
> 	for (i = 0; i < nr; i++) {
> 		set_page_private(folio_page(folio, i), entry.val + i);
> 		xas_store(&xas, folio);
> 		xas_next(&xas);
> 	}
> 
> 2) Migrating a folio that is sitting in the swap cache sets only the
> head page's newfolio->private to the swap entry. The tail pages are missed.
> folio_migrate_mapping():
> 	newfolio->private = folio_get_private(folio);
> 	for (i = 0; i < entries; i++) {
> 		xas_store(&xas, newfolio);
> 		xas_next(&xas);
> 	}
> 
> 3) Now, when this migrated page that is still sitting in the swap
> cache is migrated again, the THP can be split (see
> migrate_pages()->try_split_thp), which ends up storing the subpages in
> the respective swap cache indices, but these subpages' ->private does
> not contain a valid swap entry.
> __split_huge_page():
> 	for (i = nr - 1; i >= 1; i--) {
> 		if {
> 		...
> 		} else if (swap_cache) {
> 			__xa_store(&swap_cache->i_pages, offset + i,
> 					head + i, 0);
> 		}
> 	}
> 
> This leads to a state where all the subpages are sitting in the swap
> cache with ->private not holding a valid value. When
> delete_from_swap_cache() is called on a subpage, it replaces the
> __wrong swap cache index__ with a NULL/shadow value and subsequently
> decreases the refcount of the subpage.
> 	for (i = 0; i < nr; i++) {
> 		if (subpage == page)
> 			continue;
> 			
> 		free_page_and_swap_cache(subpage):
> 			free_swap_cache(page);
> 			(Calls delete_from_swap_cache())
> 			put_page(page);
> 	}
> 
> Consider a folio just sitting in the swap cache: its subpages will now
> have a refcount of 1 (decremented by the swap cache deletion +
> put_page) instead of 3 (isolate + swap cache + lru_add_page_tail()).
> Now migrate_pages() is tried again on these split THP pages, but the
> refcount of '1' causes each page to be freed directly (unmap_and_move).
> 
> So this leads to a state of a "freed page entry in the swap cache",
> which can cause various corruptions, loops under the RCU read lock
> (observed in mapping_get_entry), etc.
> 
> It seems the change below is also required here.
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 9f5f52d..8049f4e 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -427,10 +427,8 @@ int folio_migrate_mapping(struct address_space *mapping,
>         folio_ref_add(newfolio, nr); /* add cache reference */
>         if (folio_test_swapbacked(folio)) {
>                 __folio_set_swapbacked(newfolio);
> -               if (folio_test_swapcache(folio)) {
> +               if (folio_test_swapcache(folio))
>                         folio_set_swapcache(newfolio);
> -                       newfolio->private = folio_get_private(folio);
> -               }
>                 entries = nr;
>         } else {
>                 VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
> @@ -446,6 +444,8 @@ int folio_migrate_mapping(struct address_space *mapping,
> 
>         /* Swap cache still stores N entries instead of a high-order entry */
>         for (i = 0; i < entries; i++) {
> +               set_page_private(folio_page(newfolio, i),
> +                               folio_page(folio, i)->private);


I see that __split_huge_page_tail() expects page_tail->private to be
zero, so this patch is also wrong :(.

>                 xas_store(&xas, newfolio);
>                 xas_next(&xas);
>         }
> 
> 
> 
> On 12/14/2023 10:28 AM, Matthew Wilcox (Oracle) wrote:
>> From: Charan Teja Kalla <quic_charante@quicinc.com>
>>
>> Large folios occupy N consecutive entries in the swap cache
>> instead of using multi-index entries like the page cache.
>> However, if a large folio is re-added to the LRU list, it can
>> be migrated.  The migration code was not aware of the difference
>> between the swap cache and the page cache and assumed that a single
>> xas_store() would be sufficient.
>>
>> This leaves potentially many stale pointers to the now-migrated folio
>> in the swap cache, which can lead to almost arbitrary data corruption
>> in the future.  This can also manifest as infinite loops with the
>> RCU read lock held.
>>
>> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
>> [modifications to the changelog & tweaked the fix]
>> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>> ---
>>  mm/migrate.c | 9 ++++++++-
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index d9d2b9432e81..2d67ca47d2e2 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -405,6 +405,7 @@ int folio_migrate_mapping(struct address_space *mapping,
>>  	int dirty;
>>  	int expected_count = folio_expected_refs(mapping, folio) + extra_count;
>>  	long nr = folio_nr_pages(folio);
>> +	long entries, i;
>>  
>>  	if (!mapping) {
>>  		/* Anonymous page without mapping */
>> @@ -442,8 +443,10 @@ int folio_migrate_mapping(struct address_space *mapping,
>>  			folio_set_swapcache(newfolio);
>>  			newfolio->private = folio_get_private(folio);
>>  		}
>> +		entries = nr;
>>  	} else {
>>  		VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
>> +		entries = 1;
>>  	}
>>  
>>  	/* Move dirty while page refs frozen and newpage not yet exposed */
>> @@ -453,7 +456,11 @@ int folio_migrate_mapping(struct address_space *mapping,
>>  		folio_set_dirty(newfolio);
>>  	}
>>  
>> -	xas_store(&xas, newfolio);
>> +	/* Swap cache still stores N entries instead of a high-order entry */
>> +	for (i = 0; i < entries; i++) {
>> +		xas_store(&xas, newfolio);
>> +		xas_next(&xas);
>> +	}
>>  
>>  	/*
>>  	 * Drop cache reference from old page by unfreezing

Patch

diff --git a/mm/migrate.c b/mm/migrate.c
index d9d2b9432e81..2d67ca47d2e2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -405,6 +405,7 @@  int folio_migrate_mapping(struct address_space *mapping,
 	int dirty;
 	int expected_count = folio_expected_refs(mapping, folio) + extra_count;
 	long nr = folio_nr_pages(folio);
+	long entries, i;
 
 	if (!mapping) {
 		/* Anonymous page without mapping */
@@ -442,8 +443,10 @@  int folio_migrate_mapping(struct address_space *mapping,
 			folio_set_swapcache(newfolio);
 			newfolio->private = folio_get_private(folio);
 		}
+		entries = nr;
 	} else {
 		VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
+		entries = 1;
 	}
 
 	/* Move dirty while page refs frozen and newpage not yet exposed */
@@ -453,7 +456,11 @@  int folio_migrate_mapping(struct address_space *mapping,
 		folio_set_dirty(newfolio);
 	}
 
-	xas_store(&xas, newfolio);
+	/* Swap cache still stores N entries instead of a high-order entry */
+	for (i = 0; i < entries; i++) {
+		xas_store(&xas, newfolio);
+		xas_next(&xas);
+	}
 
 	/*
 	 * Drop cache reference from old page by unfreezing