diff mbox series

[RFC,v2,6/9] mm/khugepaged: remove reuse_swap_page() usage

Message ID 20220126095557.32392-7-david@redhat.com (mailing list archive)
State New
Headers show
Series mm: COW fixes part 1: fix the COW security issue for THP and swap | expand

Commit Message

David Hildenbrand Jan. 26, 2022, 9:55 a.m. UTC
reuse_swap_page() currently indicates if we can write to an anon page
without COW. A COW is required if the page is shared by multiple
processes (either already mapped or via swap entries) or if there is
concurrent writeback that cannot tolerate concurrent page modifications.

reuse_swap_page() doesn't check for pending references from other
processes that already unmapped the page, however,
is_refcount_suitable() essentially does the same thing in the context of
khugepaged. khugepaged is the last remaining user of reuse_swap_page() and
we want to remove that function.

In the context of khugepaged, we are not actually going to write to the
page and we don't really care about other processes mapping the page:
for example, without swap, we don't care about shared pages at all.

The current logic seems to be:
* Writable: -> Not shared, but might be in the swapcache. Nobody can
  fault it in from the swapcache as there are no other swap entries.
* Readable and not in the swapcache: Might be shared (but nobody can
  fault it in from the swapcache).
* Readable and in the swapcache: Might be shared and someone might be
  able to fault it in from the swapcache. Make sure we're the exclusive
  owner via reuse_swap_page().

Having to guess due to lack of comments and documentation, the current
logic really only wants to make sure that a page that might be shared
cannot be faulted in from the swapcache while khugepaged is active.
It's hard to guess why that is that case and if it's really still required,
but let's try keeping that logic unmodified.

Instead of relying on reuse_swap_page(), let's unconditionally
try_to_free_swap(), special casing PageKsm(). try_to_free_swap() will fail
if there are still swap entries targeting the page or if the page is under
writeback.

After a successful try_to_free_swap() that page cannot be readded to the
swapcache because we're keeping the page locked and removed from the LRU
until we actually perform the copy. So once we succeeded removing a page
from the swapcache, it cannot be re-added until we're done copying. Add a
comment stating that.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/khugepaged.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

Comments

Yang Shi Jan. 27, 2022, 9:23 p.m. UTC | #1
On Wed, Jan 26, 2022 at 2:00 AM David Hildenbrand <david@redhat.com> wrote:
>
> reuse_swap_page() currently indicates if we can write to an anon page
> without COW. A COW is required if the page is shared by multiple
> processes (either already mapped or via swap entries) or if there is
> concurrent writeback that cannot tolerate concurrent page modifications.
>
> reuse_swap_page() doesn't check for pending references from other
> processes that already unmapped the page, however,
> is_refcount_suitable() essentially does the same thing in the context of
> khugepaged. khugepaged is the last remaining user of reuse_swap_page() and
> we want to remove that function.
>
> In the context of khugepaged, we are not actually going to write to the
> page and we don't really care about other processes mapping the page:
> for example, without swap, we don't care about shared pages at all.
>
> The current logic seems to be:
> * Writable: -> Not shared, but might be in the swapcache. Nobody can
>   fault it in from the swapcache as there are no other swap entries.
> * Readable and not in the swapcache: Might be shared (but nobody can
>   fault it in from the swapcache).
> * Readable and in the swapcache: Might be shared and someone might be
>   able to fault it in from the swapcache. Make sure we're the exclusive
>   owner via reuse_swap_page().
>
> Having to guess due to lack of comments and documentation, the current
> logic really only wants to make sure that a page that might be shared
> cannot be faulted in from the swapcache while khugepaged is active.
> It's hard to guess why that is that case and if it's really still required,
> but let's try keeping that logic unmodified.

I don't think it could be faulted in while khugepaged is active since
khugepaged does hold mmap_lock in write mode IIUC. So page fault is
serialized against khugepaged.

My wild guess is that collapsing shared pages was not supported before
v5.8, so we need reuse_swap_page() to tell us if the page in swap
cache is shared or not. But it is not true anymore. And khugepaged
just allocates a THP then copy the data from base pages to huge page
then replace PTEs to PMD, it doesn't change the content of the page,
so I failed to see a problem by collapsing a shared page in swap
cache. But I'm really not entirely sure, I may miss something...



>
> Instead of relying on reuse_swap_page(), let's unconditionally
> try_to_free_swap(), special casing PageKsm(). try_to_free_swap() will fail
> if there are still swap entries targeting the page or if the page is under
> writeback.
>
> After a successful try_to_free_swap() that page cannot be readded to the
> swapcache because we're keeping the page locked and removed from the LRU
> until we actually perform the copy. So once we succeeded removing a page
> from the swapcache, it cannot be re-added until we're done copying. Add a
> comment stating that.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/khugepaged.c | 16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 35f14d0a00a6..bc0ff598e98f 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -683,10 +683,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>                         goto out;
>                 }
>                 if (!pte_write(pteval) && PageSwapCache(page) &&
> -                               !reuse_swap_page(page)) {
> +                   (PageKsm(page) || !try_to_free_swap(page))) {
>                         /*
> -                        * Page is in the swap cache and cannot be re-used.
> -                        * It cannot be collapsed into a THP.
> +                        * Possibly shared page cannot be removed from the
> +                        * swapache. It cannot be collapsed into a THP.
>                          */
>                         unlock_page(page);
>                         result = SCAN_SWAP_CACHE_PAGE;
> @@ -702,6 +702,16 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>                         result = SCAN_DEL_PAGE_LRU;
>                         goto out;
>                 }
> +
> +               /*
> +                * We're holding the page lock and removed the page from the
> +                * LRU. Once done copying, we'll unlock and readd to the
> +                * LRU via release_pte_page(). If the page is still in the
> +                * swapcache, we're the exclusive owner. Due to the page lock
> +                * the page cannot be added to the swapcache until we're done
> +                * and consequently it cannot be faulted in from the swapcache
> +                * into another process.
> +                */
>                 mod_node_page_state(page_pgdat(page),
>                                 NR_ISOLATED_ANON + page_is_file_lru(page),
>                                 compound_nr(page));
> --
> 2.34.1
>
David Hildenbrand Jan. 28, 2022, 8:41 a.m. UTC | #2
On 27.01.22 22:23, Yang Shi wrote:
> On Wed, Jan 26, 2022 at 2:00 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> reuse_swap_page() currently indicates if we can write to an anon page
>> without COW. A COW is required if the page is shared by multiple
>> processes (either already mapped or via swap entries) or if there is
>> concurrent writeback that cannot tolerate concurrent page modifications.
>>
>> reuse_swap_page() doesn't check for pending references from other
>> processes that already unmapped the page, however,
>> is_refcount_suitable() essentially does the same thing in the context of
>> khugepaged. khugepaged is the last remaining user of reuse_swap_page() and
>> we want to remove that function.
>>
>> In the context of khugepaged, we are not actually going to write to the
>> page and we don't really care about other processes mapping the page:
>> for example, without swap, we don't care about shared pages at all.
>>
>> The current logic seems to be:
>> * Writable: -> Not shared, but might be in the swapcache. Nobody can
>>   fault it in from the swapcache as there are no other swap entries.
>> * Readable and not in the swapcache: Might be shared (but nobody can
>>   fault it in from the swapcache).
>> * Readable and in the swapcache: Might be shared and someone might be
>>   able to fault it in from the swapcache. Make sure we're the exclusive
>>   owner via reuse_swap_page().
>>
>> Having to guess due to lack of comments and documentation, the current
>> logic really only wants to make sure that a page that might be shared
>> cannot be faulted in from the swapcache while khugepaged is active.
>> It's hard to guess why that is that case and if it's really still required,
>> but let's try keeping that logic unmodified.
> 
> I don't think it could be faulted in while khugepaged is active since
> khugepaged does hold mmap_lock in write mode IIUC. So page fault is
> serialized against khugepaged.

It could get faulted in by another process sharing the page, because we 
only synchronize against the current process.

> 
> My wild guess is that collapsing shared pages was not supported before
> v5.8, so we need reuse_swap_page() to tell us if the page in swap
> cache is shared or not. But it is not true anymore. And khugepaged
> just allocates a THP then copy the data from base pages to huge page
> then replace PTEs to PMD, it doesn't change the content of the page,
> so I failed to see a problem by collapsing a shared page in swap
> cache. But I'm really not entirely sure, I may miss something...


Looking more closely where this logic originates from, it was introduced in:

commit 10359213d05acf804558bda7cc9b8422a828d1cd
Author: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Date:   Wed Feb 11 15:28:28 2015 -0800

    mm: incorporate read-only pages into transparent huge pages
    
    This patch aims to improve THP collapse rates, by allowing THP collapse in
    the presence of read-only ptes, like those left in place by do_swap_page
    after a read fault.
    
    Currently THP can collapse 4kB pages into a THP when there are up to
    khugepaged_max_ptes_none pte_none ptes in a 2MB range.  This patch applies
    the same limit for read-only ptes.


The change essentially results in a read-only mapped PTE page getting copied and
mapped writable via a new PMD-mapped THP.

It mentions do_swap_page(), so I assume it just tried to do what do_swap_page()
would do when trying to map a page swapped in from the page cache writable
immediately.

But we differ from do_swap_page() that we're not actually going to map the page
writable, we're going to copy the page (__collapse_huge_page_copy()) and map
the copy writable.

I assume we can remove that logic.
Yang Shi Jan. 28, 2022, 5:10 p.m. UTC | #3
On Fri, Jan 28, 2022 at 12:41 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 27.01.22 22:23, Yang Shi wrote:
> > On Wed, Jan 26, 2022 at 2:00 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> reuse_swap_page() currently indicates if we can write to an anon page
> >> without COW. A COW is required if the page is shared by multiple
> >> processes (either already mapped or via swap entries) or if there is
> >> concurrent writeback that cannot tolerate concurrent page modifications.
> >>
> >> reuse_swap_page() doesn't check for pending references from other
> >> processes that already unmapped the page, however,
> >> is_refcount_suitable() essentially does the same thing in the context of
> >> khugepaged. khugepaged is the last remaining user of reuse_swap_page() and
> >> we want to remove that function.
> >>
> >> In the context of khugepaged, we are not actually going to write to the
> >> page and we don't really care about other processes mapping the page:
> >> for example, without swap, we don't care about shared pages at all.
> >>
> >> The current logic seems to be:
> >> * Writable: -> Not shared, but might be in the swapcache. Nobody can
> >>   fault it in from the swapcache as there are no other swap entries.
> >> * Readable and not in the swapcache: Might be shared (but nobody can
> >>   fault it in from the swapcache).
> >> * Readable and in the swapcache: Might be shared and someone might be
> >>   able to fault it in from the swapcache. Make sure we're the exclusive
> >>   owner via reuse_swap_page().
> >>
> >> Having to guess due to lack of comments and documentation, the current
> >> logic really only wants to make sure that a page that might be shared
> >> cannot be faulted in from the swapcache while khugepaged is active.
> >> It's hard to guess why that is that case and if it's really still required,
> >> but let's try keeping that logic unmodified.
> >
> > I don't think it could be faulted in while khugepaged is active since
> > khugepaged does hold mmap_lock in write mode IIUC. So page fault is
> > serialized against khugepaged.
>
> It could get faulted in by another process sharing the page, because we
> only synchronize against the current process.

Yes, sure. I'm supposed you meant do_swap_page() called by another
process. But it is serialized by page lock. So khugepaged won't see
something in limbo state IIUC.

>
> >
> > My wild guess is that collapsing shared pages was not supported before
> > v5.8, so we need reuse_swap_page() to tell us if the page in swap
> > cache is shared or not. But it is not true anymore. And khugepaged
> > just allocates a THP then copy the data from base pages to huge page
> > then replace PTEs to PMD, it doesn't change the content of the page,
> > so I failed to see a problem by collapsing a shared page in swap
> > cache. But I'm really not entirely sure, I may miss something...
>
>
> Looking more closely where this logic originates from, it was introduced in:
>
> commit 10359213d05acf804558bda7cc9b8422a828d1cd
> Author: Ebru Akagunduz <ebru.akagunduz@gmail.com>
> Date:   Wed Feb 11 15:28:28 2015 -0800
>
>     mm: incorporate read-only pages into transparent huge pages
>
>     This patch aims to improve THP collapse rates, by allowing THP collapse in
>     the presence of read-only ptes, like those left in place by do_swap_page
>     after a read fault.
>
>     Currently THP can collapse 4kB pages into a THP when there are up to
>     khugepaged_max_ptes_none pte_none ptes in a 2MB range.  This patch applies
>     the same limit for read-only ptes.
>
>
> The change essentially results in a read-only mapped PTE page getting copied and
> mapped writable via a new PMD-mapped THP.
>
> It mentions do_swap_page(), so I assume it just tried to do what do_swap_page()
> would do when trying to map a page swapped in from the page cache writable
> immediately.
>
> But we differ from do_swap_page() that we're not actually going to map the page
> writable, we're going to copy the page (__collapse_huge_page_copy()) and map
> the copy writable.

Yeah, this is the point. Khugepaged or the process being collapsed
won't write to the original page. Just unshare it.

>
> I assume we can remove that logic.
>
> --
> Thanks,
>
> David / dhildenb
>
diff mbox series

Patch

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 35f14d0a00a6..bc0ff598e98f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -683,10 +683,10 @@  static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			goto out;
 		}
 		if (!pte_write(pteval) && PageSwapCache(page) &&
-				!reuse_swap_page(page)) {
+		    (PageKsm(page) || !try_to_free_swap(page))) {
 			/*
-			 * Page is in the swap cache and cannot be re-used.
-			 * It cannot be collapsed into a THP.
+			 * Possibly shared page cannot be removed from the
+			 * swapache. It cannot be collapsed into a THP.
 			 */
 			unlock_page(page);
 			result = SCAN_SWAP_CACHE_PAGE;
@@ -702,6 +702,16 @@  static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			result = SCAN_DEL_PAGE_LRU;
 			goto out;
 		}
+
+		/*
+		 * We're holding the page lock and removed the page from the
+		 * LRU. Once done copying, we'll unlock and readd to the
+		 * LRU via release_pte_page(). If the page is still in the
+		 * swapcache, we're the exclusive owner. Due to the page lock
+		 * the page cannot be added to the swapcache until we're done
+		 * and consequently it cannot be faulted in from the swapcache
+		 * into another process.
+		 */
 		mod_node_page_state(page_pgdat(page),
 				NR_ISOLATED_ANON + page_is_file_lru(page),
 				compound_nr(page));