
mm/huge_memory: don't clear active swapcache entry from page->private

Message ID 20220906190602.1626037-1-bfoster@redhat.com
State New
Series mm/huge_memory: don't clear active swapcache entry from page->private

Commit Message

Brian Foster Sept. 6, 2022, 7:06 p.m. UTC
If a swap cache resident hugepage is passed into
__split_huge_page(), the tail pages are incrementally split off and
each offset in the swap cache covered by the hugepage is updated to
point to the associated subpage instead of the original head page.
As a final step, each subpage is individually passed to
free_page_and_swap_cache() to free the associated swap cache entry
and release the page. This eventually lands in
delete_from_swap_cache(), which refers to page->private for the
swp_entry_t, which in turn encodes the swap address space and page
offset information.
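
For reference, a simplified sketch of that consumption side (paraphrased
from mm/swap_state.c; abridged, not the verbatim code):

	/* swapcache pages keep their swp_entry_t value in page->private */
	swp_entry_t entry = { .val = page_private(page) };
	/* both the address space and the slot derive from that value */
	struct address_space *mapping = swap_address_space(entry);
	pgoff_t offset = swp_offset(entry);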

The problem here is that the earlier call to
__split_huge_page_tail() clears page->private of each tail page in
the hugepage. This means that the swap entry passed to
__delete_from_swap_cache() is zeroed, resulting in a bogus address
space and offset tuple for the swapcache update. If DEBUG_VM is
enabled, this results in a BUG() in the latter function upon
detection of the old value in the swap address space not matching
the page being removed.
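
With DEBUG_VM, the check that fires looks roughly like the following
(paraphrased from __delete_from_swap_cache(); simplified):

	/* xas was initialized from the swp_entry_t-derived address space
	 * and offset, so a zeroed ->private points it at type 0, offset 0
	 * rather than at the page's real swapcache slot */
	void *old = xas_store(&xas, shadow);
	VM_BUG_ON_PAGE(old != page, old);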

The ramifications are less clear if DEBUG_VM is not enabled. In the
particular stress-ng workload that reproduces this problem, this
reliably occurs via MADV_PAGEOUT, which eventually triggers swap
cache reclaim before the madvise() call returns. The swap cache
reclaim sequence attempts to reuse the entry that should have been
freed by the delete operation, but since that failed to correctly
update the swap address space, swap cache reclaim attempts to look
up the already freed page still stored at said offset and falls into
a tight loop in find_get_page() -> __filemap_get_folio() due to
repeated folio_try_get_rcu() (reference count update) failures.
This leads to a soft lockup BUG from which the system never seems
to recover.
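
The loop in question, paraphrased from the lookup side in mm/filemap.c
(heavily simplified):

repeat:
	folio = xas_load(&xas);
	/* the stale folio left behind by the bogus delete has already
	 * been freed, so its refcount is zero and this fails on every
	 * pass; the lookup spins here indefinitely */
	if (folio && !folio_try_get_rcu(folio))
		goto repeat;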

To avoid this problem, update __split_huge_page_tail() to not clear
page->private when the associated page has the swap cache flag set.
Note that this flag is transferred to the tail page by the preceding
->flags update.
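
The flag transfer in question is the mask copy earlier in
__split_huge_page_tail(), roughly (abridged):

	page_tail->flags |= (head->flags &
			((1L << PG_swapbacked) |
			 (1L << PG_swapcache) |
			 /* ... other preserved flags elided ... */
			 (1L << PG_dirty)));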

Fixes: b653db77350c7 ("mm: Clear page->private when splitting or migrating a page")
Signed-off-by: Brian Foster <bfoster@redhat.com>
---

Original bug report is here [1]. I figure there are probably at least a
couple of different ways to fix this problem, but I started with what
seemed most straightforward. Thoughts appreciated.

Brian

[1] https://lore.kernel.org/linux-mm/YxDyZLfBdFHK1Y1P@bfoster/

 mm/huge_memory.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Brian Foster Oct. 17, 2022, 4:14 p.m. UTC | #1
On Tue, Sep 06, 2022 at 03:06:02PM -0400, Brian Foster wrote:
> If a swap cache resident hugepage is passed into
> __split_huge_page(), the tail pages are incrementally split off and
> each offset in the swap cache covered by the hugepage is updated to
> point to the associated subpage instead of the original head page.
>
> [ ... ]

Ping? I can still reproduce this on the latest kernels as of last week
or so.

Brian

Kirill A. Shutemov Oct. 18, 2022, 1:39 p.m. UTC | #2
On Tue, Sep 06, 2022 at 03:06:02PM -0400, Brian Foster wrote:
> [ ... ]
> Fixes: b653db77350c7 ("mm: Clear page->private when splitting or migrating a page")
> Signed-off-by: Brian Foster <bfoster@redhat.com>

stable@ ?

> [ ... ]
> @@ -2445,7 +2445,8 @@ static void __split_huge_page_tail(struct page *head, int tail,
>  			page_tail);
>  	page_tail->mapping = head->mapping;
>  	page_tail->index = head->index + tail;
> -	page_tail->private = 0;
> +	if (!PageSwapCache(page_tail))
> +		page_tail->private = 0;

The patch looks good to me, but this check deserves a comment.

Otherwise:

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Brian Foster Oct. 18, 2022, 5:41 p.m. UTC | #3
On Tue, Oct 18, 2022 at 04:39:23PM +0300, Kirill A. Shutemov wrote:
> On Tue, Sep 06, 2022 at 03:06:02PM -0400, Brian Foster wrote:
> > [ ... ]
> 
> stable@ ?
> 

Ok.

> > [ ... ]
> > @@ -2445,7 +2445,8 @@ static void __split_huge_page_tail(struct page *head, int tail,
> >  			page_tail);
> >  	page_tail->mapping = head->mapping;
> >  	page_tail->index = head->index + tail;
> > -	page_tail->private = 0;
> > +	if (!PageSwapCache(page_tail))
> > +		page_tail->private = 0;
> 
> The patch looks good to me, but this check deserves a comment.
> 

Sure... not sure how descriptive a comment you're looking for. Something
like the following perhaps?

"If the hugepage is in swapcache, page_tail->private tracks the
swp_entry_t of the tail page. We can't clear it until the tail page is
removed from swapcache."

I'll wait a bit for any further comment on Andrew's question [1] in the
thread for the issue reported by Oleksandr (which so far also appears to
be resolved by this patch). Barring further feedback, I'll plan a v2
that includes something like the above.
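
For illustration, the v2 hunk would look something like this (modulo
final comment wording):

	page_tail->mapping = head->mapping;
	page_tail->index = head->index + tail;
	/*
	 * If the hugepage is in swapcache, page_tail->private tracks
	 * the swp_entry_t of the tail page. We can't clear it until
	 * the tail page is removed from swapcache.
	 */
	if (!PageSwapCache(page_tail))
		page_tail->private = 0;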

> Otherwise:
> 
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> 

Thanks!

Brian

[1] https://lore.kernel.org/linux-mm/20221017152423.37a126325b4330e71cf8f869@linux-foundation.org/

Patch

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e9414ee57c5b..c2ddbb81a743 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2445,7 +2445,8 @@ static void __split_huge_page_tail(struct page *head, int tail,
 			page_tail);
 	page_tail->mapping = head->mapping;
 	page_tail->index = head->index + tail;
-	page_tail->private = 0;
+	if (!PageSwapCache(page_tail))
+		page_tail->private = 0;
 
 	/* Page flags must be visible before we make the page non-compound. */
 	smp_wmb();