
[04/15] shmem: shmem_writepage() split unlikely i915 THP

Message ID 20200919042009.bomzxmrg7%akpm@linux-foundation.org (mailing list archive)
State New, archived
Series [01/15] mailmap: add older email addresses for Kees Cook

Commit Message

Andrew Morton Sept. 19, 2020, 4:20 a.m. UTC
From: Hugh Dickins <hughd@google.com>
Subject: shmem: shmem_writepage() split unlikely i915 THP

drivers/gpu/drm/i915/gem/i915_gem_shmem.c contains a shmem_writeback()
which calls shmem_writepage() from a shrinker: that usually works well
enough; but if /sys/kernel/mm/transparent_hugepage/shmem_enabled has been
set to "force" (documented as "Force the huge option on for all - very
useful for testing"), shmem_writepage() is surprised to be called with a
huge page, and crashes on the VM_BUG_ON_PAGE(PageCompound) (I did not find
out where the crash happens when CONFIG_DEBUG_VM is off).

LRU page reclaim always splits the shmem huge page first: I'd prefer not
to demand that of i915, so check and split compound in shmem_writepage().

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301401390.5954@eggly.anvils
Fixes: 2d6692e642e7 ("drm/i915: Start writeback from the shrinker")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Yang Shi <shy828301@gmail.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Qian Cai <cai@lca.pw>
Cc: <stable@vger.kernel.org>	[5.3+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/shmem.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

Comments

Matthew Wilcox (Oracle) Sept. 19, 2020, 4:44 a.m. UTC | #1
On Fri, Sep 18, 2020 at 09:20:09PM -0700, Andrew Morton wrote:
> LRU page reclaim always splits the shmem huge page first: I'd prefer not
> to demand that of i915, so check and split compound in shmem_writepage().

Sorry for not checking this earlier, but I don't think this is right.

        for (i = 0; i < obj->base.size >> PAGE_SHIFT; i++) {
...
                if (!page_mapped(page) && clear_page_dirty_for_io(page)) {
...
                        ret = mapping->a_ops->writepage(page, &wbc);

so we cleared the dirty bit on the entire hugepage, but then split
it after clearing the dirty bit, so the subpages are now not dirty.
I think we'll lose writes as a result?  At least we won't swap pages
out that deserve to be paged out.

>  
> -	VM_BUG_ON_PAGE(PageCompound(page), page);
> +	/*
> +	 * If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "force",
> +	 * then drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
> +	 * and its shmem_writeback() needs them to be split when swapping.
> +	 */
> +	if (PageTransCompound(page))
> +		if (split_huge_page(page) < 0)
> +			goto redirty;
> +
>  	BUG_ON(!PageLocked(page));
>  	mapping = page->mapping;
>  	index = page->index;
> _
>
Hugh Dickins Sept. 19, 2020, 5:44 a.m. UTC | #2
On Sat, 19 Sep 2020, Matthew Wilcox wrote:
> On Fri, Sep 18, 2020 at 09:20:09PM -0700, Andrew Morton wrote:
> > LRU page reclaim always splits the shmem huge page first: I'd prefer not
> > to demand that of i915, so check and split compound in shmem_writepage().
> 
> Sorry for not checking this earlier, but I don't think this is right.
> 
>         for (i = 0; i < obj->base.size >> PAGE_SHIFT; i++) {
> ...
>                 if (!page_mapped(page) && clear_page_dirty_for_io(page)) {
> ...
>                         ret = mapping->a_ops->writepage(page, &wbc);
> 
> so we cleared the dirty bit on the entire hugepage, but then split
> it after clearing the dirty bit, so the subpages are now not dirty.
> I think we'll lose writes as a result?  At least we won't swap pages
> out that deserve to be paged out.

Very good observation, thank you.

It behaves a lot better with this patch in than without it; but you're
right, only the head will get written to swap, and the tails left in
memory; with dirty cleared, so they may be left indefinitely (I've
not yet looked to see when if ever PageDirty might get set later).

Hmm. It may just be a matter of restyling the i915 code with

		if (!page_mapped(page)) {
			clear_page_dirty_for_io(page);

but I don't want to rush to that conclusion - there might turn
out to be a good way of doing it at the shmem_writepage() end, but
probably only hacks available.  I'll mull it over: it deserves some
thought about what would suit, if a THP arrived here some other way.

We did have some i915 guys Cc'ed on the original posting, but no
need to Cc them again until I'm closer to deciding what's right.

Linus, Andrew, probably best to drop this patch for now, since
no-one else has reported the problem here than me, testing with 
/sys/kernel/mm/transparent_hugepage/shmem_enabled set to "force";
and what it fixes is not a new regression in 5.9.

Though no harm done if the patch goes in: it is an improvement,
but seriously incomplete, as Matthew has observed.

Hugh

Matthew Wilcox (Oracle) Sept. 19, 2020, 4:18 p.m. UTC | #3
On Fri, Sep 18, 2020 at 10:44:32PM -0700, Hugh Dickins wrote:
> It behaves a lot better with this patch in than without it; but you're
> right, only the head will get written to swap, and the tails left in
> memory; with dirty cleared, so they may be left indefinitely (I've
> not yet looked to see when if ever PageDirty might get set later).
> 
> Hmm. It may just be a matter of restyling the i915 code with
> 
> 		if (!page_mapped(page)) {
> 			clear_page_dirty_for_io(page);
> 
> but I don't want to rush to that conclusion - there might turn
> out to be a good way of doing it at the shmem_writepage() end, but
> probably only hacks available.  I'll mull it over: it deserves some
> thought about what would suit, if a THP arrived here some other way.

I think the ultimate solution is to do as I have done for iomap and make
->writepage handle arbitrary sized pages.  However, I don't know the
swap I/O path particularly well, and I would rather not learn it just yet.

How about this for a band-aid until we sort that out properly?  Just mark
the page as dirty before splitting it so subsequent iterations see the
subpages as dirty.  Arguably, we should use set_page_dirty() instead of
SetPageDirty, but I don't think i915 cares.  In particular, it uses
an untagged iteration instead of searching for PAGECACHE_TAG_DIRTY.

diff --git a/mm/shmem.c b/mm/shmem.c
index 271548ca20f3..6231207ab1eb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1362,8 +1362,21 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	swp_entry_t swap;
 	pgoff_t index;
 
-	VM_BUG_ON_PAGE(PageCompound(page), page);
 	BUG_ON(!PageLocked(page));
+
+	/*
+	 * If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "force",
+	 * then drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
+	 * and its shmem_writeback() needs them to be split when swapping.
+	 */
+	if (PageTransCompound(page)) {
+		/* Ensure the subpages are still dirty */
+		SetPageDirty(page);
+		if (split_huge_page(page) < 0)
+			goto redirty;
+		ClearPageDirty(page);
+	}
+
 	mapping = page->mapping;
 	index = page->index;
 	inode = mapping->host;
Hugh Dickins Sept. 20, 2020, 12:16 a.m. UTC | #4
On Sat, 19 Sep 2020, Matthew Wilcox wrote:
> On Fri, Sep 18, 2020 at 10:44:32PM -0700, Hugh Dickins wrote:
> > It behaves a lot better with this patch in than without it; but you're
> > right, only the head will get written to swap, and the tails left in
> > memory; with dirty cleared, so they may be left indefinitely (I've
> > not yet looked to see when if ever PageDirty might get set later).
> > 
> > Hmm. It may just be a matter of restyling the i915 code with
> > 
> > 		if (!page_mapped(page)) {
> > 			clear_page_dirty_for_io(page);
> > 
> > but I don't want to rush to that conclusion - there might turn
> > out to be a good way of doing it at the shmem_writepage() end, but
> > probably only hacks available.  I'll mull it over: it deserves some
> > thought about what would suit, if a THP arrived here some other way.
> 
> I think the ultimate solution is to do as I have done for iomap and make
> ->writepage handle arbitrary sized pages.  However, I don't know the
> swap I/O path particularly well, and I would rather not learn it just yet.

Understood ;)

> 
> How about this for a band-aid until we sort that out properly?  Just mark
> the page as dirty before splitting it so subsequent iterations see the
> subpages as dirty.

This is certainly much better than my original, and I've had no problem
in running with it (briefly); and it's what I'll now keep in my testing
tree - thanks.  But it is more obviously a hack, or as you put it, a
band-aid: fit for Linus's tree?

This issue has only come up with i915 and shmem_enabled "force", nobody
has been bleeding except me: but if it's something that impedes or may
impede your own testing, or that you feel should go in anyway, say so and
I'll push your new improved version to akpm (adding your signoff to mine).

> Arguably, we should use set_page_dirty() instead of SetPageDirty,

My position on that has changed down the years: I went through a
phase of replacing shmem.c's SetPageDirtys by set_page_dirtys, with
the idea that its "if (!PageDirty)" is good to avoid setting cacheline
dirty.  Then Spectre changed the balance, so now I'd rather avoid the
indirect function call, and go with your SetPageDirty.  But the bdi
flags are such that they do the same thing, and if they ever diverge,
then we make the necessary changes at that time.

> but I don't think i915 cares.  In particular, it uses
> an untagged iteration instead of searching for PAGECACHE_TAG_DIRTY.

PAGECACHE_TAG_DIRTY comes with bdi flags that shmem does not use:
with one exception, every shmem page is always dirty (since its swap
is freed as soon as the page is moved from swap cache to file cache);
unless I'm forgetting something (like the tails in my patch!), the
only exception is a page allocated to a hole on fault (after which
a write fault will soon set pte dirty).

So I didn't quite get what i915 was doing with its set_page_dirty()s:
but very probably being a good citizen, marking when it exposed the
page to changes itself, making no assumption of what shmem.c does.

It would be good to have a conversation with i915 guys about huge pages
sometime (they forked off their own mount point in i915_gemfs_init(),
precisely in order to make use of huge pages, but couldn't get them
working, so removed the option to ask for them, but kept the separate
mount point.  But not a conversation I'll be responsive on at present.)

Hugh

Matthew Wilcox (Oracle) Sept. 20, 2020, 3:32 a.m. UTC | #5
On Sat, Sep 19, 2020 at 05:16:57PM -0700, Hugh Dickins wrote:
> On Sat, 19 Sep 2020, Matthew Wilcox wrote:
> > How about this for a band-aid until we sort that out properly?  Just mark
> > the page as dirty before splitting it so subsequent iterations see the
> > subpages as dirty.
> 
> This is certainly much better than my original, and I've had no problem
> in running with it (briefly); and it's what I'll now keep in my testing
> tree - thanks.  But it is more obviously a hack, or as you put it, a
> band-aid: fit for Linus's tree?
> 
> This issue has only come up with i915 and shmem_enabled "force", nobody
> has been bleeding except me: but if it's something that impedes or may
> impede your own testing, or that you feel should go in anyway, say so and
> I'll push your new improved version to akpm (adding your signoff to mine).

No, it's not impeding my testing; I don't have i915 even compiled into
my kernel.

> > Arguably, we should use set_page_dirty() instead of SetPageDirty,
> 
> My position on that has changed down the years: I went through a
> phase of replacing shmem.c's SetPageDirtys by set_page_dirtys, with
> the idea that its "if (!PageDirty)" is good to avoid setting cacheline
> dirty.  Then Spectre changed the balance, so now I'd rather avoid the
> indirect function call, and go with your SetPageDirty.  But the bdi
> flags are such that they do the same thing, and if they ever diverge,
> then we make the necessary changes at that time.

That makes sense.

> > but I don't think i915 cares.  In particular, it uses
> > an untagged iteration instead of searching for PAGECACHE_TAG_DIRTY.
> 
> PAGECACHE_TAG_DIRTY comes with bdi flags that shmem does not use:
> with one exception, every shmem page is always dirty (since its swap
> is freed as soon as the page is moved from swap cache to file cache);
> unless I'm forgetting something (like the tails in my patch!), the
> only exception is a page allocated to a hole on fault (after which
> a write fault will soon set pte dirty).

Ah, of course.  I keep forgetting that shmem doesn't use the
dirty/writeback bits.

> So I didn't quite get what i915 was doing with its set_page_dirty()s:
> but very probably being a good citizen, marking when it exposed the
> page to changes itself, making no assumption of what shmem.c does.
> 
> It would be good to have a conversation with i915 guys about huge pages
> sometime (they forked off their own mount point in i915_gemfs_init(),
> precisely in order to make use of huge pages, but couldn't get them
> working, so removed the option to ask for them, but kept the separate
> mount point.  But not a conversation I'll be responsive on at present.)

Yes, I have a strong feeling the i915 shmem code is cargo-culted.  There
are some very questionable constructs in there.  It's a little unfair to
expect graphics developers to suddenly become experts on the page cache,
so I've taken this as impetus to clean up our APIs a little (e.g. moving
find_lock_entry() to mm/internal.h).
Matthew Wilcox (Oracle) Oct. 2, 2020, 6:37 p.m. UTC | #6
On Sat, Sep 19, 2020 at 05:18:47PM +0100, Matthew Wilcox wrote:
> On Fri, Sep 18, 2020 at 10:44:32PM -0700, Hugh Dickins wrote:
> > It behaves a lot better with this patch in than without it; but you're
> > right, only the head will get written to swap, and the tails left in
> > memory; with dirty cleared, so they may be left indefinitely (I've
> > not yet looked to see when if ever PageDirty might get set later).
> > 
> > Hmm. It may just be a matter of restyling the i915 code with
> > 
> > 		if (!page_mapped(page)) {
> > 			clear_page_dirty_for_io(page);
> > 
> > but I don't want to rush to that conclusion - there might turn
> > out to be a good way of doing it at the shmem_writepage() end, but
> > probably only hacks available.  I'll mull it over: it deserves some
> > thought about what would suit, if a THP arrived here some other way.
> 
> I think the ultimate solution is to do as I have done for iomap and make
> ->writepage handle arbitrary sized pages.  However, I don't know the
> swap I/O path particularly well, and I would rather not learn it just yet.
> 
> How about this for a band-aid until we sort that out properly?  Just mark
> the page as dirty before splitting it so subsequent iterations see the
> subpages as dirty.  Arguably, we should use set_page_dirty() instead of
> SetPageDirty, but I don't think i915 cares.  In particular, it uses
> an untagged iteration instead of searching for PAGECACHE_TAG_DIRTY.
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 271548ca20f3..6231207ab1eb 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1362,8 +1362,21 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>  	swp_entry_t swap;
>  	pgoff_t index;
>  
> -	VM_BUG_ON_PAGE(PageCompound(page), page);
>  	BUG_ON(!PageLocked(page));
> +
> +	/*
> +	 * If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "force",
> +	 * then drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
> +	 * and its shmem_writeback() needs them to be split when swapping.
> +	 */
> +	if (PageTransCompound(page)) {
> +		/* Ensure the subpages are still dirty */
> +		SetPageDirty(page);
> +		if (split_huge_page(page) < 0)
> +			goto redirty;
> +		ClearPageDirty(page);
> +	}
> +
>  	mapping = page->mapping;
>  	index = page->index;
>  	inode = mapping->host;

It turns out that I have an entirely different reason for wanting
->writepage to handle an unsplit page.  In vmscan.c:shrink_page_list(),
we currently try to split file-backed THPs.  This always fails for XFS
file-backed THPs because they have page_private set which increments
the refcount by 1.  And so we OOM when the page cache is full of XFS
THPs.  I've been running successfully for a few days with this patch:

@@ -1271,10 +1271,6 @@ static unsigned int shrink_page_list(struct list_head *page_list,
                                /* Adding to swap updated mapping */
                                mapping = page_mapping(page);
                        }
-               } else if (unlikely(PageTransHuge(page))) {
-                       /* Split file THP */
-                       if (split_huge_page_to_list(page, page_list))
-                               goto keep_locked;
                }
 
                /*


Kirill points out that this will probably make shmem unhappy (it's
possible that said pages will get split anyway if they're mapped
because we pass TTU_SPLIT_HUGE_PMD into try_to_unmap()), but if
they're (a) Dirty, (b) !mapped, we'll call pageout() which calls
->writepage().

The patch above is probably not exactly the right solution for this
case, since pageout() calls writepage only once, not once for each
sub-page.  This is hard to write a cute patch for because the
pages get unlocked by split_huge_page().  I think I'm going to have
to learn about the swap path, unless someone can save me from that.
Huang, Ying Oct. 9, 2020, 8:14 a.m. UTC | #7
Hi, Matthew,

Sorry for the late reply.  I have just come back from a long holiday.

Matthew Wilcox <willy@infradead.org> writes:

> On Sat, Sep 19, 2020 at 05:18:47PM +0100, Matthew Wilcox wrote:
>> [...]
>
> It turns out that I have an entirely different reason for wanting
> ->writepage to handle an unsplit page.  In vmscan.c:shrink_page_list(),
> we currently try to split file-backed THPs.  This always fails for XFS
> file-backed THPs because they have page_private set which increments
> the refcount by 1.  And so we OOM when the page cache is full of XFS
> THPs.  I've been running successfully for a few days with this patch:
>
> @@ -1271,10 +1271,6 @@ static unsigned int shrink_page_list(struct list_head *page_list,
>                                 /* Adding to swap updated mapping */
>                                 mapping = page_mapping(page);
>                         }
> -               } else if (unlikely(PageTransHuge(page))) {
> -                       /* Split file THP */
> -                       if (split_huge_page_to_list(page, page_list))
> -                               goto keep_locked;
>                 }
>  
>                 /*
>
>
> Kirill points out that this will probably make shmem unhappy (it's
> possible that said pages will get split anyway if they're mapped
> because we pass TTU_SPLIT_HUGE_PMD into try_to_unmap()), but if
> they're (a) Dirty, (b) !mapped, we'll call pageout() which calls
> ->writepage().

We may distinguish the shmem THPs from the XFS file cache THPs via
PageSwapBacked()?

Best Regards,
Huang, Ying

Matthew Wilcox (Oracle) Oct. 10, 2020, 3:32 p.m. UTC | #8
On Fri, Oct 09, 2020 at 04:14:25PM +0800, Huang, Ying wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> > On Sat, Sep 19, 2020 at 05:18:47PM +0100, Matthew Wilcox wrote:
> > [...]
> > Kirill points out that this will probably make shmem unhappy (it's
> > possible that said pages will get split anyway if they're mapped
> > because we pass TTU_SPLIT_HUGE_PMD into try_to_unmap()), but if
> > they're (a) Dirty, (b) !mapped, we'll call pageout() which calls
> > ->writepage().
> 
> We may distinguish the shmem THPs from the XFS file cache THPs via
> PageSwapBacked()?

Yes, we _can_, but we now have two reasons for wanting to be able to write
THPs to swap without splitting them.  Another thing this solves is that,
in my tree, we don't allocate the bottom layer of the XArray for THPs.
So when we split, we have to allocate memory to store the split pages,
and it seems like a bad idea to allocate memory in order to free memory,
particularly when we don't have to.
Huang, Ying Oct. 12, 2020, 2:01 a.m. UTC | #9
Matthew Wilcox <willy@infradead.org> writes:

> On Fri, Oct 09, 2020 at 04:14:25PM +0800, Huang, Ying wrote:
>> Matthew Wilcox <willy@infradead.org> writes:
>> [...]
>> 
>> We may distinguish the shmem THPs from the XFS file cache THPs via
>> PageSwapBacked()?
>
> Yes, we _can_, but we now have two reasons for wanting to be able to write
> THPs to swap without splitting them.  Another thing this solves is that,
> in my tree, we don't allocate the bottom layer of the XArray for THPs.
> So when we split, we have to allocate memory to store the split pages,
> and it seems like a bad idea to allocate memory in order to free memory,
> particularly when we don't have to.

I am afraid we cannot avoid allocating memory during swapping, because
anonymous pages (strictly, PageSwapBacked() pages) are put into the swap
cache (a special page cache) during page reclaim.  This means we need to
allocate XArray nodes for them.

And we cannot guarantee that large contiguous free space is always
available in the swap device to accommodate the THP as a whole, so we
need to be prepared to split the THP anyway.

Things may be different for the page cache.

Best Regards,
Huang, Ying

Patch

--- a/mm/shmem.c~shmem-shmem_writepage-split-unlikely-i915-thp
+++ a/mm/shmem.c
@@ -1362,7 +1362,15 @@  static int shmem_writepage(struct page *
 	swp_entry_t swap;
 	pgoff_t index;
 
-	VM_BUG_ON_PAGE(PageCompound(page), page);
+	/*
+	 * If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "force",
+	 * then drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
+	 * and its shmem_writeback() needs them to be split when swapping.
+	 */
+	if (PageTransCompound(page))
+		if (split_huge_page(page) < 0)
+			goto redirty;
+
 	BUG_ON(!PageLocked(page));
 	mapping = page->mapping;
 	index = page->index;