fs: buffer: check huge page size instead of single page for invalidatepage

Message ID 20210917205731.262693-1-shy828301@gmail.com (mailing list archive)
State New
Series fs: buffer: check huge page size instead of single page for invalidatepage

Commit Message

Yang Shi Sept. 17, 2021, 8:57 p.m. UTC
Hao Sun reported the below BUG on a v5.15-rc1 kernel:

kernel BUG at fs/buffer.c:1510!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 5.14.0+ #15
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
Workqueue: events delayed_fput
RIP: 0010:block_invalidatepage+0x27f/0x2a0 -origin/fs/buffer.c:1510
Code: ff ff e8 b4 07 d7 ff b9 02 00 00 00 be 02 00 00 00 4c 89 ff 48
c7 c2 40 4e 25 84 e8 2b c2 c4 02 e9 c9 fe ff ff e8 91 07 d7 ff <0f> 0b
e8 8a 07 d7 ff 0f 0b e8 83 07 d7 ff 48 8d 5d ff e9 57 ff ff
RSP: 0018:ffffc9000065bb60 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffffea0000670000 RCX: 0000000000000000
RDX: ffff8880097fa240 RSI: ffffffff81608a9f RDI: ffffea0000670000
RBP: ffffea0000670000 R08: 0000000000000001 R09: 0000000000000000
R10: ffffc9000065b9f8 R11: 0000000000000003 R12: ffffffff81608820
R13: ffffc9000065bc68 R14: 0000000000000000 R15: ffffc9000065bbf0
FS:  0000000000000000(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f4aef93fb08 CR3: 0000000108cf2000 CR4: 0000000000750ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 do_invalidatepage -origin/mm/truncate.c:157 [inline]
 truncate_cleanup_page+0x15c/0x280 -origin/mm/truncate.c:176
 truncate_inode_pages_range+0x169/0xc30 -origin/mm/truncate.c:325
 kill_bdev.isra.29+0x28/0x30
 blkdev_flush_mapping+0x4c/0x130 -origin/block/bdev.c:658
 blkdev_put_whole+0x54/0x60 -origin/block/bdev.c:689
 blkdev_put+0x6f/0x210 -origin/block/bdev.c:953
 blkdev_close+0x25/0x30 -origin/block/fops.c:459
 __fput+0xdf/0x380 -origin/fs/file_table.c:280
 delayed_fput+0x25/0x40 -origin/fs/file_table.c:308
 process_one_work+0x359/0x850 -origin/kernel/workqueue.c:2297
 worker_thread+0x41/0x4d0 -origin/kernel/workqueue.c:2444
 kthread+0x178/0x1b0 -origin/kernel/kthread.c:319
 ret_from_fork+0x1f/0x30 -origin/arch/x86/entry/entry_64.S:295
Modules linked in:
Dumping ftrace buffer:
   (ftrace buffer empty)
---[ end trace 9dbb8f58f2109f10 ]---
RIP: 0010:block_invalidatepage+0x27f/0x2a0 -origin/fs/buffer.c:1510
Code: ff ff e8 b4 07 d7 ff b9 02 00 00 00 be 02 00 00 00 4c 89 ff 48
c7 c2 40 4e 25 84 e8 2b c2 c4 02 e9 c9 fe ff ff e8 91 07 d7 ff <0f> 0b
e8 8a 07 d7 ff 0f 0b e8 83 07 d7 ff 48 8d 5d ff e9 57 ff ff
RSP: 0018:ffffc9000065bb60 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffffea0000670000 RCX: 0000000000000000
RDX: ffff8880097fa240 RSI: ffffffff81608a9f RDI: ffffea0000670000
RBP: ffffea0000670000 R08: 0000000000000001 R09: 0000000000000000
R10: ffffc9000065b9f8 R11: 0000000000000003 R12: ffffffff81608820
R13: ffffc9000065bc68 R14: 0000000000000000 R15: ffffc9000065bbf0
FS:  0000000000000000(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ff98674f000 CR3: 0000000106b2e000 CR4: 0000000000750ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554

Debugging showed that the page passed to ->invalidatepage() is a huge
page and the length is the huge page size rather than a single page,
due to the read-only FS THP support.  But block_invalidatepage() throws
a BUG if the length is greater than a single page.

However, there is actually a bigger problem in ->invalidatepage().  None
of the implementations are THP aware; they all hardcode PAGE_SIZE.  Some,
like block_invalidatepage(), trigger BUG(), while others just return an
error if the length is greater than PAGE_SIZE.

Converting PAGE_SIZE to thp_size() is not enough on its own since the
actual invalidation code still assumes a single page is passed in.
Because PG_private is tracked per subpage, multiple subpages of a THP
may have private data attached, and the extra refcount pins from those
subpages may prevent the THP from being split and reclaimed.
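
Just as an illustration (a hypothetical helper, not part of this patch,
assuming the usual mm helpers thp_nr_pages() and page_has_private()),
counting the subpages that still carry private data shows why a single
call on the head page cannot drop all of the extra references:

/* Illustrative only: PG_private is tracked per subpage of the THP. */
static int nr_subpages_with_private(struct page *head)
{
	int i, nr = 0;

	for (i = 0; i < thp_nr_pages(head); i++)
		if (page_has_private(head + i))
			nr++;	/* each one holds an extra page reference */

	return nr;
}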

A complete fix is not trivial and involves how huge pages in the page
cache are handled in general, so the scope of this patch is limited to
closing the BUG for now.

Fixes: eb6ecbed0aa2 ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs")
Reported-by: Hao Sun <sunhao.th@gmail.com>
Tested-by: Hao Sun <sunhao.th@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 fs/buffer.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Yang Shi Sept. 18, 2021, 12:07 a.m. UTC | #1
Cc to the reporter. Sorry for missing Hao Sun in the first place.

On Fri, Sep 17, 2021 at 1:57 PM Yang Shi <shy828301@gmail.com> wrote:
>
> Hao Sun reported the below BUG on v5.15-rc1 kernel:
>
> kernel BUG at fs/buffer.c:1510!
> invalid opcode: 0000 [#1] PREEMPT SMP
> CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 5.14.0+ #15
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
> Workqueue: events delayed_fput
> RIP: 0010:block_invalidatepage+0x27f/0x2a0 -origin/fs/buffer.c:1510
> Code: ff ff e8 b4 07 d7 ff b9 02 00 00 00 be 02 00 00 00 4c 89 ff 48
> c7 c2 40 4e 25 84 e8 2b c2 c4 02 e9 c9 fe ff ff e8 91 07 d7 ff <0f> 0b
> e8 8a 07 d7 ff 0f 0b e8 83 07 d7 ff 48 8d 5d ff e9 57 ff ff
> RSP: 0018:ffffc9000065bb60 EFLAGS: 00010293
> RAX: 0000000000000000 RBX: ffffea0000670000 RCX: 0000000000000000
> RDX: ffff8880097fa240 RSI: ffffffff81608a9f RDI: ffffea0000670000
> RBP: ffffea0000670000 R08: 0000000000000001 R09: 0000000000000000
> R10: ffffc9000065b9f8 R11: 0000000000000003 R12: ffffffff81608820
> R13: ffffc9000065bc68 R14: 0000000000000000 R15: ffffc9000065bbf0
> FS:  0000000000000000(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f4aef93fb08 CR3: 0000000108cf2000 CR4: 0000000000750ef0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> PKRU: 55555554
> Call Trace:
>  do_invalidatepage -origin/mm/truncate.c:157 [inline]
>  truncate_cleanup_page+0x15c/0x280 -origin/mm/truncate.c:176
>  truncate_inode_pages_range+0x169/0xc30 -origin/mm/truncate.c:325
>  kill_bdev.isra.29+0x28/0x30
>  blkdev_flush_mapping+0x4c/0x130 -origin/block/bdev.c:658
>  blkdev_put_whole+0x54/0x60 -origin/block/bdev.c:689
>  blkdev_put+0x6f/0x210 -origin/block/bdev.c:953
>  blkdev_close+0x25/0x30 -origin/block/fops.c:459
>  __fput+0xdf/0x380 -origin/fs/file_table.c:280
>  delayed_fput+0x25/0x40 -origin/fs/file_table.c:308
>  process_one_work+0x359/0x850 -origin/kernel/workqueue.c:2297
>  worker_thread+0x41/0x4d0 -origin/kernel/workqueue.c:2444
>  kthread+0x178/0x1b0 -origin/kernel/kthread.c:319
>  ret_from_fork+0x1f/0x30 -origin/arch/x86/entry/entry_64.S:295
> Modules linked in:
> Dumping ftrace buffer:
>    (ftrace buffer empty)
> ---[ end trace 9dbb8f58f2109f10 ]---
> RIP: 0010:block_invalidatepage+0x27f/0x2a0 -origin/fs/buffer.c:1510
> Code: ff ff e8 b4 07 d7 ff b9 02 00 00 00 be 02 00 00 00 4c 89 ff 48
> c7 c2 40 4e 25 84 e8 2b c2 c4 02 e9 c9 fe ff ff e8 91 07 d7 ff <0f> 0b
> e8 8a 07 d7 ff 0f 0b e8 83 07 d7 ff 48 8d 5d ff e9 57 ff ff
> RSP: 0018:ffffc9000065bb60 EFLAGS: 00010293
> RAX: 0000000000000000 RBX: ffffea0000670000 RCX: 0000000000000000
> RDX: ffff8880097fa240 RSI: ffffffff81608a9f RDI: ffffea0000670000
> RBP: ffffea0000670000 R08: 0000000000000001 R09: 0000000000000000
> R10: ffffc9000065b9f8 R11: 0000000000000003 R12: ffffffff81608820
> R13: ffffc9000065bc68 R14: 0000000000000000 R15: ffffc9000065bbf0
> FS:  0000000000000000(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007ff98674f000 CR3: 0000000106b2e000 CR4: 0000000000750ef0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> PKRU: 55555554
>
> The debugging showed the page passed to invalidatepage is a huge page
> and the length is the size of huge page instead of single page due to
> read only FS THP support.  But block_invalidatepage() would throw BUG if
> the size is greater than single page.
>
> However there is actually a bigger problem in invalidatepage().  All the
> implementations are *NOT* THP aware and hardcoded PAGE_SIZE.  Some triggers
> BUG(), like block_invalidatepage(), some just returns error if length is
> greater than PAGE_SIZE.
>
> Converting PAGE_SIZE to thp_size() actually is not enough since the actual
> invalidation part just assumes single page is passed in.  Since other subpages
> may have private too because PG_private is per subpage so there may be
> multiple subpages have private.  This may prevent the THP from splitting
> and reclaiming since the extra refcount pins from private of subpages.
>
> The complete fix seems not trivial and involve how to deal with huge
> page in page cache.  So the scope of this patch is limited to close the
> BUG at the moment.
>
> Fixes: eb6ecbed0aa2 ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs")
> Reported-by: Hao Sun <sunhao.th@gmail.com>
> Tested-by: Hao Sun <sunhao.th@gmail.com>
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  fs/buffer.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index ab7573d72dd7..4bcb54c4d1be 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1507,7 +1507,7 @@ void block_invalidatepage(struct page *page, unsigned int offset,
>         /*
>          * Check for overflow
>          */
> -       BUG_ON(stop > PAGE_SIZE || stop < length);
> +       BUG_ON(stop > thp_size(page) || stop < length);
>
>         head = page_buffers(page);
>         bh = head;
> @@ -1535,7 +1535,7 @@ void block_invalidatepage(struct page *page, unsigned int offset,
>          * The get_block cached value has been unconditionally invalidated,
>          * so real IO is not possible anymore.
>          */
> -       if (length == PAGE_SIZE)
> +       if (length >= PAGE_SIZE)
>                 try_to_release_page(page, 0);
>  out:
>         return;
> --
> 2.26.2
>
Matthew Wilcox Sept. 19, 2021, 2:40 p.m. UTC | #2
On Fri, Sep 17, 2021 at 05:07:03PM -0700, Yang Shi wrote:
> > The debugging showed the page passed to invalidatepage is a huge page
> > and the length is the size of huge page instead of single page due to
> > read only FS THP support.  But block_invalidatepage() would throw BUG if
> > the size is greater than single page.

Things have already gone wrong before we get to this point.  See
do_dentry_open().  You aren't supposed to be able to get a writable file
descriptor on a file which has had huge pages added to the page cache
without the filesystem's knowledge.  That's the problem that needs to
be fixed.
Yang Shi Sept. 20, 2021, 9:23 p.m. UTC | #3
On Sun, Sep 19, 2021 at 7:41 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Sep 17, 2021 at 05:07:03PM -0700, Yang Shi wrote:
> > > The debugging showed the page passed to invalidatepage is a huge page
> > > and the length is the size of huge page instead of single page due to
> > > read only FS THP support.  But block_invalidatepage() would throw BUG if
> > > the size is greater than single page.
>
> Things have already gone wrong before we get to this point.  See
> do_dentry_open().  You aren't supposed to be able to get a writable file
> descriptor on a file which has had huge pages added to the page cache
> without the filesystem's knowledge.  That's the problem that needs to
> be fixed.

I don't quite understand your point here. Do you mean do_dentry_open()
should fail for such cases instead of truncating the page cache?
Matthew Wilcox Sept. 20, 2021, 9:50 p.m. UTC | #4
On Mon, Sep 20, 2021 at 02:23:41PM -0700, Yang Shi wrote:
> On Sun, Sep 19, 2021 at 7:41 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, Sep 17, 2021 at 05:07:03PM -0700, Yang Shi wrote:
> > > > The debugging showed the page passed to invalidatepage is a huge page
> > > > and the length is the size of huge page instead of single page due to
> > > > read only FS THP support.  But block_invalidatepage() would throw BUG if
> > > > the size is greater than single page.
> >
> > Things have already gone wrong before we get to this point.  See
> > do_dentry_open().  You aren't supposed to be able to get a writable file
> > descriptor on a file which has had huge pages added to the page cache
> > without the filesystem's knowledge.  That's the problem that needs to
> > be fixed.
> 
> I don't quite understand your point here. Do you mean do_dentry_open()
> should fail for such cases instead of truncating the page cache?

No, do_dentry_open() should have truncated the page cache when it was
called and found that there were THPs in the cache.  Then khugepaged
should see that someone has the file open for write and decline to create
new THPs.  So it shouldn't be possible to get here with THPs in the cache.
Yang Shi Sept. 20, 2021, 10:35 p.m. UTC | #5
On Mon, Sep 20, 2021 at 2:50 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Sep 20, 2021 at 02:23:41PM -0700, Yang Shi wrote:
> > On Sun, Sep 19, 2021 at 7:41 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Fri, Sep 17, 2021 at 05:07:03PM -0700, Yang Shi wrote:
> > > > > The debugging showed the page passed to invalidatepage is a huge page
> > > > > and the length is the size of huge page instead of single page due to
> > > > > read only FS THP support.  But block_invalidatepage() would throw BUG if
> > > > > the size is greater than single page.
> > >
> > > Things have already gone wrong before we get to this point.  See
> > > do_dentry_open().  You aren't supposed to be able to get a writable file
> > > descriptor on a file which has had huge pages added to the page cache
> > > without the filesystem's knowledge.  That's the problem that needs to
> > > be fixed.
> >
> > I don't quite understand your point here. Do you mean do_dentry_open()
> > should fail for such cases instead of truncating the page cache?
>
> No, do_dentry_open() should have truncated the page cache when it was
> called and found that there were THPs in the cache.  Then khugepaged
> should see that someone has the file open for write and decline to create
> new THPs.  So it shouldn't be possible to get here with THPs in the cache.

AFAICT, it does so.

In do_dentry_open():
        /*
         * XXX: Huge page cache doesn't support writing yet. Drop all page
         * cache for this file before processing writes.
         */
        if (f->f_mode & FMODE_WRITE) {
                /*
                 * Paired with smp_mb() in collapse_file() to ensure nr_thps
                 * is up to date and the update to i_writecount by
                 * get_write_access() is visible. Ensures subsequent insertion
                 * of THPs into the page cache will fail.
                 */
                smp_mb();
                if (filemap_nr_thps(inode->i_mapping))
                        truncate_pagecache(inode, 0);
        }


In khugepaged:
                filemap_nr_thps_inc(mapping);
                /*
                 * Paired with smp_mb() in do_dentry_open() to ensure
                 * i_writecount is up to date and the update to nr_thps is
                 * visible. Ensures the page cache will be truncated if the
                 * file is opened writable.
                 */
                smp_mb();
                if (inode_is_open_for_write(mapping->host)) {
                        result = SCAN_FAIL;
                        __mod_lruvec_page_state(new_page, NR_FILE_THPS, -nr);
                        filemap_nr_thps_dec(mapping);
                        goto xa_locked;
                }

But I'm not quite sure if there is any race condition.
Yang Shi Oct. 11, 2021, 7:57 p.m. UTC | #6
On Mon, Sep 20, 2021 at 3:35 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Mon, Sep 20, 2021 at 2:50 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Sep 20, 2021 at 02:23:41PM -0700, Yang Shi wrote:
> > > On Sun, Sep 19, 2021 at 7:41 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Fri, Sep 17, 2021 at 05:07:03PM -0700, Yang Shi wrote:
> > > > > > The debugging showed the page passed to invalidatepage is a huge page
> > > > > > and the length is the size of huge page instead of single page due to
> > > > > > read only FS THP support.  But block_invalidatepage() would throw BUG if
> > > > > > the size is greater than single page.
> > > >
> > > > Things have already gone wrong before we get to this point.  See
> > > > do_dentry_open().  You aren't supposed to be able to get a writable file
> > > > descriptor on a file which has had huge pages added to the page cache
> > > > without the filesystem's knowledge.  That's the problem that needs to
> > > > be fixed.
> > >
> > > I don't quite understand your point here. Do you mean do_dentry_open()
> > > should fail for such cases instead of truncating the page cache?
> >
> > No, do_dentry_open() should have truncated the page cache when it was
> > called and found that there were THPs in the cache.  Then khugepaged
> > should see that someone has the file open for write and decline to create
> > new THPs.  So it shouldn't be possible to get here with THPs in the cache.
>

I think Hugh's patch that skips special files
(https://lore.kernel.org/linux-mm/a07564a3-b2fc-9ffe-3ace-3f276075ea5c@google.com/)
could fix this specific BUG report, and it seems like a more proper fix
for it.

However, it still doesn't make too much sense to have thp_size passed
to do_invalidatepage(), then have PAGE_SIZE hardcoded in a BUG
assertion IMHO. So it seems this patch is still useful because
block_invalidatepage() is called by a few filesystems as well, for
example, ext4. Or I'm wondering whether we should call
do_invalidatepage() for each subpage of THP in truncate_cleanup_page()
since private is for each subpage IIUC.

> AFAICT, it does so.
>
> In do_dentry_open():
> /*
>          * XXX: Huge page cache doesn't support writing yet. Drop all page
>          * cache for this file before processing writes.
>          */
>         if (f->f_mode & FMODE_WRITE) {
>                 /*
>                  * Paired with smp_mb() in collapse_file() to ensure nr_thps
>                  * is up to date and the update to i_writecount by
>                  * get_write_access() is visible. Ensures subsequent insertion
>                  * of THPs into the page cache will fail.
>                  */
>                 smp_mb();
>                 if (filemap_nr_thps(inode->i_mapping))
>                         truncate_pagecache(inode, 0);
>         }
>
>
> In khugepaged:
> filemap_nr_thps_inc(mapping);
>                 /*
>                  * Paired with smp_mb() in do_dentry_open() to ensure
>                  * i_writecount is up to date and the update to nr_thps is
>                  * visible. Ensures the page cache will be truncated if the
>                  * file is opened writable.
>                  */
>                 smp_mb();
>                 if (inode_is_open_for_write(mapping->host)) {
>                         result = SCAN_FAIL;
>                         __mod_lruvec_page_state(new_page, NR_FILE_THPS, -nr);
>                         filemap_nr_thps_dec(mapping);
>                         goto xa_locked;
>                 }
>
> But I'm not quite sure if there is any race condition.
Yang Shi Oct. 20, 2021, 11:38 p.m. UTC | #7
On Mon, Oct 11, 2021 at 12:57 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Mon, Sep 20, 2021 at 3:35 PM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Mon, Sep 20, 2021 at 2:50 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, Sep 20, 2021 at 02:23:41PM -0700, Yang Shi wrote:
> > > > On Sun, Sep 19, 2021 at 7:41 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Fri, Sep 17, 2021 at 05:07:03PM -0700, Yang Shi wrote:
> > > > > > > The debugging showed the page passed to invalidatepage is a huge page
> > > > > > > and the length is the size of huge page instead of single page due to
> > > > > > > read only FS THP support.  But block_invalidatepage() would throw BUG if
> > > > > > > the size is greater than single page.
> > > > >
> > > > > Things have already gone wrong before we get to this point.  See
> > > > > do_dentry_open().  You aren't supposed to be able to get a writable file
> > > > > descriptor on a file which has had huge pages added to the page cache
> > > > > without the filesystem's knowledge.  That's the problem that needs to
> > > > > be fixed.
> > > >
> > > > I don't quite understand your point here. Do you mean do_dentry_open()
> > > > should fail for such cases instead of truncating the page cache?
> > >
> > > No, do_dentry_open() should have truncated the page cache when it was
> > > called and found that there were THPs in the cache.  Then khugepaged
> > > should see that someone has the file open for write and decline to create
> > > new THPs.  So it shouldn't be possible to get here with THPs in the cache.
> >
>
> I think Hugh's skipping special file patch
> (https://lore.kernel.org/linux-mm/a07564a3-b2fc-9ffe-3ace-3f276075ea5c@google.com/)
> could fix this specific BUG report and seems like a more proper fix to
> this.
>
> However, it still doesn't make too much sense to have thp_size passed
> to do_invalidatepage(), then have PAGE_SIZE hardcoded in a BUG
> assertion IMHO. So it seems this patch is still useful because
> block_invalidatepage() is called by a few filesystems as well, for
> example, ext4. Or I'm wondering whether we should call
> do_invalidatepage() for each subpage of THP in truncate_cleanup_page()
> since private is for each subpage IIUC.

Seems no interest?

Anyway the more I was staring at the code the more I thought calling
do_invalidatepage() for each subpage made more sense. So, something
like the below makes sense?

diff --git a/mm/truncate.c b/mm/truncate.c
index 714eaf19821d..9048f498cd02 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -169,11 +169,16 @@ void do_invalidatepage(struct page *page, unsigned int offset,
  */
 static void truncate_cleanup_page(struct page *page)
 {
+       int nr = thp_nr_pages(page);
+       int i;
+
        if (page_mapped(page))
                unmap_mapping_page(page);

-       if (page_has_private(page))
-               do_invalidatepage(page, 0, thp_size(page));
+       for (i = 0; i < nr; i++) {
+               if (page_has_private(page + i))
+                       do_invalidatepage(page + i, 0, PAGE_SIZE);
+       }

        /*
         * Some filesystems seem to re-dirty the page even after

>
> > AFAICT, it does so.
> >
> > In do_dentry_open():
> > /*
> >          * XXX: Huge page cache doesn't support writing yet. Drop all page
> >          * cache for this file before processing writes.
> >          */
> >         if (f->f_mode & FMODE_WRITE) {
> >                 /*
> >                  * Paired with smp_mb() in collapse_file() to ensure nr_thps
> >                  * is up to date and the update to i_writecount by
> >                  * get_write_access() is visible. Ensures subsequent insertion
> >                  * of THPs into the page cache will fail.
> >                  */
> >                 smp_mb();
> >                 if (filemap_nr_thps(inode->i_mapping))
> >                         truncate_pagecache(inode, 0);
> >         }
> >
> >
> > In khugepaged:
> > filemap_nr_thps_inc(mapping);
> >                 /*
> >                  * Paired with smp_mb() in do_dentry_open() to ensure
> >                  * i_writecount is up to date and the update to nr_thps is
> >                  * visible. Ensures the page cache will be truncated if the
> >                  * file is opened writable.
> >                  */
> >                 smp_mb();
> >                 if (inode_is_open_for_write(mapping->host)) {
> >                         result = SCAN_FAIL;
> >                         __mod_lruvec_page_state(new_page, NR_FILE_THPS, -nr);
> >                         filemap_nr_thps_dec(mapping);
> >                         goto xa_locked;
> >                 }
> >
> > But I'm not quite sure if there is any race condition.
Matthew Wilcox Oct. 20, 2021, 11:49 p.m. UTC | #8
On Wed, Oct 20, 2021 at 04:38:49PM -0700, Yang Shi wrote:
> > However, it still doesn't make too much sense to have thp_size passed
> > to do_invalidatepage(), then have PAGE_SIZE hardcoded in a BUG
> > assertion IMHO. So it seems this patch is still useful because
> > block_invalidatepage() is called by a few filesystems as well, for
> > example, ext4. Or I'm wondering whether we should call
> > do_invalidatepage() for each subpage of THP in truncate_cleanup_page()
> > since private is for each subpage IIUC.
> 
> Seems no interest?

No.  I have changes in this area as part of the folio patchset (where
we end up converting this to invalidate_folio).  I'm not really
interested in doing anything before that, since this shouldn't be
reachable today.

> Anyway the more I was staring at the code the more I thought calling
> do_invalidatepage() for each subpage made more sense. So, something
> like the below makes sense?

Definitely not.  We want to invalidate the entire folio at once.
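
A rough sketch of that direction (hypothetical, not the actual folio
patchset code; it assumes folio helpers such as folio_size()):

static void folio_invalidate_check(struct folio *folio, size_t offset,
				   size_t length)
{
	size_t stop = length + offset;

	/*
	 * The overflow check scales with the folio, so one call covers
	 * the whole compound page instead of a hardcoded PAGE_SIZE.
	 */
	BUG_ON(stop > folio_size(folio) || stop < length);
}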
Yang Shi Oct. 21, 2021, 12:24 a.m. UTC | #9
On Wed, Oct 20, 2021 at 4:51 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Oct 20, 2021 at 04:38:49PM -0700, Yang Shi wrote:
> > > However, it still doesn't make too much sense to have thp_size passed
> > > to do_invalidatepage(), then have PAGE_SIZE hardcoded in a BUG
> > > assertion IMHO. So it seems this patch is still useful because
> > > block_invalidatepage() is called by a few filesystems as well, for
> > > example, ext4. Or I'm wondering whether we should call
> > > do_invalidatepage() for each subpage of THP in truncate_cleanup_page()
> > > since private is for each subpage IIUC.
> >
> > Seems no interest?
>
> No.  I have changes in this area as part of the folio patchset (where
> we end up converting this to invalidate_folio).  I'm not really
> interested in doing anything before that, since this shouldn't be
> reachable today.

Understood. But this is definitely reachable unless Hugh's patch
(skipping non-regular files) is applied.

>
> > Anyway the more I was staring at the code the more I thought calling
> > do_invalidatepage() for each subpage made more sense. So, something
> > like the below makes sense?
>
> Definitely not.  We want to invalidate the entire folio at once.

I didn't look at the folio patches (at the individual patch level), but
I suppose it still needs to invalidate the buffers for each subpage,
right?
Matthew Wilcox Oct. 21, 2021, 1:36 a.m. UTC | #10
On Wed, Oct 20, 2021 at 05:24:09PM -0700, Yang Shi wrote:
> On Wed, Oct 20, 2021 at 4:51 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Wed, Oct 20, 2021 at 04:38:49PM -0700, Yang Shi wrote:
> > > > However, it still doesn't make too much sense to have thp_size passed
> > > > to do_invalidatepage(), then have PAGE_SIZE hardcoded in a BUG
> > > > assertion IMHO. So it seems this patch is still useful because
> > > > block_invalidatepage() is called by a few filesystems as well, for
> > > > example, ext4. Or I'm wondering whether we should call
> > > > do_invalidatepage() for each subpage of THP in truncate_cleanup_page()
> > > > since private is for each subpage IIUC.
> > >
> > > Seems no interest?
> >
> > No.  I have changes in this area as part of the folio patchset (where
> > we end up converting this to invalidate_folio).  I'm not really
> > interested in doing anything before that, since this shouldn't be
> > reachable today.
> 
> Understood. But this is definitely reachable unless Hugh's patch
> (skipping non-regular file) is applied.

Right.  That's the bug that needs to be fixed.  Seeing THPs here is
a symptom.  Getting rid of the error just makes the problem silent.

> > > Anyway the more I was staring at the code the more I thought calling
> > > do_invalidatepage() for each subpage made more sense. So, something
> > > like the below makes sense?
> >
> > Definitely not.  We want to invalidate the entire folio at once.
> 
> I didn't look at the folio patch (on each individual patch level), but
> I'm supposed it still needs to invalidate buffer for each subpage,
> right?

No.  Buffers are tracked for the entire folio, not on each subpage.
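
As a minimal sketch of what that means (names here are illustrative,
assuming a folio_buffers() helper along these lines, not the patchset's
actual code):

static void walk_folio_buffers(struct folio *folio)
{
	/* A single buffer list hangs off the folio's private data... */
	struct buffer_head *head = folio_buffers(folio);
	struct buffer_head *bh = head;

	if (!head)
		return;

	/* ...so the whole folio is walked once, not once per subpage. */
	do {
		clear_buffer_mapped(bh);	/* drop the disk mapping */
		bh = bh->b_this_page;
	} while (bh != head);
}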

Actually, the filesystem people currently believe that the O(n^2) nature
of buffer-head handling mean that it's a bad idea to create multi-page
folios for bufferhead based filesystems (which includes block devices),
and the correct path forward is to migrate away from buffer-heads.
That may change, but it's the current plan.

Patch

diff --git a/fs/buffer.c b/fs/buffer.c
index ab7573d72dd7..4bcb54c4d1be 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1507,7 +1507,7 @@  void block_invalidatepage(struct page *page, unsigned int offset,
 	/*
 	 * Check for overflow
 	 */
-	BUG_ON(stop > PAGE_SIZE || stop < length);
+	BUG_ON(stop > thp_size(page) || stop < length);
 
 	head = page_buffers(page);
 	bh = head;
@@ -1535,7 +1535,7 @@  void block_invalidatepage(struct page *page, unsigned int offset,
 	 * The get_block cached value has been unconditionally invalidated,
 	 * so real IO is not possible anymore.
 	 */
-	if (length == PAGE_SIZE)
+	if (length >= PAGE_SIZE)
 		try_to_release_page(page, 0);
 out:
 	return;