diff mbox series

iomap: Set all uptodate bits for an Uptodate page

Message ID 20200924125608.31231-1-willy@infradead.org (mailing list archive)
State New, archived
Headers show
Series iomap: Set all uptodate bits for an Uptodate page | expand

Commit Message

Matthew Wilcox (Oracle) Sept. 24, 2020, 12:56 p.m. UTC
For filesystems with block size < page size, we need to set all the
per-block uptodate bits if the page was already uptodate at the time
we create the per-block metadata.  This can happen if the page is
invalidated (eg by a write to drop_caches) but ultimately not removed
from the page cache.

This is a data corruption issue as page writeback skips blocks which
are marked !uptodate.

Fixes: 9dc55f1389f9 ("iomap: add support for sub-pagesize buffered I/O without buffer heads")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reported-by: Qian Cai <cai@redhat.com>
Cc: Brian Foster <bfoster@redhat.com>
---
 fs/iomap/buffered-io.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Brian Foster Sept. 24, 2020, 1:12 p.m. UTC | #1
On Thu, Sep 24, 2020 at 01:56:08PM +0100, Matthew Wilcox (Oracle) wrote:
> For filesystems with block size < page size, we need to set all the
> per-block uptodate bits if the page was already uptodate at the time
> we create the per-block metadata.  This can happen if the page is
> invalidated (eg by a write to drop_caches) but ultimately not removed
> from the page cache.
> 
> This is a data corruption issue as page writeback skips blocks which
> are marked !uptodate.
> 
> Fixes: 9dc55f1389f9 ("iomap: add support for sub-pagesize buffered I/O without buffer heads")
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Reported-by: Qian Cai <cai@redhat.com>
> Cc: Brian Foster <bfoster@redhat.com>
> ---
>  fs/iomap/buffered-io.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 8b6cca7e34e4..8180061b9e16 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -60,6 +60,8 @@ iomap_page_create(struct inode *inode, struct page *page)
>  	iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
>  			GFP_NOFS | __GFP_NOFAIL);
>  	spin_lock_init(&iop->uptodate_lock);
> +	if (PageUptodate(page))
> +		bitmap_fill(iop->uptodate, nr_blocks);

Thanks. Based on my testing of clearing PageUptodate here I suspect this
will similarly prevent the problem, but I'll give this a test
nonetheless. 

I am a little curious why we'd prefer to fill the iop here rather than
just clear the page state if the iop data has been released. If the page
is partially uptodate, then we end up having to re-read the page
anyways, right? OTOH, I guess this behavior is more consistent with page
size == block size filesystems where iop wouldn't exist and we just go
by page state, so perhaps that makes more sense.

Brian

>  	attach_page_private(page, iop);
>  	return iop;
>  }
> -- 
> 2.28.0
>
Matthew Wilcox (Oracle) Sept. 24, 2020, 1:59 p.m. UTC | #2
On Thu, Sep 24, 2020 at 09:12:35AM -0400, Brian Foster wrote:
> On Thu, Sep 24, 2020 at 01:56:08PM +0100, Matthew Wilcox (Oracle) wrote:
> > For filesystems with block size < page size, we need to set all the
> > per-block uptodate bits if the page was already uptodate at the time
> > we create the per-block metadata.  This can happen if the page is
> > invalidated (eg by a write to drop_caches) but ultimately not removed
> > from the page cache.
> > 
> > This is a data corruption issue as page writeback skips blocks which
> > are marked !uptodate.
> 
> Thanks. Based on my testing of clearing PageUptodate here I suspect this
> will similarly prevent the problem, but I'll give this a test
> nonetheless. 
> 
> I am a little curious why we'd prefer to fill the iop here rather than
> just clear the page state if the iop data has been released. If the page
> is partially uptodate, then we end up having to re-read the page
> anyways, right? OTOH, I guess this behavior is more consistent with page
> size == block size filesystems where iop wouldn't exist and we just go
> by page state, so perhaps that makes more sense.

Well, it's _true_ ... the PageUptodate bit means that every byte in this
page is at least as new as every byte on storage.  There's no need to
re-read it, which is what we'll do if we ClearPageUptodate.

My original motivation for this was splitting a THP.  In that case,
we have, let's say, 16 * 4kB pages, and an iop for 64 blocks.  When we
split that 64kB page into 16 4kB pages, we can't afford to allocate 16
iops for them, so we just drop the iop and copy the uptodate state from
the head page to all subpages.

So now we have 16 pages, all marked uptodate (and with valid data) but
no iop.  So we need to create an iop for each page during the writeback
path, and that has to be created with uptodate bits or we'll skip the
entire page.  When I wrote the patch below, I had no idea we could
already get an iop allocated for an uptodate page, or I would have
submitted this patch months ago.

http://git.infradead.org/users/willy/pagecache.git/commitdiff/bc503912d4a9aad4496a4591e9992f0ada47a9c9
Gao Xiang Sept. 24, 2020, 2:47 p.m. UTC | #3
On Thu, Sep 24, 2020 at 02:59:00PM +0100, Matthew Wilcox wrote:
> On Thu, Sep 24, 2020 at 09:12:35AM -0400, Brian Foster wrote:
> > On Thu, Sep 24, 2020 at 01:56:08PM +0100, Matthew Wilcox (Oracle) wrote:
> > > For filesystems with block size < page size, we need to set all the
> > > per-block uptodate bits if the page was already uptodate at the time
> > > we create the per-block metadata.  This can happen if the page is
> > > invalidated (eg by a write to drop_caches) but ultimately not removed
> > > from the page cache.
> > > 
> > > This is a data corruption issue as page writeback skips blocks which
> > > are marked !uptodate.
> > 
> > Thanks. Based on my testing of clearing PageUptodate here I suspect this
> > will similarly prevent the problem, but I'll give this a test
> > nonetheless. 
> > 
> > I am a little curious why we'd prefer to fill the iop here rather than
> > just clear the page state if the iop data has been released. If the page
> > is partially uptodate, then we end up having to re-read the page
> > anyways, right? OTOH, I guess this behavior is more consistent with page
> > size == block size filesystems where iop wouldn't exist and we just go
> > by page state, so perhaps that makes more sense.
> 
> Well, it's _true_ ... the PageUptodate bit means that every byte in this
> page is at least as new as every byte on storage.  There's no need to
> re-read it, which is what we'll do if we ClearPageUptodate.
> 

Agreed, Pagetodate(page) means the whole page content is available now,
see create_page_buffers() -> create_empty_buffers() and try_to_free_buffers()
(much like .releasepage()) in buffer head approach.

Reviewed-by: Gao Xiang <hsiangkao@redhat.com>

Thanks,
Gao Xiang
Sedat Dilek Sept. 24, 2020, 3:08 p.m. UTC | #4
On Thu, Sep 24, 2020 at 2:58 PM Matthew Wilcox (Oracle)
<willy@infradead.org> wrote:
>
> For filesystems with block size < page size, we need to set all the
> per-block uptodate bits if the page was already uptodate at the time
> we create the per-block metadata.  This can happen if the page is
> invalidated (eg by a write to drop_caches) but ultimately not removed
> from the page cache.
>
> This is a data corruption issue as page writeback skips blocks which
> are marked !uptodate.
>
> Fixes: 9dc55f1389f9 ("iomap: add support for sub-pagesize buffered I/O without buffer heads")

This commit is also in Linux v5.9-rc6+ but does not cleanly apply.
Against which Git tree is this patch?
When Linux v5.9-rc6+ is affected, do you have a backport?

- Sedat -

> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Reported-by: Qian Cai <cai@redhat.com>
> Cc: Brian Foster <bfoster@redhat.com>
> ---
>  fs/iomap/buffered-io.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 8b6cca7e34e4..8180061b9e16 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -60,6 +60,8 @@ iomap_page_create(struct inode *inode, struct page *page)
>         iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
>                         GFP_NOFS | __GFP_NOFAIL);
>         spin_lock_init(&iop->uptodate_lock);
> +       if (PageUptodate(page))
> +               bitmap_fill(iop->uptodate, nr_blocks);
>         attach_page_private(page, iop);
>         return iop;
>  }
> --
> 2.28.0
>
Brian Foster Sept. 24, 2020, 3:12 p.m. UTC | #5
On Thu, Sep 24, 2020 at 02:59:00PM +0100, Matthew Wilcox wrote:
> On Thu, Sep 24, 2020 at 09:12:35AM -0400, Brian Foster wrote:
> > On Thu, Sep 24, 2020 at 01:56:08PM +0100, Matthew Wilcox (Oracle) wrote:
> > > For filesystems with block size < page size, we need to set all the
> > > per-block uptodate bits if the page was already uptodate at the time
> > > we create the per-block metadata.  This can happen if the page is
> > > invalidated (eg by a write to drop_caches) but ultimately not removed
> > > from the page cache.
> > > 
> > > This is a data corruption issue as page writeback skips blocks which
> > > are marked !uptodate.
> > 
> > Thanks. Based on my testing of clearing PageUptodate here I suspect this
> > will similarly prevent the problem, but I'll give this a test
> > nonetheless. 
> > 
> > I am a little curious why we'd prefer to fill the iop here rather than
> > just clear the page state if the iop data has been released. If the page
> > is partially uptodate, then we end up having to re-read the page
> > anyways, right? OTOH, I guess this behavior is more consistent with page
> > size == block size filesystems where iop wouldn't exist and we just go
> > by page state, so perhaps that makes more sense.
> 
> Well, it's _true_ ... the PageUptodate bit means that every byte in this
> page is at least as new as every byte on storage.  There's no need to
> re-read it, which is what we'll do if we ClearPageUptodate.
> 

Yes, of course. I'm just noting the inconsistent behavior between a full
and partially uptodate page.

Brian

> My original motivation for this was splitting a THP.  In that case,
> we have, let's say, 16 * 4kB pages, and an iop for 64 blocks.  When we
> split that 64kB page into 16 4kB pages, we can't afford to allocate 16
> iops for them, so we just drop the iop and copy the uptodate state from
> the head page to all subpages.
> 
> So now we have 16 pages, all marked uptodate (and with valid data) but
> no iop.  So we need to create an iop for each page during the writeback
> path, and that has to be created with uptodate bits or we'll skip the
> entire page.  When I wrote the patch below, I had no idea we could
> already get an iop allocated for an uptodate page, or I would have
> submitted this patch months ago.
> 
> http://git.infradead.org/users/willy/pagecache.git/commitdiff/bc503912d4a9aad4496a4591e9992f0ada47a9c9
>
Matthew Wilcox (Oracle) Sept. 24, 2020, 3:15 p.m. UTC | #6
On Thu, Sep 24, 2020 at 05:08:03PM +0200, Sedat Dilek wrote:
> On Thu, Sep 24, 2020 at 2:58 PM Matthew Wilcox (Oracle)
> <willy@infradead.org> wrote:
> >
> > For filesystems with block size < page size, we need to set all the
> > per-block uptodate bits if the page was already uptodate at the time
> > we create the per-block metadata.  This can happen if the page is
> > invalidated (eg by a write to drop_caches) but ultimately not removed
> > from the page cache.
> >
> > This is a data corruption issue as page writeback skips blocks which
> > are marked !uptodate.
> >
> > Fixes: 9dc55f1389f9 ("iomap: add support for sub-pagesize buffered I/O without buffer heads")
> 
> This commit is also in Linux v5.9-rc6+ but does not cleanly apply.
> Against which Git tree is this patch?
> When Linux v5.9-rc6+ is affected, do you have a backport?

This applies to v5.8.  I'll happily backport this for any other kernel
versions.

From 51f85a97ccdd7071e5f95b2ac4e41c12bf4d4176 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Date: Thu, 24 Sep 2020 08:44:56 -0400
Subject: [PATCH] iomap: Set all uptodate bits for an Uptodate page

For filesystems with block size < page size, we need to set all the
per-block uptodate bits if the page was already uptodate at the time
we create the per-block metadata.  This can happen if the page is
invalidated (eg by a write to drop_caches) but ultimately not removed
from the page cache.

This is a data corruption issue as page writeback skips blocks which
are marked !uptodate.

Fixes: 9dc55f1389f9 ("iomap: add support for sub-pagesize buffered I/O without buffer heads")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reported-by: Qian Cai <cai@redhat.com>
Cc: Brian Foster <bfoster@redhat.com>
---
 fs/iomap/buffered-io.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index bcfc288dba3f..810f7dae11d9 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -53,7 +53,10 @@ iomap_page_create(struct inode *inode, struct page *page)
 	atomic_set(&iop->read_count, 0);
 	atomic_set(&iop->write_count, 0);
 	spin_lock_init(&iop->uptodate_lock);
-	bitmap_zero(iop->uptodate, PAGE_SIZE / SECTOR_SIZE);
+	if (PageUptodate(page))
+		bitmap_fill(iop->uptodate, PAGE_SIZE / SECTOR_SIZE);
+	else
+		bitmap_zero(iop->uptodate, PAGE_SIZE / SECTOR_SIZE);
 
 	/*
 	 * migrate_page_move_mapping() assumes that pages with private data have
Sedat Dilek Sept. 24, 2020, 3:21 p.m. UTC | #7
On Thu, Sep 24, 2020 at 5:15 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Sep 24, 2020 at 05:08:03PM +0200, Sedat Dilek wrote:
> > On Thu, Sep 24, 2020 at 2:58 PM Matthew Wilcox (Oracle)
> > <willy@infradead.org> wrote:
> > >
> > > For filesystems with block size < page size, we need to set all the
> > > per-block uptodate bits if the page was already uptodate at the time
> > > we create the per-block metadata.  This can happen if the page is
> > > invalidated (eg by a write to drop_caches) but ultimately not removed
> > > from the page cache.
> > >
> > > This is a data corruption issue as page writeback skips blocks which
> > > are marked !uptodate.
> > >
> > > Fixes: 9dc55f1389f9 ("iomap: add support for sub-pagesize buffered I/O without buffer heads")
> >
> > This commit is also in Linux v5.9-rc6+ but does not cleanly apply.
> > Against which Git tree is this patch?
> > When Linux v5.9-rc6+ is affected, do you have a backport?
>
> This applies to v5.8.  I'll happily backport this for any other kernel
> versions.
>
> From 51f85a97ccdd7071e5f95b2ac4e41c12bf4d4176 Mon Sep 17 00:00:00 2001
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> Date: Thu, 24 Sep 2020 08:44:56 -0400
> Subject: [PATCH] iomap: Set all uptodate bits for an Uptodate page
>
> For filesystems with block size < page size, we need to set all the
> per-block uptodate bits if the page was already uptodate at the time
> we create the per-block metadata.  This can happen if the page is
> invalidated (eg by a write to drop_caches) but ultimately not removed
> from the page cache.
>
> This is a data corruption issue as page writeback skips blocks which
> are marked !uptodate.
>
> Fixes: 9dc55f1389f9 ("iomap: add support for sub-pagesize buffered I/O without buffer heads")
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Reported-by: Qian Cai <cai@redhat.com>
> Cc: Brian Foster <bfoster@redhat.com>
> ---
>  fs/iomap/buffered-io.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index bcfc288dba3f..810f7dae11d9 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -53,7 +53,10 @@ iomap_page_create(struct inode *inode, struct page *page)
>         atomic_set(&iop->read_count, 0);
>         atomic_set(&iop->write_count, 0);
>         spin_lock_init(&iop->uptodate_lock);
> -       bitmap_zero(iop->uptodate, PAGE_SIZE / SECTOR_SIZE);
> +       if (PageUptodate(page))
> +               bitmap_fill(iop->uptodate, PAGE_SIZE / SECTOR_SIZE);
> +       else
> +               bitmap_zero(iop->uptodate, PAGE_SIZE / SECTOR_SIZE);
>
>         /*
>          * migrate_page_move_mapping() assumes that pages with private data have
> --
> 2.28.0
>

Great and thanks.

Can you send out a seperate patch and label it with "PATCH v5.9"?
I run:
$ git format-patch -1 --subject-prefix="PATCH v5.9" --signoff

Normally, I catch patches from any patchwork URL in mbox format.

Thanks.

- Sedat -

[1] https://patchwork.kernel.org/project/linux-fsdevel/list/
[2] https://patchwork.kernel.org/patch/11797229/ <--- XXX: [ PATCH v5.8 ]
Matthew Wilcox (Oracle) Sept. 24, 2020, 3:22 p.m. UTC | #8
On Thu, Sep 24, 2020 at 11:12:59AM -0400, Brian Foster wrote:
> On Thu, Sep 24, 2020 at 02:59:00PM +0100, Matthew Wilcox wrote:
> > On Thu, Sep 24, 2020 at 09:12:35AM -0400, Brian Foster wrote:
> > > On Thu, Sep 24, 2020 at 01:56:08PM +0100, Matthew Wilcox (Oracle) wrote:
> > > > For filesystems with block size < page size, we need to set all the
> > > > per-block uptodate bits if the page was already uptodate at the time
> > > > we create the per-block metadata.  This can happen if the page is
> > > > invalidated (eg by a write to drop_caches) but ultimately not removed
> > > > from the page cache.
> > > > 
> > > > This is a data corruption issue as page writeback skips blocks which
> > > > are marked !uptodate.
> > > 
> > > Thanks. Based on my testing of clearing PageUptodate here I suspect this
> > > will similarly prevent the problem, but I'll give this a test
> > > nonetheless. 
> > > 
> > > I am a little curious why we'd prefer to fill the iop here rather than
> > > just clear the page state if the iop data has been released. If the page
> > > is partially uptodate, then we end up having to re-read the page
> > > anyways, right? OTOH, I guess this behavior is more consistent with page
> > > size == block size filesystems where iop wouldn't exist and we just go
> > > by page state, so perhaps that makes more sense.
> > 
> > Well, it's _true_ ... the PageUptodate bit means that every byte in this
> > page is at least as new as every byte on storage.  There's no need to
> > re-read it, which is what we'll do if we ClearPageUptodate.
> 
> Yes, of course. I'm just noting the inconsistent behavior between a full
> and partially uptodate page.

Heh, well, we have no way of knowing.  We literally just threw away
the information about which blocks are uptodate.  So the best we can
do is work with the single bit we have.  We do know that there are no
dirty blocks left on the page at this point (... maybe we should add a
VM_BUG_ON(!PageUptodate && PageDirty)).

Something we could do is summarise the block uptodate information in
the 32/64 bits of page_private without setting PagePrivate.  That would
cause us to still allocate an iop so we can track reads/writes, but we
might be able to avoid a few reads.

But I don't think it's worth it.  Partially uptodate pages are not what
we should be optimising for; we should try to get & keep pages uptodate.
After all, it's a page cache ;-)
Matthew Wilcox (Oracle) Sept. 24, 2020, 3:27 p.m. UTC | #9
On Thu, Sep 24, 2020 at 05:21:00PM +0200, Sedat Dilek wrote:
> Great and thanks.
> 
> Can you send out a seperate patch and label it with "PATCH v5.9"?
> I run:
> $ git format-patch -1 --subject-prefix="PATCH v5.9" --signoff
> 
> Normally, I catch patches from any patchwork URL in mbox format.

Maybe wait a few hours for people to decide if they like the approach
taken to fix the bug before diving into producing backports?
Sedat Dilek Sept. 24, 2020, 4:19 p.m. UTC | #10
On Thu, Sep 24, 2020 at 5:27 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Sep 24, 2020 at 05:21:00PM +0200, Sedat Dilek wrote:
> > Great and thanks.
> >
> > Can you send out a seperate patch and label it with "PATCH v5.9"?
> > I run:
> > $ git format-patch -1 --subject-prefix="PATCH v5.9" --signoff
> >
> > Normally, I catch patches from any patchwork URL in mbox format.
>
> Maybe wait a few hours for people to decide if they like the approach
> taken to fix the bug before diving into producing backports?

That make sense.

You have a test-case for me?
I have here Linux-Test-Project and FIO available.

- Sedat -
Matthew Wilcox (Oracle) Sept. 24, 2020, 4:36 p.m. UTC | #11
On Thu, Sep 24, 2020 at 06:19:03PM +0200, Sedat Dilek wrote:
> On Thu, Sep 24, 2020 at 5:27 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Thu, Sep 24, 2020 at 05:21:00PM +0200, Sedat Dilek wrote:
> > > Great and thanks.
> > >
> > > Can you send out a seperate patch and label it with "PATCH v5.9"?
> > > I run:
> > > $ git format-patch -1 --subject-prefix="PATCH v5.9" --signoff
> > >
> > > Normally, I catch patches from any patchwork URL in mbox format.
> >
> > Maybe wait a few hours for people to decide if they like the approach
> > taken to fix the bug before diving into producing backports?
> 
> That make sense.
> 
> You have a test-case for me?
> I have here Linux-Test-Project and FIO available.

Qian reported preadv203.c could reproduce it easily on POWER and ARM.
They have 64kB pages, so it's easier to hit.  You need to have a
filesystem with block size < page size to hit the problem.

If you want to check that your test case hits the problem, stick a printk
in iomap_page_create().
Brian Foster Sept. 24, 2020, 5:26 p.m. UTC | #12
On Thu, Sep 24, 2020 at 04:22:11PM +0100, Matthew Wilcox wrote:
> On Thu, Sep 24, 2020 at 11:12:59AM -0400, Brian Foster wrote:
> > On Thu, Sep 24, 2020 at 02:59:00PM +0100, Matthew Wilcox wrote:
> > > On Thu, Sep 24, 2020 at 09:12:35AM -0400, Brian Foster wrote:
> > > > On Thu, Sep 24, 2020 at 01:56:08PM +0100, Matthew Wilcox (Oracle) wrote:
> > > > > For filesystems with block size < page size, we need to set all the
> > > > > per-block uptodate bits if the page was already uptodate at the time
> > > > > we create the per-block metadata.  This can happen if the page is
> > > > > invalidated (eg by a write to drop_caches) but ultimately not removed
> > > > > from the page cache.
> > > > > 
> > > > > This is a data corruption issue as page writeback skips blocks which
> > > > > are marked !uptodate.
> > > > 
> > > > Thanks. Based on my testing of clearing PageUptodate here I suspect this
> > > > will similarly prevent the problem, but I'll give this a test
> > > > nonetheless. 
> > > > 
> > > > I am a little curious why we'd prefer to fill the iop here rather than
> > > > just clear the page state if the iop data has been released. If the page
> > > > is partially uptodate, then we end up having to re-read the page
> > > > anyways, right? OTOH, I guess this behavior is more consistent with page
> > > > size == block size filesystems where iop wouldn't exist and we just go
> > > > by page state, so perhaps that makes more sense.
> > > 
> > > Well, it's _true_ ... the PageUptodate bit means that every byte in this
> > > page is at least as new as every byte on storage.  There's no need to
> > > re-read it, which is what we'll do if we ClearPageUptodate.
> > 
> > Yes, of course. I'm just noting the inconsistent behavior between a full
> > and partially uptodate page.
> 
> Heh, well, we have no way of knowing.  We literally just threw away
> the information about which blocks are uptodate.  So the best we can
> do is work with the single bit we have.  We do know that there are no
> dirty blocks left on the page at this point (... maybe we should add a
> VM_BUG_ON(!PageUptodate && PageDirty)).
> 

Right..

> Something we could do is summarise the block uptodate information in
> the 32/64 bits of page_private without setting PagePrivate.  That would
> cause us to still allocate an iop so we can track reads/writes, but we
> might be able to avoid a few reads.
> 
> But I don't think it's worth it.  Partially uptodate pages are not what
> we should be optimising for; we should try to get & keep pages uptodate.
> After all, it's a page cache ;-)
> 

Fair enough. I was thinking about whether we could ensure the page is
released if releasepage() effectively invalidated the page content (or
avoid the release if we know the mapping won't be removed), but that
appears to be nontrivial given the refcount interdependencies between
page private and removing the mapping. I.e., the private data can hold a
reference on the page and remove_mapping() wants to assume that the
caller and page cache hold the last references on the page.

Brian
Matthew Wilcox (Oracle) Sept. 24, 2020, 5:56 p.m. UTC | #13
On Thu, Sep 24, 2020 at 01:26:53PM -0400, Brian Foster wrote:
> On Thu, Sep 24, 2020 at 04:22:11PM +0100, Matthew Wilcox wrote:
> > On Thu, Sep 24, 2020 at 11:12:59AM -0400, Brian Foster wrote:
> > > On Thu, Sep 24, 2020 at 02:59:00PM +0100, Matthew Wilcox wrote:
> > > > On Thu, Sep 24, 2020 at 09:12:35AM -0400, Brian Foster wrote:
> > > > > On Thu, Sep 24, 2020 at 01:56:08PM +0100, Matthew Wilcox (Oracle) wrote:
> > > > > > For filesystems with block size < page size, we need to set all the
> > > > > > per-block uptodate bits if the page was already uptodate at the time
> > > > > > we create the per-block metadata.  This can happen if the page is
> > > > > > invalidated (eg by a write to drop_caches) but ultimately not removed
> > > > > > from the page cache.
> > > > > > 
> > > > > > This is a data corruption issue as page writeback skips blocks which
> > > > > > are marked !uptodate.
> > > > > 
> > > > > Thanks. Based on my testing of clearing PageUptodate here I suspect this
> > > > > will similarly prevent the problem, but I'll give this a test
> > > > > nonetheless. 
> > > > > 
> > > > > I am a little curious why we'd prefer to fill the iop here rather than
> > > > > just clear the page state if the iop data has been released. If the page
> > > > > is partially uptodate, then we end up having to re-read the page
> > > > > anyways, right? OTOH, I guess this behavior is more consistent with page
> > > > > size == block size filesystems where iop wouldn't exist and we just go
> > > > > by page state, so perhaps that makes more sense.
> > > > 
> > > > Well, it's _true_ ... the PageUptodate bit means that every byte in this
> > > > page is at least as new as every byte on storage.  There's no need to
> > > > re-read it, which is what we'll do if we ClearPageUptodate.
> > > 
> > > Yes, of course. I'm just noting the inconsistent behavior between a full
> > > and partially uptodate page.
> > 
> > Heh, well, we have no way of knowing.  We literally just threw away
> > the information about which blocks are uptodate.  So the best we can
> > do is work with the single bit we have.  We do know that there are no
> > dirty blocks left on the page at this point (... maybe we should add a
> > VM_BUG_ON(!PageUptodate && PageDirty)).
> > 
> 
> Right..
> 
> > Something we could do is summarise the block uptodate information in
> > the 32/64 bits of page_private without setting PagePrivate.  That would
> > cause us to still allocate an iop so we can track reads/writes, but we
> > might be able to avoid a few reads.
> > 
> > But I don't think it's worth it.  Partially uptodate pages are not what
> > we should be optimising for; we should try to get & keep pages uptodate.
> > After all, it's a page cache ;-)
> > 
> 
> Fair enough. I was thinking about whether we could ensure the page is
> released if releasepage() effectively invalidated the page content (or
> avoid the release if we know the mapping won't be removed), but that
> appears to be nontrivial given the refcount interdependencies between
> page private and removing the mapping. I.e., the private data can hold a
> reference on the page and remove_mapping() wants to assume that the
> caller and page cache hold the last references on the page.

We could fix that -- remove_mapping() could take into account
page_has_private() in its call to page_ref_freeze() -- ie:

-       refcount = 1 + compound_nr(page);
+	retcount = 1 + compound_nr(page) + page_has_private(page);

like some other parts of the VM do.  And then the filesystem could
detach_page_private() in its aops->freepage() (which XFS aops don't
currently use).

That change might be a little larger than would be appreciated for a
data corruption fix going back two years.  And there's already other
reasons for wanting to be able to create an iop for an Uptodate page
(ie the THP patchset).
Sedat Dilek Sept. 24, 2020, 6:27 p.m. UTC | #14
On Thu, Sep 24, 2020 at 6:36 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Sep 24, 2020 at 06:19:03PM +0200, Sedat Dilek wrote:
> > On Thu, Sep 24, 2020 at 5:27 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Thu, Sep 24, 2020 at 05:21:00PM +0200, Sedat Dilek wrote:
> > > > Great and thanks.
> > > >
> > > > Can you send out a seperate patch and label it with "PATCH v5.9"?
> > > > I run:
> > > > $ git format-patch -1 --subject-prefix="PATCH v5.9" --signoff
> > > >
> > > > Normally, I catch patches from any patchwork URL in mbox format.
> > >
> > > Maybe wait a few hours for people to decide if they like the approach
> > > taken to fix the bug before diving into producing backports?
> >
> > That make sense.
> >
> > You have a test-case for me?
> > I have here Linux-Test-Project and FIO available.
>
> Qian reported preadv203.c could reproduce it easily on POWER and ARM.
> They have 64kB pages, so it's easier to hit.  You need to have a
> filesystem with block size < page size to hit the problem.
>
> If you want to check that your test case hits the problem, stick a printk
> in iomap_page_create().

I run both linux-kernel on my Debian/unstable AMD64 host (means not in
a VM) with and without your patch.

Instructions:
cd /opt/ltp
./runltp -f syscalls -s preadv203

Unfortunately, the logs in the "results" directory have only the short summary.

Testcase                                           Result     Exit Value
--------                                           ------     ----------
preadv203                                          PASS       0
preadv203_64                                       PASS       0

So, I guess I am not hitting the issue?
Or do I miss some important kernel-config?

- Sedat -
Matthew Wilcox (Oracle) Sept. 24, 2020, 6:44 p.m. UTC | #15
On Thu, Sep 24, 2020 at 08:27:26PM +0200, Sedat Dilek wrote:
> On Thu, Sep 24, 2020 at 6:36 PM Matthew Wilcox <willy@infradead.org> wrote:
> > On Thu, Sep 24, 2020 at 06:19:03PM +0200, Sedat Dilek wrote:
> > > On Thu, Sep 24, 2020 at 5:27 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Thu, Sep 24, 2020 at 05:21:00PM +0200, Sedat Dilek wrote:
> > > > > Great and thanks.
> > > > >
> > > > > Can you send out a seperate patch and label it with "PATCH v5.9"?
> > > > > I run:
> > > > > $ git format-patch -1 --subject-prefix="PATCH v5.9" --signoff
> > > > >
> > > > > Normally, I catch patches from any patchwork URL in mbox format.
> > > >
> > > > Maybe wait a few hours for people to decide if they like the approach
> > > > taken to fix the bug before diving into producing backports?
> > >
> > > That make sense.
> > >
> > > You have a test-case for me?
> > > I have here Linux-Test-Project and FIO available.
> >
> > Qian reported preadv203.c could reproduce it easily on POWER and ARM.
> > They have 64kB pages, so it's easier to hit.  You need to have a
> > filesystem with block size < page size to hit the problem.
> >
> > If you want to check that your test case hits the problem, stick a printk
> > in iomap_page_create().
> 
> I run both linux-kernel on my Debian/unstable AMD64 host (means not in
> a VM) with and without your patch.
> 
> Instructions:
> cd /opt/ltp
> ./runltp -f syscalls -s preadv203
> 
> Unfortunately, the logs in the "results" directory have only the short summary.
> 
> Testcase                                           Result     Exit Value
> --------                                           ------     ----------
> preadv203                                          PASS       0
> preadv203_64                                       PASS       0
> 
> So, I guess I am not hitting the issue?
> Or do I miss some important kernel-config?

Probably.  I don't know how to tell LTP to use a 1kB block size
filesystem.  I tried running that testcase by itself and the logs made
it clear it was using a 4kB block size filesystem.
Qian Cai Sept. 24, 2020, 6:47 p.m. UTC | #16
On Thu, 2020-09-24 at 20:27 +0200, Sedat Dilek wrote:
> I run both linux-kernel on my Debian/unstable AMD64 host (means not in
> a VM) with and without your patch.
> 
> Instructions:
> cd /opt/ltp
> ./runltp -f syscalls -s preadv203
> 
> Unfortunately, the logs in the "results" directory have only the short
> summary.
> 
> Testcase                                           Result     Exit Value
> --------                                           ------     ----------
> preadv203                                          PASS       0
> preadv203_64                                       PASS       0
> 
> So, I guess I am not hitting the issue?
> Or do I miss some important kernel-config?

https://gitlab.com/cailca/linux-mm/-/blob/master/arm64.config

That is .config I used to reproduce. Then, I ran the linux-mm testsuite (lots of
hard-coded places because I only need to run this on RHEL8) first:

# git clone https://gitlab.com/cailca/linux-mm
# cd linux-mm; make
# ./random -k

Then, run the whole LTP syscalls:
# ./runltp -f syscalls

If that is still not triggered, it needs some syscall fuzzing:
# ./random -x 0-100 -f
Sedat Dilek Sept. 24, 2020, 7:54 p.m. UTC | #17
On Thu, Sep 24, 2020 at 8:47 PM Qian Cai <cai@redhat.com> wrote:
>
> On Thu, 2020-09-24 at 20:27 +0200, Sedat Dilek wrote:
> > I run both linux-kernel on my Debian/unstable AMD64 host (means not in
> > a VM) with and without your patch.
> >
> > Instructions:
> > cd /opt/ltp
> > ./runltp -f syscalls -s preadv203
> >
> > Unfortunately, the logs in the "results" directory have only the short
> > summary.
> >
> > Testcase                                           Result     Exit Value
> > --------                                           ------     ----------
> > preadv203                                          PASS       0
> > preadv203_64                                       PASS       0
> >
> > So, I guess I am not hitting the issue?
> > Or do I miss some important kernel-config?
>
> https://gitlab.com/cailca/linux-mm/-/blob/master/arm64.config
>
> That is .config I used to reproduce. Then, I ran the linux-mm testsuite (lots of
> hard-coded places because I only need to run this on RHEL8) first:
>
> # git clone https://gitlab.com/cailca/linux-mm
> # cd linux-mm; make
> # ./random -k
>
> Then, run the whole LTP syscalls:
> # ./runltp -f syscalls
>

Doing this right now.

> If that is still not triggered, it needs some syscall fuzzing:
> # ./random -x 0-100 -f
>

Out of curiosity:
You are named in "mm: fix misplaced unlock_page in do_wp_page()".
Is this here a different issue?

- Sedat -

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be068f29034fb00530a053d18b8cf140c32b12b3
Matthew Wilcox (Oracle) Sept. 24, 2020, 8:02 p.m. UTC | #18
On Thu, Sep 24, 2020 at 09:54:36PM +0200, Sedat Dilek wrote:
> You are named in "mm: fix misplaced unlock_page in do_wp_page()".
> Is this here a different issue?

Yes, completely different.  That bug is one Linus introduced in this
cycle; the bug that this patch fixes was introduced a couple of years
ago, and we only noticed now because I added an assertion to -next.
Maybe I should add the assertion for 5.9 too.
Sedat Dilek Sept. 24, 2020, 8:04 p.m. UTC | #19
On Thu, Sep 24, 2020 at 10:02 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Sep 24, 2020 at 09:54:36PM +0200, Sedat Dilek wrote:
> > You are named in "mm: fix misplaced unlock_page in do_wp_page()".
> > Is this here a different issue?
>
> Yes, completely different.  That bug is one Linus introduced in this
> cycle; the bug that this patch fixes was introduced a couple of years
> ago, and we only noticed now because I added an assertion to -next.
> Maybe I should add the assertion for 5.9 too.

Can you point me to this "assertion"?
Thanks.

- Sedat -
Matthew Wilcox (Oracle) Sept. 24, 2020, 11:57 p.m. UTC | #20
On Thu, Sep 24, 2020 at 10:04:40PM +0200, Sedat Dilek wrote:
> On Thu, Sep 24, 2020 at 10:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Thu, Sep 24, 2020 at 09:54:36PM +0200, Sedat Dilek wrote:
> > > You are named in "mm: fix misplaced unlock_page in do_wp_page()".
> > > Is this here a different issue?
> >
> > Yes, completely different.  That bug is one Linus introduced in this
> > cycle; the bug that this patch fixes was introduced a couple of years
> > ago, and we only noticed now because I added an assertion to -next.
> > Maybe I should add the assertion for 5.9 too.
> 
> Can you point me to this "assertion"?
> Thanks.

Here's the version against 5.8

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 810f7dae11d9..b421e4efc4bd 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -70,11 +70,15 @@ static void
 iomap_page_release(struct page *page)
 {
 	struct iomap_page *iop = detach_page_private(page);
+	unsigned int nr_blocks = PAGE_SIZE / i_blocksize(page->mapping->host);
 
 	if (!iop)
 		return;
 	WARN_ON_ONCE(atomic_read(&iop->read_count));
 	WARN_ON_ONCE(atomic_read(&iop->write_count));
+	WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
+			PageUptodate(page));
+
 	kfree(iop);
 }
Sedat Dilek Sept. 25, 2020, 2:13 a.m. UTC | #21
On Fri, Sep 25, 2020 at 1:57 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Sep 24, 2020 at 10:04:40PM +0200, Sedat Dilek wrote:
> > On Thu, Sep 24, 2020 at 10:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Thu, Sep 24, 2020 at 09:54:36PM +0200, Sedat Dilek wrote:
> > > > You are named in "mm: fix misplaced unlock_page in do_wp_page()".
> > > > Is this here a different issue?
> > >
> > > Yes, completely different.  That bug is one Linus introduced in this
> > > cycle; the bug that this patch fixes was introduced a couple of years
> > > ago, and we only noticed now because I added an assertion to -next.
> > > Maybe I should add the assertion for 5.9 too.
> >
> > Can you point me to this "assertion"?
> > Thanks.
>
> Here's the version against 5.8
>

5.8? I am here on 5.9.

- Sedat -

> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 810f7dae11d9..b421e4efc4bd 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -70,11 +70,15 @@ static void
>  iomap_page_release(struct page *page)
>  {
>         struct iomap_page *iop = detach_page_private(page);
> +       unsigned int nr_blocks = PAGE_SIZE / i_blocksize(page->mapping->host);
>
>         if (!iop)
>                 return;
>         WARN_ON_ONCE(atomic_read(&iop->read_count));
>         WARN_ON_ONCE(atomic_read(&iop->write_count));
> +       WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
> +                       PageUptodate(page));
> +
>         kfree(iop);
>  }
>
Sedat Dilek Sept. 25, 2020, 10:44 a.m. UTC | #22
On Fri, Sep 25, 2020 at 1:57 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Sep 24, 2020 at 10:04:40PM +0200, Sedat Dilek wrote:
> > On Thu, Sep 24, 2020 at 10:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Thu, Sep 24, 2020 at 09:54:36PM +0200, Sedat Dilek wrote:
> > > > You are named in "mm: fix misplaced unlock_page in do_wp_page()".
> > > > Is this here a different issue?
> > >
> > > Yes, completely different.  That bug is one Linus introduced in this
> > > cycle; the bug that this patch fixes was introduced a couple of years
> > > ago, and we only noticed now because I added an assertion to -next.
> > > Maybe I should add the assertion for 5.9 too.
> >
> > Can you point me to this "assertion"?
> > Thanks.
>
> Here's the version against 5.8
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 810f7dae11d9..b421e4efc4bd 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -70,11 +70,15 @@ static void
>  iomap_page_release(struct page *page)
>  {
>         struct iomap_page *iop = detach_page_private(page);
> +       unsigned int nr_blocks = PAGE_SIZE / i_blocksize(page->mapping->host);
>
>         if (!iop)
>                 return;
>         WARN_ON_ONCE(atomic_read(&iop->read_count));
>         WARN_ON_ONCE(atomic_read(&iop->write_count));
> +       WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
> +                       PageUptodate(page));
> +

Are you sure this is "bitmap_full()" or should it be "bitmap_f*i*ll()"?

Both are available in include/linux/bitmap.h.

- Sedat -

>         kfree(iop);
>  }
>
Sedat Dilek Sept. 25, 2020, 11:12 a.m. UTC | #23
On Fri, Sep 25, 2020 at 12:44 PM Sedat Dilek <sedat.dilek@gmail.com> wrote:
>
> On Fri, Sep 25, 2020 at 1:57 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Thu, Sep 24, 2020 at 10:04:40PM +0200, Sedat Dilek wrote:
> > > On Thu, Sep 24, 2020 at 10:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Thu, Sep 24, 2020 at 09:54:36PM +0200, Sedat Dilek wrote:
> > > > > You are named in "mm: fix misplaced unlock_page in do_wp_page()".
> > > > > Is this here a different issue?
> > > >
> > > > Yes, completely different.  That bug is one Linus introduced in this
> > > > cycle; the bug that this patch fixes was introduced a couple of years
> > > > ago, and we only noticed now because I added an assertion to -next.
> > > > Maybe I should add the assertion for 5.9 too.
> > >
> > > Can you point me to this "assertion"?
> > > Thanks.
> >
> > Here's the version against 5.8
> >
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 810f7dae11d9..b421e4efc4bd 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -70,11 +70,15 @@ static void
> >  iomap_page_release(struct page *page)
> >  {
> >         struct iomap_page *iop = detach_page_private(page);
> > +       unsigned int nr_blocks = PAGE_SIZE / i_blocksize(page->mapping->host);
> >
> >         if (!iop)
> >                 return;
> >         WARN_ON_ONCE(atomic_read(&iop->read_count));
> >         WARN_ON_ONCE(atomic_read(&iop->write_count));
> > +       WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
> > +                       PageUptodate(page));
> > +
>
> Are you sure this is "bitmap_full()" or should it be "bitmap_f*i*ll()"?
>
> Both are available in include/linux/bitmap.h.
>

OK, I checked linux-next (next-20200925) and iomap_page_release() (see
[1] and [2]).
Cut-N-Paste is malformatting here in Gmail, so I add the links below.

I also looked into __gfs2_readpage() in fs/gfs2/aops.c:

if (i_blocksize(page->mapping->host) == PAGE_SIZE &&
    !page_has_buffers(page)) {
           error = iomap_readpage(page, &gfs2_iomap_ops);

Thanks.

- Sedat -

[1] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/fs/iomap/buffered-io.c?h=next-20200925#n67
[2] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/fs/iomap/buffered-io.c?h=next-20200925#n77
[3] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/fs/gfs2/aops.c?h=next-20200925#n471



>
> >         kfree(iop);
> >  }
> >
Sedat Dilek Sept. 25, 2020, 1:36 p.m. UTC | #24
On Fri, Sep 25, 2020 at 3:24 PM Sedat Dilek <sedat.dilek@gmail.com> wrote:
>
> On Fri, Sep 25, 2020 at 1:57 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Thu, Sep 24, 2020 at 10:04:40PM +0200, Sedat Dilek wrote:
> > > On Thu, Sep 24, 2020 at 10:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Thu, Sep 24, 2020 at 09:54:36PM +0200, Sedat Dilek wrote:
> > > > > You are named in "mm: fix misplaced unlock_page in do_wp_page()".
> > > > > Is this here a different issue?
> > > >
> > > > Yes, completely different.  That bug is one Linus introduced in this
> > > > cycle; the bug that this patch fixes was introduced a couple of years
> > > > ago, and we only noticed now because I added an assertion to -next.
> > > > Maybe I should add the assertion for 5.9 too.
> > >
> > > Can you point me to this "assertion"?
> > > Thanks.
> >
> > Here's the version against 5.8
> >
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 810f7dae11d9..b421e4efc4bd 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -70,11 +70,15 @@ static void
> >  iomap_page_release(struct page *page)
> >  {
> >         struct iomap_page *iop = detach_page_private(page);
> > +       unsigned int nr_blocks = PAGE_SIZE / i_blocksize(page->mapping->host);
> >
> >         if (!iop)
> >                 return;
> >         WARN_ON_ONCE(atomic_read(&iop->read_count));
> >         WARN_ON_ONCE(atomic_read(&iop->write_count));
> > +       WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
> > +                       PageUptodate(page));
> > +
> >         kfree(iop);
> >  }
> >
>
> I have applied your diff on top of Linux v5.9-rc6+ together with
> "iomap: Set all uptodate bits for an Uptodate page".
>
> Run LTP tests:
>
> #1: syscalls (all)
> #2: syscalls/preadv203
> #3: syscalls/dirtyc0w
>
> With #1 I see some failures with madvise0x tests.
>
> I have attached the excerpt of my kernel-log and my kernel-config.
>

I have run syscalls/madvise01..syscalls/madvise10, see attached
(reduced) dmesg-log.

- Sedat -
[Fri Sep 25 15:29:32 2020] LTP: starting madvise01
[Fri Sep 25 15:29:32 2020] Injecting memory failure for pfn 0x17d70a at process virtual address 0x7f6f1d5ed000
[Fri Sep 25 15:29:32 2020] Memory failure: 0x17d70a: recovery action for clean LRU page: Recovered
[Fri Sep 25 15:29:32 2020] Injecting memory failure for pfn 0x1bf021 at process virtual address 0x7f6f1d5ee000
[Fri Sep 25 15:29:32 2020] Memory failure: 0x1bf021: recovery action for clean LRU page: Recovered
[Fri Sep 25 15:29:32 2020] Injecting memory failure for pfn 0x1f1bff at process virtual address 0x7f6f1d5ef000
[Fri Sep 25 15:29:32 2020] Memory failure: 0x1f1bff: recovery action for clean LRU page: Recovered
[Fri Sep 25 15:29:32 2020] Injecting memory failure for pfn 0x1ecd4e at process virtual address 0x7f6f1d5f0000
[Fri Sep 25 15:29:32 2020] Memory failure: 0x1ecd4e: recovery action for clean LRU page: Recovered
[Fri Sep 25 15:29:32 2020] Injecting memory failure for pfn 0x2090c2 at process virtual address 0x7f6f1d5f1000
[Fri Sep 25 15:29:32 2020] Memory failure: 0x2090c2: recovery action for clean LRU page: Recovered
[Fri Sep 25 15:29:32 2020] Injecting memory failure for pfn 0x115260 at process virtual address 0x7f6f1d5f2000
[Fri Sep 25 15:29:32 2020] Memory failure: 0x115260: recovery action for clean LRU page: Recovered
[Fri Sep 25 15:29:32 2020] Injecting memory failure for pfn 0x182902 at process virtual address 0x7f6f1d5f3000
[Fri Sep 25 15:29:32 2020] Memory failure: 0x182902: recovery action for clean LRU page: Recovered
[Fri Sep 25 15:29:32 2020] Injecting memory failure for pfn 0x125c05 at process virtual address 0x7f6f1d5f4000
[Fri Sep 25 15:29:32 2020] Memory failure: 0x125c05: recovery action for clean LRU page: Recovered
[Fri Sep 25 15:29:32 2020] Injecting memory failure for pfn 0x127d91 at process virtual address 0x7f6f1d5f5000
[Fri Sep 25 15:29:32 2020] Memory failure: 0x127d91: recovery action for clean LRU page: Recovered
[Fri Sep 25 15:29:32 2020] Injecting memory failure for pfn 0x187df3 at process virtual address 0x7f6f1d5f6000
[Fri Sep 25 15:29:32 2020] Memory failure: 0x187df3: recovery action for clean LRU page: Recovered
[Fri Sep 25 15:29:32 2020] LTP: starting madvise02
[Fri Sep 25 15:29:33 2020] LTP: starting madvise05
[Fri Sep 25 15:29:33 2020] LTP: starting madvise06
[Fri Sep 25 15:29:34 2020] madvise06 (539269): drop_caches: 3
[Fri Sep 25 15:29:46 2020] LTP: starting madvise07
[Fri Sep 25 15:29:46 2020] Injecting memory failure for pfn 0x1f7993 at process virtual address 0x7fbe67bfb000
[Fri Sep 25 15:29:46 2020] Memory failure: 0x1f7993: recovery action for dirty LRU page: Recovered
[Fri Sep 25 15:29:46 2020] MCE: Killing madvise07:539405 due to hardware memory corruption fault at 7fbe67bfb000
[Fri Sep 25 15:29:46 2020] LTP: starting madvise08
[Fri Sep 25 15:29:46 2020] LTP: starting madvise09
[Fri Sep 25 15:29:47 2020] madvise09 invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[Fri Sep 25 15:29:47 2020] CPU: 1 PID: 539680 Comm: madvise09 Tainted: G            E     5.9.0-rc6-5-amd64-clang-cfi #5~bullseye+dileks1
[Fri Sep 25 15:29:47 2020] Hardware name: SAMSUNG ELECTRONICS CO., LTD. 530U3BI/530U4BI/530U4BH/530U3BI/530U4BI/530U4BH, BIOS 13XK 03/28/2013
[Fri Sep 25 15:29:47 2020] Call Trace:
[Fri Sep 25 15:29:47 2020]  dump_stack+0x64/0x9b
[Fri Sep 25 15:29:47 2020]  dump_header+0x50/0x230
[Fri Sep 25 15:29:47 2020]  oom_kill_process+0xa1/0x170
[Fri Sep 25 15:29:47 2020]  out_of_memory+0x265/0x330
[Fri Sep 25 15:29:47 2020]  mem_cgroup_oom+0x313/0x360
[Fri Sep 25 15:29:47 2020]  try_charge+0x51f/0x730
[Fri Sep 25 15:29:47 2020]  mem_cgroup_charge+0x100/0x300
[Fri Sep 25 15:29:47 2020]  do_anonymous_page+0x229/0x690
[Fri Sep 25 15:29:47 2020]  ? __schedule+0x41a/0x7c0
[Fri Sep 25 15:29:47 2020]  ? dma_get_required_mask.cfi_jt+0x10/0x10
[Fri Sep 25 15:29:47 2020]  handle_mm_fault+0x8b0/0xcc0
[Fri Sep 25 15:29:47 2020]  do_user_addr_fault+0x201/0x3a0
[Fri Sep 25 15:29:47 2020]  ? asm_exc_page_fault+0x8/0x30
[Fri Sep 25 15:29:47 2020]  exc_page_fault+0x7a/0x270
[Fri Sep 25 15:29:47 2020]  ? asm_exc_page_fault+0x8/0x30
[Fri Sep 25 15:29:47 2020]  asm_exc_page_fault+0x1e/0x30
[Fri Sep 25 15:29:47 2020] RIP: 0033:0x55b8f6d6655b
[Fri Sep 25 15:29:47 2020] Code: 48 c1 ea 02 48 89 d0 48 f7 e5 48 89 d0 48 83 e2 fc 48 c1 e8 02 48 01 c2 48 8d 04 92 4c 89 f2 49 83 c6 01 48 c1 e0 02 48 29 c2 <41> 88 17 49 01 df e8 3a f5 ff ff 49 81 fe f4 01 00 00 75 b9 44 8b
[Fri Sep 25 15:29:47 2020] RSP: 002b:00007ffed18e2ba0 EFLAGS: 00010202
[Fri Sep 25 15:29:47 2020] RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000000
[Fri Sep 25 15:29:47 2020] RDX: 000000000000001f RSI: 0000000000000000 RDI: 0000000000000001
[Fri Sep 25 15:29:47 2020] RBP: 28f5c28f5c28f5c3 R08: 00000000ffffffff R09: 0000000000000000
[Fri Sep 25 15:29:47 2020] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffed18e2bd0
[Fri Sep 25 15:29:47 2020] R13: 00000000001f4000 R14: 0000000000000020 R15: 00007f2bed355000
[Fri Sep 25 15:29:47 2020] memory: usage 8192kB, limit 8192kB, failcnt 485
[Fri Sep 25 15:29:47 2020] memory+swap: usage 16316kB, limit 16384kB, failcnt 15
[Fri Sep 25 15:29:47 2020] kmem: usage 124kB, limit 9007199254740988kB, failcnt 0
[Fri Sep 25 15:29:47 2020] Memory cgroup stats for /ltp_madvise09_539679:
[Fri Sep 25 15:29:47 2020] anon 8110080
[Fri Sep 25 15:29:47 2020] Tasks state (memory values in pages):
[Fri Sep 25 15:29:47 2020] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[Fri Sep 25 15:29:47 2020] [ 539679]     0 539679      746      310    45056        8             0 madvise09
[Fri Sep 25 15:29:47 2020] [ 539680]     0 539680     5246     2163    81920     2039             0 madvise09
[Fri Sep 25 15:29:47 2020] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/ltp_madvise09_539679,task_memcg=/ltp_madvise09_539679,task=madvise09,pid=539680,uid=0
[Fri Sep 25 15:29:47 2020] Memory cgroup out of memory: Killed process 539680 (madvise09) total-vm:20984kB, anon-rss:8024kB, file-rss:628kB, shmem-rss:0kB, UID:0 pgtables:80kB oom_score_adj:0
[Fri Sep 25 15:29:47 2020] oom_reaper: reaped process 539680 (madvise09), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[Fri Sep 25 15:29:47 2020] LTP: starting madvise10
Matthew Wilcox (Oracle) Sept. 25, 2020, 1:46 p.m. UTC | #25
On Fri, Sep 25, 2020 at 03:36:01PM +0200, Sedat Dilek wrote:
> > I have applied your diff on top of Linux v5.9-rc6+ together with
> > "iomap: Set all uptodate bits for an Uptodate page".
> >
> > Run LTP tests:
> >
> > #1: syscalls (all)
> > #2: syscalls/preadv203
> > #3: syscalls/dirtyc0w
> >
> > With #1 I see some failures with madvise0x tests.

Why do you think these failures are related to my patches?

> [Fri Sep 25 15:29:46 2020] LTP: starting madvise09
> [Fri Sep 25 15:29:47 2020] madvise09 invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
> [Fri Sep 25 15:29:47 2020] CPU: 1 PID: 539680 Comm: madvise09 Tainted: G            E     5.9.0-rc6-5-amd64-clang-cfi #5~bullseye+dileks1
> [Fri Sep 25 15:29:47 2020] Hardware name: SAMSUNG ELECTRONICS CO., LTD. 530U3BI/530U4BI/530U4BH/530U3BI/530U4BI/530U4BH, BIOS 13XK 03/28/2013
> [Fri Sep 25 15:29:47 2020] Call Trace:
> [Fri Sep 25 15:29:47 2020]  dump_stack+0x64/0x9b
> [Fri Sep 25 15:29:47 2020]  dump_header+0x50/0x230
> [Fri Sep 25 15:29:47 2020]  oom_kill_process+0xa1/0x170
> [Fri Sep 25 15:29:47 2020]  out_of_memory+0x265/0x330
> [Fri Sep 25 15:29:47 2020]  mem_cgroup_oom+0x313/0x360
> [Fri Sep 25 15:29:47 2020]  try_charge+0x51f/0x730
> [Fri Sep 25 15:29:47 2020]  mem_cgroup_charge+0x100/0x300
> [Fri Sep 25 15:29:47 2020]  do_anonymous_page+0x229/0x690

... madvise09 took a page fault and was killed by mem_cgroup_oom

> [Fri Sep 25 15:29:47 2020] memory: usage 8192kB, limit 8192kB, failcnt 485

because it had used up all the memory it was allowed to
Sedat Dilek Sept. 25, 2020, 2:01 p.m. UTC | #26
On Fri, Sep 25, 2020 at 3:46 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Sep 25, 2020 at 03:36:01PM +0200, Sedat Dilek wrote:
> > > I have applied your diff on top of Linux v5.9-rc6+ together with
> > > "iomap: Set all uptodate bits for an Uptodate page".
> > >
> > > Run LTP tests:
> > >
> > > #1: syscalls (all)
> > > #2: syscalls/preadv203
> > > #3: syscalls/dirtyc0w
> > >
> > > With #1 I see some failures with madvise0x tests.
>
> Why do you think these failures are related to my patches?
>

Oh sorry, I was not saying it is related to your patches and I am not
familiar with all syscalls LTP tests.

You said:
> Qian reported preadv203.c could reproduce it easily on POWER and ARM.
> They have 64kB pages, so it's easier to hit.  You need to have a
> filesystem with block size < page size to hit the problem.

Here on my x86-64 Debian host I use Ext4-FS.
I can setup a new partition with a different filesystem if this helps.
Any recommendations?

How does the assertion look like in the logs?
You have an example.

Here on Debian I switched over to a newer kernel-libc and
kernel-headers - I guess it's good to recompile a recent LTP from Git
against this new stuff.
In my install-logs I have seen some packages where re-built against
new kernel-libc (capabilities etc.).

- Sedat -


- Sedat -
Matthew Wilcox (Oracle) Sept. 25, 2020, 3:53 p.m. UTC | #27
On Fri, Sep 25, 2020 at 04:01:02PM +0200, Sedat Dilek wrote:
> On Fri, Sep 25, 2020 at 3:46 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, Sep 25, 2020 at 03:36:01PM +0200, Sedat Dilek wrote:
> > > > I have applied your diff on top of Linux v5.9-rc6+ together with
> > > > "iomap: Set all uptodate bits for an Uptodate page".
> > > >
> > > > Run LTP tests:
> > > >
> > > > #1: syscalls (all)
> > > > #2: syscalls/preadv203
> > > > #3: syscalls/dirtyc0w
> > > >
> > > > With #1 I see some failures with madvise0x tests.
> >
> > Why do you think these failures are related to my patches?
> 
> Oh sorry, I was not saying it is related to your patches and I am not
> familiar with all syscalls LTP tests.

It's probably a good idea to become familiar with the tests.  I'm not,
but a good way to work with any test-suite is to run it against a
presumed-good kernel, then against a kernel with changes and see whether
the failures change.

> You said:
> > Qian reported preadv203.c could reproduce it easily on POWER and ARM.
> > They have 64kB pages, so it's easier to hit.  You need to have a
> > filesystem with block size < page size to hit the problem.
> 
> Here on my x86-64 Debian host I use Ext4-FS.
> I can setup a new partition with a different filesystem if this helps.
> Any recommendations?

If I understand the output from preadv203 correctly, it sets up a loop
block device with a new filesystem on it, so it doesn't matter what your
host fs is.  What I don't know is how to change the block size for that
filesystem.

> How does the assertion look like in the logs?
> You have an example.

I happen to have one from my testing last night:

0006 ------------[ cut here ]------------
0006 WARNING: CPU: 5 PID: 1417 at fs/iomap/buffered-io.c:80 iomap_page_release+0xb1/0xc0
0006 bam!
0006 Modules linked in:
0006 CPU: 5 PID: 1417 Comm: fio Kdump: loaded Not tainted 5.8.0-00001-g51f85a97ccdd-dirty #54
0006 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1 04/01/2014
0006 RIP: 0010:iomap_page_release+0xb1/0xc0
0006 Code: 45 d9 48 8b 03 48 c1 e8 02 83 e0 01 75 13 38 d0 75 18 4c 89 ef e8 1f 6a f8 ff 5b 41 5c 41 5d 5d c3 eb eb e8 e1 07 f4 ff eb 8c <0f> 0b eb e4 0f 0b eb a8 0f 0b eb ac 0f 1f 00 55 48 89 e5 41 56 41
0006 RSP: 0018:ffffc90001ed3a40 EFLAGS: 00010202
0006 RAX: 0000000000000001 RBX: ffffea0001458ec0 RCX: ffffffff81cf75a7
0006 RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff8880727d1f90
0006 RBP: ffffc90001ed3a58 R08: 0000000000000000 R09: ffff888051ddd6e8
0006 R10: 0000000000000005 R11: 0000000000000230 R12: 0000000000000004
0006 R13: ffff8880727d1f80 R14: 0000000000000005 R15: ffffea0001458ec0
0006 FS:  00007fe4bdd9df00(0000) GS:ffff88807f540000(0000) knlGS:0000000000000000
0006 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
0006 CR2: 00007fe4bdd50000 CR3: 000000006f7e6005 CR4: 0000000000360ea0
0006 Call Trace:
0006  iomap_releasepage+0x58/0xc0
0006  try_to_release_page+0x4b/0x60
0006  invalidate_inode_pages2_range+0x38b/0x3f0

I would suggest that you try applying just the assertion to Linus'
kernel, then try to make it fire.  Then apply the fix and see if you
can still make the assertion fire.

FWIW, I got it to fire with generic/095 from the xfstests test suite.
Darrick J. Wong Sept. 25, 2020, 6:17 p.m. UTC | #28
On Thu, Sep 24, 2020 at 01:56:08PM +0100, Matthew Wilcox (Oracle) wrote:
> For filesystems with block size < page size, we need to set all the
> per-block uptodate bits if the page was already uptodate at the time
> we create the per-block metadata.  This can happen if the page is
> invalidated (eg by a write to drop_caches) but ultimately not removed
> from the page cache.
> 
> This is a data corruption issue as page writeback skips blocks which
> are marked !uptodate.
> 
> Fixes: 9dc55f1389f9 ("iomap: add support for sub-pagesize buffered I/O without buffer heads")
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Reported-by: Qian Cai <cai@redhat.com>
> Cc: Brian Foster <bfoster@redhat.com>

Makes sense to me,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/iomap/buffered-io.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 8b6cca7e34e4..8180061b9e16 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -60,6 +60,8 @@ iomap_page_create(struct inode *inode, struct page *page)
>  	iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
>  			GFP_NOFS | __GFP_NOFAIL);
>  	spin_lock_init(&iop->uptodate_lock);
> +	if (PageUptodate(page))
> +		bitmap_fill(iop->uptodate, nr_blocks);
>  	attach_page_private(page, iop);
>  	return iop;
>  }
> -- 
> 2.28.0
>
Sedat Dilek Sept. 26, 2020, 7:12 p.m. UTC | #29
On Fri, Sep 25, 2020 at 5:53 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Sep 25, 2020 at 04:01:02PM +0200, Sedat Dilek wrote:
> > On Fri, Sep 25, 2020 at 3:46 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Fri, Sep 25, 2020 at 03:36:01PM +0200, Sedat Dilek wrote:
> > > > > I have applied your diff on top of Linux v5.9-rc6+ together with
> > > > > "iomap: Set all uptodate bits for an Uptodate page".
> > > > >
> > > > > Run LTP tests:
> > > > >
> > > > > #1: syscalls (all)
> > > > > #2: syscalls/preadv203
> > > > > #3: syscalls/dirtyc0w
> > > > >
> > > > > With #1 I see some failures with madvise0x tests.
> > >
> > > Why do you think these failures are related to my patches?
> >
> > Oh sorry, I was not saying it is related to your patches and I am not
> > familiar with all syscalls LTP tests.
>
> It's probably a good idea to become familiar with the tests.  I'm not,
> but a good way to work with any test-suite is to run it against a
> presumed-good kernel, then against a kernel with changes and see whether
> the failures change.
>
> > You said:
> > > Qian reported preadv203.c could reproduce it easily on POWER and ARM.
> > > They have 64kB pages, so it's easier to hit.  You need to have a
> > > filesystem with block size < page size to hit the problem.
> >
> > Here on my x86-64 Debian host I use Ext4-FS.
> > I can setup a new partition with a different filesystem if this helps.
> > Any recommendations?
>
> If I understand the output from preadv203 correctly, it sets up a loop
> block device with a new filesystem on it, so it doesn't matter what your
> host fs is.  What I don't know is how to change the block size for that
> filesystem.
>
> > How does the assertion look like in the logs?
> > You have an example.
>
> I happen to have one from my testing last night:
>
> 0006 ------------[ cut here ]------------
> 0006 WARNING: CPU: 5 PID: 1417 at fs/iomap/buffered-io.c:80 iomap_page_release+0xb1/0xc0
> 0006 bam!
> 0006 Modules linked in:
> 0006 CPU: 5 PID: 1417 Comm: fio Kdump: loaded Not tainted 5.8.0-00001-g51f85a97ccdd-dirty #54
> 0006 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1 04/01/2014
> 0006 RIP: 0010:iomap_page_release+0xb1/0xc0
> 0006 Code: 45 d9 48 8b 03 48 c1 e8 02 83 e0 01 75 13 38 d0 75 18 4c 89 ef e8 1f 6a f8 ff 5b 41 5c 41 5d 5d c3 eb eb e8 e1 07 f4 ff eb 8c <0f> 0b eb e4 0f 0b eb a8 0f 0b eb ac 0f 1f 00 55 48 89 e5 41 56 41
> 0006 RSP: 0018:ffffc90001ed3a40 EFLAGS: 00010202
> 0006 RAX: 0000000000000001 RBX: ffffea0001458ec0 RCX: ffffffff81cf75a7
> 0006 RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff8880727d1f90
> 0006 RBP: ffffc90001ed3a58 R08: 0000000000000000 R09: ffff888051ddd6e8
> 0006 R10: 0000000000000005 R11: 0000000000000230 R12: 0000000000000004
> 0006 R13: ffff8880727d1f80 R14: 0000000000000005 R15: ffffea0001458ec0
> 0006 FS:  00007fe4bdd9df00(0000) GS:ffff88807f540000(0000) knlGS:0000000000000000
> 0006 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 0006 CR2: 00007fe4bdd50000 CR3: 000000006f7e6005 CR4: 0000000000360ea0
> 0006 Call Trace:
> 0006  iomap_releasepage+0x58/0xc0
> 0006  try_to_release_page+0x4b/0x60
> 0006  invalidate_inode_pages2_range+0x38b/0x3f0
>
> I would suggest that you try applying just the assertion to Linus'
> kernel, then try to make it fire.  Then apply the fix and see if you
> can still make the assertion fire.
>

Thanks.

In the meantime I built a new LTP-from-Git and Linux-kernel v5.9-rc6+.
Applied only with the iomap-assertion diff.

Unfortunately, I do not hit the assertion when running all LTP syscalls tests.

> FWIW, I got it to fire with generic/095 from the xfstests test suite.

Good to know.
It's been a long time since I used xfstests.
Is it clone xfstests Git or do I need to compile?
Can I run only one single test like generic/095 (and how)?

- Sedat -
Sedat Dilek Sept. 27, 2020, 11:31 a.m. UTC | #30
On Fri, Sep 25, 2020 at 5:53 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Sep 25, 2020 at 04:01:02PM +0200, Sedat Dilek wrote:
> > On Fri, Sep 25, 2020 at 3:46 PM Matthew Wilcox <willy@infradead.org> wrote:

[...]

> > How does the assertion look like in the logs?
> > You have an example.
>
> I happen to have one from my testing last night:
>
> 0006 ------------[ cut here ]------------
> 0006 WARNING: CPU: 5 PID: 1417 at fs/iomap/buffered-io.c:80 iomap_page_release+0xb1/0xc0
> 0006 bam!
> 0006 Modules linked in:
> 0006 CPU: 5 PID: 1417 Comm: fio Kdump: loaded Not tainted 5.8.0-00001-g51f85a97ccdd-dirty #54
> 0006 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1 04/01/2014
> 0006 RIP: 0010:iomap_page_release+0xb1/0xc0
> 0006 Code: 45 d9 48 8b 03 48 c1 e8 02 83 e0 01 75 13 38 d0 75 18 4c 89 ef e8 1f 6a f8 ff 5b 41 5c 41 5d 5d c3 eb eb e8 e1 07 f4 ff eb 8c <0f> 0b eb e4 0f 0b eb a8 0f 0b eb ac 0f 1f 00 55 48 89 e5 41 56 41
> 0006 RSP: 0018:ffffc90001ed3a40 EFLAGS: 00010202
> 0006 RAX: 0000000000000001 RBX: ffffea0001458ec0 RCX: ffffffff81cf75a7
> 0006 RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff8880727d1f90
> 0006 RBP: ffffc90001ed3a58 R08: 0000000000000000 R09: ffff888051ddd6e8
> 0006 R10: 0000000000000005 R11: 0000000000000230 R12: 0000000000000004
> 0006 R13: ffff8880727d1f80 R14: 0000000000000005 R15: ffffea0001458ec0
> 0006 FS:  00007fe4bdd9df00(0000) GS:ffff88807f540000(0000) knlGS:0000000000000000
> 0006 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 0006 CR2: 00007fe4bdd50000 CR3: 000000006f7e6005 CR4: 0000000000360ea0
> 0006 Call Trace:
> 0006  iomap_releasepage+0x58/0xc0
> 0006  try_to_release_page+0x4b/0x60
> 0006  invalidate_inode_pages2_range+0x38b/0x3f0
>
> I would suggest that you try applying just the assertion to Linus'
> kernel, then try to make it fire.  Then apply the fix and see if you
> can still make the assertion fire.
>
> FWIW, I got it to fire with generic/095 from the xfstests test suite.

With...

Linux v5.9-rc6+ up to commit a1bf fa48 745a ("Merge tag 'scsi-fixes'
of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi")
...and...

 xfstests-dev up to commit 75bd80f900ea ("src/t_mmap_dio: do not build
if !HAVE_AIO")

...I have seen in my first run of...

[ generic/095 ]

dileks@iniza:~/src/xfstests-dev/git$ sudo ./check generic/095
FSTYP         -- ext4
PLATFORM      -- Linux/x86_64 iniza 5.9.0-rc6-7-amd64-clang-cfi
#7~bullseye+dileks1 SMP 2020-
09-27
MKFS_OPTIONS  -- /dev/sdb1
MOUNT_OPTIONS -- -o acl,user_xattr /dev/sdb1 /mnt/scratch

generic/095      24s
Ran: generic/095
Passed all 1 tests

[ LC_ALL=C dmesg -T ]

[Sun Sep 27 13:05:17 2020] run fstests generic/095 at 2020-09-27 13:05:17
[Sun Sep 27 13:05:28 2020] EXT4-fs (sdb1): mounted filesystem with
ordered data mode. Opts: acl,user_xattr
[Sun Sep 27 13:05:29 2020] Page cache invalidation failure on direct
I/O.  Possible data corruption due to collision with buffered I/O!
[Sun Sep 27 13:05:29 2020] File: /mnt/scratch/file1 PID: 21389 Comm: kworker/3:1
[Sun Sep 27 13:05:29 2020] Page cache invalidation failure on direct
I/O.  Possible data corruption due to collision with buffered I/O!
[Sun Sep 27 13:05:29 2020] File: /mnt/scratch/file2 PID: 21389 Comm: kworker/3:1
[Sun Sep 27 13:05:29 2020] Page cache invalidation failure on direct
I/O.  Possible data corruption due to collision with buffered I/O!
[Sun Sep 27 13:05:29 2020] File: /mnt/scratch/file1 PID: 22228 Comm: kworker/3:2
[Sun Sep 27 13:05:29 2020] Page cache invalidation failure on direct
I/O.  Possible data corruption due to collision with buffered I/O!
[Sun Sep 27 13:05:29 2020] File: /mnt/scratch/file1 PID: 22228 Comm: kworker/3:2
[Sun Sep 27 13:05:29 2020] Page cache invalidation failure on direct
I/O.  Possible data corruption due to collision with buffered I/O!
[Sun Sep 27 13:05:29 2020] File: /mnt/scratch/file1 PID: 22578 Comm: kworker/3:3
[Sun Sep 27 13:05:29 2020] Page cache invalidation failure on direct
I/O.  Possible data corruption due to collision with buffered I/O!
[Sun Sep 27 13:05:29 2020] File: /mnt/scratch/file1 PID: 22579 Comm: kworker/3:4
[Sun Sep 27 13:05:29 2020] Page cache invalidation failure on direct
I/O.  Possible data corruption due to collision with buffered I/O!
[Sun Sep 27 13:05:29 2020] File: /mnt/scratch/file1 PID: 21389 Comm: kworker/3:1
[Sun Sep 27 13:05:29 2020] Page cache invalidation failure on direct
I/O.  Possible data corruption due to collision with buffered I/O!
[Sun Sep 27 13:05:29 2020] File: /mnt/scratch/file1 PID: 22578 Comm: kworker/3:3
[Sun Sep 27 13:05:29 2020] Page cache invalidation failure on direct
I/O.  Possible data corruption due to collision with buffered I/O!
[Sun Sep 27 13:05:29 2020] File: /mnt/scratch/file1 PID: 22558 Comm: fio
[Sun Sep 27 13:05:29 2020] Page cache invalidation failure on direct
I/O.  Possible data corruption due to collision with buffered I/O!
[Sun Sep 27 13:05:29 2020] File: /mnt/scratch/file1 PID: 22527 Comm: fio

In the next 5 runs I have not seen similar in my logs.

- Sedat -

P.S.: My local.config

# Ideally define at least these 4 to match your environment
# The first 2 are required.
# See README for other variables which can be set.
#
# Note: SCRATCH_DEV >will< get overwritten!

export TEST_DEV=/dev/sdc5
export TEST_DIR=/mnt/test
export SCRATCH_DEV=/dev/sdb1
export SCRATCH_MNT=/mnt/scratch

- EOT -
Matthew Wilcox (Oracle) Sept. 27, 2020, 12:04 p.m. UTC | #31
On Sun, Sep 27, 2020 at 01:31:15PM +0200, Sedat Dilek wrote:
> > I would suggest that you try applying just the assertion to Linus'
> > kernel, then try to make it fire.  Then apply the fix and see if you
> > can still make the assertion fire.
> >
> > FWIW, I got it to fire with generic/095 from the xfstests test suite.
> 
> With...
> 
> Linux v5.9-rc6+ up to commit a1bf fa48 745a ("Merge tag 'scsi-fixes'
> of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi")
> ...and...
> 
>  xfstests-dev up to commit 75bd80f900ea ("src/t_mmap_dio: do not build
> if !HAVE_AIO")
> 
> ...I have seen in my first run of...
> 
> [ generic/095 ]
> 
> dileks@iniza:~/src/xfstests-dev/git$ sudo ./check generic/095
> FSTYP         -- ext4

There's the first problem in your setup; you need to be checking XFS
to see this problem.  ext4 doesn't use iomap for buffered IO yet.

> PLATFORM      -- Linux/x86_64 iniza 5.9.0-rc6-7-amd64-clang-cfi
> #7~bullseye+dileks1 SMP 2020-
> 09-27
> MKFS_OPTIONS  -- /dev/sdb1

I'm using "-m reflink=1,rmapbt=1 -i sparse=1 -b size=1024"
Sedat Dilek Sept. 27, 2020, 12:34 p.m. UTC | #32
On Sun, Sep 27, 2020 at 2:04 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sun, Sep 27, 2020 at 01:31:15PM +0200, Sedat Dilek wrote:
> > > I would suggest that you try applying just the assertion to Linus'
> > > kernel, then try to make it fire.  Then apply the fix and see if you
> > > can still make the assertion fire.
> > >
> > > FWIW, I got it to fire with generic/095 from the xfstests test suite.
> >
> > With...
> >
> > Linux v5.9-rc6+ up to commit a1bf fa48 745a ("Merge tag 'scsi-fixes'
> > of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi")
> > ...and...
> >
> >  xfstests-dev up to commit 75bd80f900ea ("src/t_mmap_dio: do not build
> > if !HAVE_AIO")
> >
> > ...I have seen in my first run of...
> >
> > [ generic/095 ]
> >
> > dileks@iniza:~/src/xfstests-dev/git$ sudo ./check generic/095
> > FSTYP         -- ext4
>
> There's the first problem in your setup; you need to be checking XFS
> to see this problem.  ext4 doesn't use iomap for buffered IO yet.
>
> > PLATFORM      -- Linux/x86_64 iniza 5.9.0-rc6-7-amd64-clang-cfi
> > #7~bullseye+dileks1 SMP 2020-
> > 09-27
> > MKFS_OPTIONS  -- /dev/sdb1
>
> I'm using "-m reflink=1,rmapbt=1 -i sparse=1 -b size=1024"
>

OK, with XFS and your recommended MKFS_OPTIONS I hit it once.

dileks@iniza:~/src/xfstests-dev/git$ sudo ./check generic/095
FSTYP         -- xfs (non-debug)
PLATFORM      -- Linux/x86_64 iniza 5.9.0-rc6-7-amd64-clang-cfi
#7~bullseye+dileks1 SMP 2020-
09-27
MKFS_OPTIONS  -- -f -m reflink=1,rmapbt=1 -i sparse=1 -b size=1024 /dev/sdb1
MOUNT_OPTIONS -- /dev/sdb1 /mnt/scratch

generic/095 19s ... [failed, exit status 1]- output mismatch (see
/home/dileks/src/xfstests-dev/git/results//generic/095.out.bad)
   --- tests/generic/095.out   2020-09-27 12:37:22.094208071 +0200
   +++ /home/dileks/src/xfstests-dev/git/results//generic/095.out.bad
2020-09-27 14:27:53.103222500 +0200
   @@ -1,2 +1,3 @@
    QA output created by 095
    Silence is golden
   +_check_dmesg: something found in dmesg (see
/home/dileks/src/xfstests-dev/git/results//generic/095.dmesg)
   ...
   (Run 'diff -u
/home/dileks/src/xfstests-dev/git/tests/generic/095.out
/home/dileks/src/xfstests-dev/git/results//generic/095.out.bad'  to
see the entire diff)
Ran: generic/095
Failures: generic/095
Failed 1 of 1 tests

dileks@iniza:~/src/xfstests-dev/git$ diff -u
/home/dileks/src/xfstests-dev/git/tests/generic/095.out
/home/dileks/src/xfstests-dev/git/results//generic/095.out.bad
--- /home/dileks/src/xfstests-dev/git/tests/generic/095.out
2020-09-27 12:37:22.094208071 +0200
+++ /home/dileks/src/xfstests-dev/git/results//generic/095.out.bad
 2020-09-27 14:27:53.103222500 +0200
@@ -1,2 +1,3 @@
QA output created by 095
Silence is golden
+_check_dmesg: something found in dmesg (see
/home/dileks/src/xfstests-dev/git/results//generic/095.dmesg)

dileks@iniza:~/src/xfstests-dev/git$ cat
/home/dileks/src/xfstests-dev/git/results//generic/095.dmesg
[ 2740.514321] run fstests generic/095 at 2020-09-27 14:27:34
[ 2746.274226] XFS (sdb1): Mounting V5 Filesystem
[ 2746.332612] XFS (sdb1): Ending clean mount
[ 2746.337449] xfs filesystem being mounted at /mnt/scratch supports
timestamps until 2038 (0x7fffffff)
[ 2747.790334] ------------[ cut here ]------------
[ 2747.790348] WARNING: CPU: 1 PID: 8678 at fs/iomap/buffered-io.c:77
iomap_page_release+0xd7/0x130
[ 2747.790350] Modules linked in: xfs(E) ppp_deflate(E) bsd_comp(E)
ppp_async(E) ppp_generic(E) slhc(E) bnep(E) snd_seq_dummy(E)
snd_hrtimer(E) snd_seq(E) snd_seq_device(E) fuse(E) intel_rapl
_msr(E) intel_rapl_common(E) btusb(E) uvcvideo(E)
x86_pkg_temp_thermal(E) snd_hda_codec_hdmi(E) intel_powerclamp(E)
btrtl(E) snd_hda_codec_realtek(E) btintel(E) videobuf2_vmalloc(E)
snd_hda_c
odec_generic(E) videobuf2_memops(E) btbcm(E) videobuf2_v4l2(E)
coretemp(E) bluetooth(E) videobuf2_common(E) iwldvm(E)
ledtrig_audio(E) kvm_intel(E) mac80211(E) videodev(E) snd_hda_intel(E)
op
tion(E) kvm(E) zram(E) i915(E) mc(E) usb_wwan(E) msr(E)
snd_intel_dspcfg(E) usbserial(E) cdc_ether(E) usbnet(E)
jitterentropy_rng(E) zsmalloc(E) snd_hda_codec(E) libarc4(E) mii(E)
irqbypass(E
) drbg(E) snd_hda_core(E) ghash_clmulni_intel(E) ansi_cprng(E)
ecdh_generic(E) iwlwifi(E) ecc(E) snd_hwdep(E) aesni_intel(E)
snd_pcm(E) cfg80211(E) libaes(E) at24(E) glue_helper(E) mei_me(E)
snd_timer(E) crypto_simd(E) iTCO_wdt(E)
[ 2747.790398]  samsung_laptop(E) drm_kms_helper(E) snd(E)
intel_pmc_bxt(E) cryptd(E) mei(E) sg(E) joydev(E) cec(E) evdev(E)
i2c_algo_bit(E) iTCO_vendor_support(E) rfkill(E) serio_raw(E) rapl
(E) soundcore(E) intel_cstate(E) watchdog(E) intel_uncore(E) ac(E)
pcspkr(E) button(E) binfmt_misc(E) parport_pc(E) ppdev(E) lp(E) drm(E)
parport(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E
) mbcache(E) crc16(E) jbd2(E) btrfs(E) raid6_pq(E) xor(E) libcrc32c(E)
crc32c_generic(E) sr_mod(E) cdrom(E) sd_mod(E) t10_pi(E) crc_t10dif(E)
crct10dif_generic(E) hid_generic(E) usbhid(E) hid
(E) uas(E) usb_storage(E) crct10dif_pclmul(E) crct10dif_common(E)
ahci(E) libahci(E) r8169(E) xhci_pci(E) crc32_pclmul(E) ehci_pci(E)
realtek(E) i2c_i801(E) xhci_hcd(E) ehci_hcd(E) libata(E)
crc32c_intel(E) psmouse(E) mdio_devres(E) lpc_ich(E) i2c_smbus(E)
scsi_mod(E) libphy(E) usbcore(E) fan(E) battery(E) wmi(E) video(E)
[ 2747.790452] CPU: 1 PID: 8678 Comm: fio Tainted: G            E
5.9.0-rc6-7-amd64-clang-cfi #7~bullseye+dileks1
[ 2747.790454] Hardware name: SAMSUNG ELECTRONICS CO., LTD.
530U3BI/530U4BI/530U4BH/530U3BI/530U4BI/530U4BH, BIOS 13XK 03/28/2013
[ 2747.790458] RIP: 0010:iomap_page_release+0xd7/0x130
[ 2747.790461] Code: 29 4c 89 f7 5b 41 5e 41 5f e9 05 62 f2 ff 5b 41
5e 41 5f c3 0f 0b 41 83 7e 04 00 74 af 0f 0b eb ab 48 83 c2 ff 49 89
d7 eb c4 <0f> 0b eb d3 48 83 c0 ff 48 89 c7 66 66 66
66 90 e9 5b ff ff ff 48
[ 2747.790463] RSP: 0018:ffffbf4bcc5d7988 EFLAGS: 00010297
[ 2747.790465] RAX: 0000000000000001 RBX: 0000000000000004 RCX: 0000000000000000
[ 2747.790466] RDX: ffffe561c50018c8 RSI: 0000000000000000 RDI: ffff987c5a007ad0
[ 2747.790468] RBP: ffff987c3817eea8 R08: ffff987c8eb54a80 R09: ffff987c8eb54a80
[ 2747.790469] R10: 0000000000000000 R11: ffffffffc1778290 R12: ffff987c3817eea8
[ 2747.790471] R13: 0000000000000000 R14: ffff987c5a007ac0 R15: ffffe561c5001c80
[ 2747.790473] FS:  00007f3beb952580(0000) GS:ffff987c97a40000(0000)
knlGS:0000000000000000
[ 2747.790475] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2747.790477] CR2: 00007f3b9edfa168 CR3: 000000013fe04003 CR4: 00000000000606e0
[ 2747.790478] Call Trace:
[ 2747.790485]  iomap_releasepage+0x64/0x120
[ 2747.790491]  invalidate_complete_page2+0x3b/0x190
[ 2747.790494]  invalidate_inode_pages2_range+0xf3/0x460
[ 2747.790498]  ? pagevec_lookup_range_tag+0x24/0x30
[ 2747.790502]  ? __filemap_fdatawait_range.llvm.8376122975785580473+0xa5/0x110
[ 2747.790505]  ? __filemap_fdatawrite_range+0x108/0x130
[ 2747.790509]  iomap_dio_rw+0x24f/0x450
[ 2747.790574]  xfs_file_dio_aio_write+0x227/0x340 [xfs]
[ 2747.790619]  ? xfs_file_read_iter+0x8/0x8 [xfs]
[ 2747.790661]  xfs_file_write_iter+0x9c/0xd0 [xfs]
[ 2747.790666]  aio_write+0x148/0x220
[ 2747.790670]  ? file_update_time+0x38/0x1f0
[ 2747.790674]  ? __check_object_size+0x5a/0x1e0
[ 2747.790676]  __io_submit_one+0x3c8/0x600
[ 2747.790679]  ? kmem_cache_alloc+0x116/0x290
[ 2747.790682]  io_submit_one+0x13c/0x520
[ 2747.790686]  __do_sys_io_submit+0x8a/0x180
[ 2747.790688]  ? __x64_sys_io_getevents+0x69/0xb0
[ 2747.790693]  ? __ia32_sys_pipe2.cfi_jt+0x8/0x8
[ 2747.790695]  do_syscall_64+0x53/0x90
[ 2747.790699]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 2747.790703] RIP: 0033:0x7f3bf53d1a79
[ 2747.790706] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00
00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24
08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e7 53
0c 00 f7 d8 64 89 01 48
[ 2747.790707] RSP: 002b:00007ffc214375b8 EFLAGS: 00000246 ORIG_RAX:
00000000000000d1
[ 2747.790710] RAX: ffffffffffffffda RBX: 00007f3beb950718 RCX: 00007f3bf53d1a79
[ 2747.790711] RDX: 000055e5c234f328 RSI: 0000000000000001 RDI: 00007f3beb919000
[ 2747.790712] RBP: 00007f3beb919000 R08: 0000000000000000 R09: 0000000000000008
[ 2747.790714] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
[ 2747.790715] R13: 0000000000000000 R14: 000055e5c234f328 R15: 000055e5c2339a20
[ 2747.790717] ---[ end trace b283df62bfe0869f ]---
[ 2758.555914] XFS (sdb1): Unmounting Filesystem

Again, I hit this one time.

I will try to reproduce and later test with your other patch.

- Sedat -

P.S.: Updated local.config with MKFS_OPTIONS recommended by willy

[ local.config ]
# Ideally define at least these 4 to match your environment
# The first 2 are required.
# See README for other variables which can be set.
#
# Note: SCRATCH_DEV >will< get overwritten!

export TEST_DEV=/dev/sdc3
##export TEST_DEV=/dev/sdb1
export TEST_DIR=/mnt/test
export SCRATCH_DEV=/dev/sdb1
##export SCRATCH_DEV=/dev/sdc5
export SCRATCH_MNT=/mnt/scratch
MKFS_OPTIONS="-m reflink=1,rmapbt=1 -i sparse=1 -b size=1024"
Sedat Dilek Sept. 27, 2020, 12:45 p.m. UTC | #33
On Sun, Sep 27, 2020 at 2:34 PM Sedat Dilek <sedat.dilek@gmail.com> wrote:
>
> On Sun, Sep 27, 2020 at 2:04 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Sun, Sep 27, 2020 at 01:31:15PM +0200, Sedat Dilek wrote:
> > > > I would suggest that you try applying just the assertion to Linus'
> > > > kernel, then try to make it fire.  Then apply the fix and see if you
> > > > can still make the assertion fire.
> > > >
> > > > FWIW, I got it to fire with generic/095 from the xfstests test suite.
> > >
> > > With...
> > >
> > > Linux v5.9-rc6+ up to commit a1bf fa48 745a ("Merge tag 'scsi-fixes'
> > > of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi")
> > > ...and...
> > >
> > >  xfstests-dev up to commit 75bd80f900ea ("src/t_mmap_dio: do not build
> > > if !HAVE_AIO")
> > >
> > > ...I have seen in my first run of...
> > >
> > > [ generic/095 ]
> > >
> > > dileks@iniza:~/src/xfstests-dev/git$ sudo ./check generic/095
> > > FSTYP         -- ext4
> >
> > There's the first problem in your setup; you need to be checking XFS
> > to see this problem.  ext4 doesn't use iomap for buffered IO yet.
> >
> > > PLATFORM      -- Linux/x86_64 iniza 5.9.0-rc6-7-amd64-clang-cfi
> > > #7~bullseye+dileks1 SMP 2020-
> > > 09-27
> > > MKFS_OPTIONS  -- /dev/sdb1
> >
> > I'm using "-m reflink=1,rmapbt=1 -i sparse=1 -b size=1024"
> >
>
> OK, with XFS and your recommended MKFS_OPTIONS I hit it once.
>
> dileks@iniza:~/src/xfstests-dev/git$ sudo ./check generic/095
> FSTYP         -- xfs (non-debug)
> PLATFORM      -- Linux/x86_64 iniza 5.9.0-rc6-7-amd64-clang-cfi
> #7~bullseye+dileks1 SMP 2020-
> 09-27
> MKFS_OPTIONS  -- -f -m reflink=1,rmapbt=1 -i sparse=1 -b size=1024 /dev/sdb1
> MOUNT_OPTIONS -- /dev/sdb1 /mnt/scratch
>
> generic/095 19s ... [failed, exit status 1]- output mismatch (see
> /home/dileks/src/xfstests-dev/git/results//generic/095.out.bad)
>    --- tests/generic/095.out   2020-09-27 12:37:22.094208071 +0200
>    +++ /home/dileks/src/xfstests-dev/git/results//generic/095.out.bad
> 2020-09-27 14:27:53.103222500 +0200
>    @@ -1,2 +1,3 @@
>     QA output created by 095
>     Silence is golden
>    +_check_dmesg: something found in dmesg (see
> /home/dileks/src/xfstests-dev/git/results//generic/095.dmesg)
>    ...
>    (Run 'diff -u
> /home/dileks/src/xfstests-dev/git/tests/generic/095.out
> /home/dileks/src/xfstests-dev/git/results//generic/095.out.bad'  to
> see the entire diff)
> Ran: generic/095
> Failures: generic/095
> Failed 1 of 1 tests
>
> dileks@iniza:~/src/xfstests-dev/git$ diff -u
> /home/dileks/src/xfstests-dev/git/tests/generic/095.out
> /home/dileks/src/xfstests-dev/git/results//generic/095.out.bad
> --- /home/dileks/src/xfstests-dev/git/tests/generic/095.out
> 2020-09-27 12:37:22.094208071 +0200
> +++ /home/dileks/src/xfstests-dev/git/results//generic/095.out.bad
>  2020-09-27 14:27:53.103222500 +0200
> @@ -1,2 +1,3 @@
> QA output created by 095
> Silence is golden
> +_check_dmesg: something found in dmesg (see
> /home/dileks/src/xfstests-dev/git/results//generic/095.dmesg)
>
> dileks@iniza:~/src/xfstests-dev/git$ cat
> /home/dileks/src/xfstests-dev/git/results//generic/095.dmesg
> [ 2740.514321] run fstests generic/095 at 2020-09-27 14:27:34
> [ 2746.274226] XFS (sdb1): Mounting V5 Filesystem
> [ 2746.332612] XFS (sdb1): Ending clean mount
> [ 2746.337449] xfs filesystem being mounted at /mnt/scratch supports
> timestamps until 2038 (0x7fffffff)
> [ 2747.790334] ------------[ cut here ]------------
> [ 2747.790348] WARNING: CPU: 1 PID: 8678 at fs/iomap/buffered-io.c:77
> iomap_page_release+0xd7/0x130
> [ 2747.790350] Modules linked in: xfs(E) ppp_deflate(E) bsd_comp(E)
> ppp_async(E) ppp_generic(E) slhc(E) bnep(E) snd_seq_dummy(E)
> snd_hrtimer(E) snd_seq(E) snd_seq_device(E) fuse(E) intel_rapl
> _msr(E) intel_rapl_common(E) btusb(E) uvcvideo(E)
> x86_pkg_temp_thermal(E) snd_hda_codec_hdmi(E) intel_powerclamp(E)
> btrtl(E) snd_hda_codec_realtek(E) btintel(E) videobuf2_vmalloc(E)
> snd_hda_c
> odec_generic(E) videobuf2_memops(E) btbcm(E) videobuf2_v4l2(E)
> coretemp(E) bluetooth(E) videobuf2_common(E) iwldvm(E)
> ledtrig_audio(E) kvm_intel(E) mac80211(E) videodev(E) snd_hda_intel(E)
> op
> tion(E) kvm(E) zram(E) i915(E) mc(E) usb_wwan(E) msr(E)
> snd_intel_dspcfg(E) usbserial(E) cdc_ether(E) usbnet(E)
> jitterentropy_rng(E) zsmalloc(E) snd_hda_codec(E) libarc4(E) mii(E)
> irqbypass(E
> ) drbg(E) snd_hda_core(E) ghash_clmulni_intel(E) ansi_cprng(E)
> ecdh_generic(E) iwlwifi(E) ecc(E) snd_hwdep(E) aesni_intel(E)
> snd_pcm(E) cfg80211(E) libaes(E) at24(E) glue_helper(E) mei_me(E)
> snd_timer(E) crypto_simd(E) iTCO_wdt(E)
> [ 2747.790398]  samsung_laptop(E) drm_kms_helper(E) snd(E)
> intel_pmc_bxt(E) cryptd(E) mei(E) sg(E) joydev(E) cec(E) evdev(E)
> i2c_algo_bit(E) iTCO_vendor_support(E) rfkill(E) serio_raw(E) rapl
> (E) soundcore(E) intel_cstate(E) watchdog(E) intel_uncore(E) ac(E)
> pcspkr(E) button(E) binfmt_misc(E) parport_pc(E) ppdev(E) lp(E) drm(E)
> parport(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E
> ) mbcache(E) crc16(E) jbd2(E) btrfs(E) raid6_pq(E) xor(E) libcrc32c(E)
> crc32c_generic(E) sr_mod(E) cdrom(E) sd_mod(E) t10_pi(E) crc_t10dif(E)
> crct10dif_generic(E) hid_generic(E) usbhid(E) hid
> (E) uas(E) usb_storage(E) crct10dif_pclmul(E) crct10dif_common(E)
> ahci(E) libahci(E) r8169(E) xhci_pci(E) crc32_pclmul(E) ehci_pci(E)
> realtek(E) i2c_i801(E) xhci_hcd(E) ehci_hcd(E) libata(E)
> crc32c_intel(E) psmouse(E) mdio_devres(E) lpc_ich(E) i2c_smbus(E)
> scsi_mod(E) libphy(E) usbcore(E) fan(E) battery(E) wmi(E) video(E)
> [ 2747.790452] CPU: 1 PID: 8678 Comm: fio Tainted: G            E
> 5.9.0-rc6-7-amd64-clang-cfi #7~bullseye+dileks1
> [ 2747.790454] Hardware name: SAMSUNG ELECTRONICS CO., LTD.
> 530U3BI/530U4BI/530U4BH/530U3BI/530U4BI/530U4BH, BIOS 13XK 03/28/2013
> [ 2747.790458] RIP: 0010:iomap_page_release+0xd7/0x130
> [ 2747.790461] Code: 29 4c 89 f7 5b 41 5e 41 5f e9 05 62 f2 ff 5b 41
> 5e 41 5f c3 0f 0b 41 83 7e 04 00 74 af 0f 0b eb ab 48 83 c2 ff 49 89
> d7 eb c4 <0f> 0b eb d3 48 83 c0 ff 48 89 c7 66 66 66
> 66 90 e9 5b ff ff ff 48
> [ 2747.790463] RSP: 0018:ffffbf4bcc5d7988 EFLAGS: 00010297
> [ 2747.790465] RAX: 0000000000000001 RBX: 0000000000000004 RCX: 0000000000000000
> [ 2747.790466] RDX: ffffe561c50018c8 RSI: 0000000000000000 RDI: ffff987c5a007ad0
> [ 2747.790468] RBP: ffff987c3817eea8 R08: ffff987c8eb54a80 R09: ffff987c8eb54a80
> [ 2747.790469] R10: 0000000000000000 R11: ffffffffc1778290 R12: ffff987c3817eea8
> [ 2747.790471] R13: 0000000000000000 R14: ffff987c5a007ac0 R15: ffffe561c5001c80
> [ 2747.790473] FS:  00007f3beb952580(0000) GS:ffff987c97a40000(0000)
> knlGS:0000000000000000
> [ 2747.790475] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2747.790477] CR2: 00007f3b9edfa168 CR3: 000000013fe04003 CR4: 00000000000606e0
> [ 2747.790478] Call Trace:
> [ 2747.790485]  iomap_releasepage+0x64/0x120
> [ 2747.790491]  invalidate_complete_page2+0x3b/0x190
> [ 2747.790494]  invalidate_inode_pages2_range+0xf3/0x460
> [ 2747.790498]  ? pagevec_lookup_range_tag+0x24/0x30
> [ 2747.790502]  ? __filemap_fdatawait_range.llvm.8376122975785580473+0xa5/0x110
> [ 2747.790505]  ? __filemap_fdatawrite_range+0x108/0x130
> [ 2747.790509]  iomap_dio_rw+0x24f/0x450
> [ 2747.790574]  xfs_file_dio_aio_write+0x227/0x340 [xfs]
> [ 2747.790619]  ? xfs_file_read_iter+0x8/0x8 [xfs]
> [ 2747.790661]  xfs_file_write_iter+0x9c/0xd0 [xfs]
> [ 2747.790666]  aio_write+0x148/0x220
> [ 2747.790670]  ? file_update_time+0x38/0x1f0
> [ 2747.790674]  ? __check_object_size+0x5a/0x1e0
> [ 2747.790676]  __io_submit_one+0x3c8/0x600
> [ 2747.790679]  ? kmem_cache_alloc+0x116/0x290
> [ 2747.790682]  io_submit_one+0x13c/0x520
> [ 2747.790686]  __do_sys_io_submit+0x8a/0x180
> [ 2747.790688]  ? __x64_sys_io_getevents+0x69/0xb0
> [ 2747.790693]  ? __ia32_sys_pipe2.cfi_jt+0x8/0x8
> [ 2747.790695]  do_syscall_64+0x53/0x90
> [ 2747.790699]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 2747.790703] RIP: 0033:0x7f3bf53d1a79
> [ 2747.790706] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00
> 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24
> 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e7 53
> 0c 00 f7 d8 64 89 01 48
> [ 2747.790707] RSP: 002b:00007ffc214375b8 EFLAGS: 00000246 ORIG_RAX:
> 00000000000000d1
> [ 2747.790710] RAX: ffffffffffffffda RBX: 00007f3beb950718 RCX: 00007f3bf53d1a79
> [ 2747.790711] RDX: 000055e5c234f328 RSI: 0000000000000001 RDI: 00007f3beb919000
> [ 2747.790712] RBP: 00007f3beb919000 R08: 0000000000000000 R09: 0000000000000008
> [ 2747.790714] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
> [ 2747.790715] R13: 0000000000000000 R14: 000055e5c234f328 R15: 000055e5c2339a20
> [ 2747.790717] ---[ end trace b283df62bfe0869f ]---
> [ 2758.555914] XFS (sdb1): Unmounting Filesystem
>
> Again, I hit this one time.
>
> I will try to reproduce and later test with your other patch.
>

2x reproduced
total hits: 3

- Sedat -

>
> P.S.: Updated local.config with MKFS_OPTIONS recommended by willy
>
> [ local.config ]
> # Ideally define at least these 4 to match your environment
> # The first 2 are required.
> # See README for other variables which can be set.
> #
> # Note: SCRATCH_DEV >will< get overwritten!
>
> export TEST_DEV=/dev/sdc3
> ##export TEST_DEV=/dev/sdb1
> export TEST_DIR=/mnt/test
> export SCRATCH_DEV=/dev/sdb1
> ##export SCRATCH_DEV=/dev/sdc5
> export SCRATCH_MNT=/mnt/scratch
> MKFS_OPTIONS="-m reflink=1,rmapbt=1 -i sparse=1 -b size=1024"
Sedat Dilek Sept. 27, 2020, 1:48 p.m. UTC | #34
On Sun, Sep 27, 2020 at 2:04 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sun, Sep 27, 2020 at 01:31:15PM +0200, Sedat Dilek wrote:
> > > I would suggest that you try applying just the assertion to Linus'
> > > kernel, then try to make it fire.  Then apply the fix and see if you
> > > can still make the assertion fire.
> > >
> > > FWIW, I got it to fire with generic/095 from the xfstests test suite.
> >
> > With...
> >
> > Linux v5.9-rc6+ up to commit a1bf fa48 745a ("Merge tag 'scsi-fixes'
> > of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi")
> > ...and...
> >
> >  xfstests-dev up to commit 75bd80f900ea ("src/t_mmap_dio: do not build
> > if !HAVE_AIO")
> >
> > ...I have seen in my first run of...
> >
> > [ generic/095 ]
> >
> > dileks@iniza:~/src/xfstests-dev/git$ sudo ./check generic/095
> > FSTYP         -- ext4
>
> There's the first problem in your setup; you need to be checking XFS
> to see this problem.  ext4 doesn't use iomap for buffered IO yet.
>
> > PLATFORM      -- Linux/x86_64 iniza 5.9.0-rc6-7-amd64-clang-cfi
> > #7~bullseye+dileks1 SMP 2020-
> > 09-27
> > MKFS_OPTIONS  -- /dev/sdb1
>
> I'm using "-m reflink=1,rmapbt=1 -i sparse=1 -b size=1024"
>

Sorry, for being pedantic...

With your patch and assertion diff I hit the same issue like with Ext4-FS...

[So Sep 27 15:40:18 2020] run fstests generic/095 at 2020-09-27 15:40:19
[So Sep 27 15:40:26 2020] XFS (sdb1): Mounting V5 Filesystem
[So Sep 27 15:40:26 2020] XFS (sdb1): Ending clean mount
[So Sep 27 15:40:26 2020] xfs filesystem being mounted at /mnt/scratch
supports timestamps until 2038 (0x7fffffff)
[So Sep 27 15:40:28 2020] Page cache invalidation failure on direct
I/O.  Possible data corruption due to collision with buffered I/O!
[So Sep 27 15:40:28 2020] File: /mnt/scratch/file1 PID: 12 Comm: kworker/0:1
[So Sep 27 15:40:29 2020] Page cache invalidation failure on direct
I/O.  Possible data corruption due to collision with buffered I/O!
[So Sep 27 15:40:29 2020] File: /mnt/scratch/file1 PID: 73 Comm: kworker/0:2
[So Sep 27 15:40:30 2020] Page cache invalidation failure on direct
I/O.  Possible data corruption due to collision with buffered I/O!
[So Sep 27 15:40:30 2020] File: /mnt/scratch/file2 PID: 12 Comm: kworker/0:1
[So Sep 27 15:40:30 2020] Page cache invalidation failure on direct
I/O.  Possible data corruption due to collision with buffered I/O!
[So Sep 27 15:40:30 2020] File: /mnt/scratch/file2 PID: 3271 Comm: fio
[So Sep 27 15:40:30 2020] Page cache invalidation failure on direct
I/O.  Possible data corruption due to collision with buffered I/O!
[So Sep 27 15:40:30 2020] File: /mnt/scratch/file2 PID: 3273 Comm: fio
[So Sep 27 15:40:31 2020] Page cache invalidation failure on direct
I/O.  Possible data corruption due to collision with buffered I/O!
[So Sep 27 15:40:31 2020] File: /mnt/scratch/file1 PID: 3308 Comm: fio
[So Sep 27 15:40:36 2020] Page cache invalidation failure on direct
I/O.  Possible data corruption due to collision with buffered I/O!
[So Sep 27 15:40:36 2020] File: /mnt/scratch/file1 PID: 73 Comm: kworker/0:2
[So Sep 27 15:40:43 2020] Page cache invalidation failure on direct
I/O.  Possible data corruption due to collision with buffered I/O!
[So Sep 27 15:40:43 2020] File: /mnt/scratch/file1 PID: 73 Comm: kworker/0:2
[So Sep 27 15:40:52 2020] Page cache invalidation failure on direct
I/O.  Possible data corruption due to collision with buffered I/O!
[So Sep 27 15:40:52 2020] File: /mnt/scratch/file2 PID: 73 Comm: kworker/0:2
[So Sep 27 15:40:56 2020] Page cache invalidation failure on direct
I/O.  Possible data corruption due to collision with buffered I/O!
[So Sep 27 15:40:56 2020] File: /mnt/scratch/file2 PID: 12 Comm: kworker/0:1

Is that a different issue?

- Sedat -

P.S.: Current git diff

dileks@iniza:~/src/linux-kernel/git$ git diff
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index bcfc288dba3f..b421e4efc4bd 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -53,7 +53,10 @@ iomap_page_create(struct inode *inode, struct page *page)
       atomic_set(&iop->read_count, 0);
       atomic_set(&iop->write_count, 0);
       spin_lock_init(&iop->uptodate_lock);
-       bitmap_zero(iop->uptodate, PAGE_SIZE / SECTOR_SIZE);
+       if (PageUptodate(page))
+               bitmap_fill(iop->uptodate, PAGE_SIZE / SECTOR_SIZE);
+       else
+               bitmap_zero(iop->uptodate, PAGE_SIZE / SECTOR_SIZE);

       /*
        * migrate_page_move_mapping() assumes that pages with private data have
@@ -67,11 +70,15 @@ static void
iomap_page_release(struct page *page)
{
       struct iomap_page *iop = detach_page_private(page);
+       unsigned int nr_blocks = PAGE_SIZE / i_blocksize(page->mapping->host);

       if (!iop)
               return;
       WARN_ON_ONCE(atomic_read(&iop->read_count));
       WARN_ON_ONCE(atomic_read(&iop->write_count));
+       WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
+                       PageUptodate(page));
+
       kfree(iop);
}

- EOT -
Matthew Wilcox (Oracle) Sept. 27, 2020, 1:54 p.m. UTC | #35
On Sun, Sep 27, 2020 at 03:48:39PM +0200, Sedat Dilek wrote:
> With your patch and assertion diff I hit the same issue like with Ext4-FS...
> 
> [So Sep 27 15:40:18 2020] run fstests generic/095 at 2020-09-27 15:40:19
> [So Sep 27 15:40:26 2020] XFS (sdb1): Mounting V5 Filesystem
> [So Sep 27 15:40:26 2020] XFS (sdb1): Ending clean mount
> [So Sep 27 15:40:26 2020] xfs filesystem being mounted at /mnt/scratch
> supports timestamps until 2038 (0x7fffffff)
> [So Sep 27 15:40:28 2020] Page cache invalidation failure on direct
> I/O.  Possible data corruption due to collision with buffered I/O!
> [So Sep 27 15:40:28 2020] File: /mnt/scratch/file1 PID: 12 Comm: kworker/0:1
> [So Sep 27 15:40:29 2020] Page cache invalidation failure on direct
> I/O.  Possible data corruption due to collision with buffered I/O!
> [So Sep 27 15:40:29 2020] File: /mnt/scratch/file1 PID: 73 Comm: kworker/0:2
> [So Sep 27 15:40:30 2020] Page cache invalidation failure on direct
> I/O.  Possible data corruption due to collision with buffered I/O!
> [So Sep 27 15:40:30 2020] File: /mnt/scratch/file2 PID: 12 Comm: kworker/0:1
> [So Sep 27 15:40:30 2020] Page cache invalidation failure on direct
> I/O.  Possible data corruption due to collision with buffered I/O!
> [So Sep 27 15:40:30 2020] File: /mnt/scratch/file2 PID: 3271 Comm: fio
> [So Sep 27 15:40:30 2020] Page cache invalidation failure on direct
> I/O.  Possible data corruption due to collision with buffered I/O!
> [So Sep 27 15:40:30 2020] File: /mnt/scratch/file2 PID: 3273 Comm: fio
> [So Sep 27 15:40:31 2020] Page cache invalidation failure on direct
> I/O.  Possible data corruption due to collision with buffered I/O!
> [So Sep 27 15:40:31 2020] File: /mnt/scratch/file1 PID: 3308 Comm: fio
> [So Sep 27 15:40:36 2020] Page cache invalidation failure on direct
> I/O.  Possible data corruption due to collision with buffered I/O!
> [So Sep 27 15:40:36 2020] File: /mnt/scratch/file1 PID: 73 Comm: kworker/0:2
> [So Sep 27 15:40:43 2020] Page cache invalidation failure on direct
> I/O.  Possible data corruption due to collision with buffered I/O!
> [So Sep 27 15:40:43 2020] File: /mnt/scratch/file1 PID: 73 Comm: kworker/0:2
> [So Sep 27 15:40:52 2020] Page cache invalidation failure on direct
> I/O.  Possible data corruption due to collision with buffered I/O!
> [So Sep 27 15:40:52 2020] File: /mnt/scratch/file2 PID: 73 Comm: kworker/0:2
> [So Sep 27 15:40:56 2020] Page cache invalidation failure on direct
> I/O.  Possible data corruption due to collision with buffered I/O!
> [So Sep 27 15:40:56 2020] File: /mnt/scratch/file2 PID: 12 Comm: kworker/0:1
> 
> Is that a different issue?

The test is expected to emit those messages; userspace has done something
so utterly bonkers (direct I/O to an mmaped, mlocked page) that we can't
provide the normal guarantees of data integrity.
Sedat Dilek Sept. 27, 2020, 2:02 p.m. UTC | #36
On Sun, Sep 27, 2020 at 3:54 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sun, Sep 27, 2020 at 03:48:39PM +0200, Sedat Dilek wrote:
> > With your patch and assertion diff I hit the same issue like with Ext4-FS...
> >
> > [So Sep 27 15:40:18 2020] run fstests generic/095 at 2020-09-27 15:40:19
> > [So Sep 27 15:40:26 2020] XFS (sdb1): Mounting V5 Filesystem
> > [So Sep 27 15:40:26 2020] XFS (sdb1): Ending clean mount
> > [So Sep 27 15:40:26 2020] xfs filesystem being mounted at /mnt/scratch
> > supports timestamps until 2038 (0x7fffffff)
> > [So Sep 27 15:40:28 2020] Page cache invalidation failure on direct
> > I/O.  Possible data corruption due to collision with buffered I/O!
> > [So Sep 27 15:40:28 2020] File: /mnt/scratch/file1 PID: 12 Comm: kworker/0:1
> > [So Sep 27 15:40:29 2020] Page cache invalidation failure on direct
> > I/O.  Possible data corruption due to collision with buffered I/O!
> > [So Sep 27 15:40:29 2020] File: /mnt/scratch/file1 PID: 73 Comm: kworker/0:2
> > [So Sep 27 15:40:30 2020] Page cache invalidation failure on direct
> > I/O.  Possible data corruption due to collision with buffered I/O!
> > [So Sep 27 15:40:30 2020] File: /mnt/scratch/file2 PID: 12 Comm: kworker/0:1
> > [So Sep 27 15:40:30 2020] Page cache invalidation failure on direct
> > I/O.  Possible data corruption due to collision with buffered I/O!
> > [So Sep 27 15:40:30 2020] File: /mnt/scratch/file2 PID: 3271 Comm: fio
> > [So Sep 27 15:40:30 2020] Page cache invalidation failure on direct
> > I/O.  Possible data corruption due to collision with buffered I/O!
> > [So Sep 27 15:40:30 2020] File: /mnt/scratch/file2 PID: 3273 Comm: fio
> > [So Sep 27 15:40:31 2020] Page cache invalidation failure on direct
> > I/O.  Possible data corruption due to collision with buffered I/O!
> > [So Sep 27 15:40:31 2020] File: /mnt/scratch/file1 PID: 3308 Comm: fio
> > [So Sep 27 15:40:36 2020] Page cache invalidation failure on direct
> > I/O.  Possible data corruption due to collision with buffered I/O!
> > [So Sep 27 15:40:36 2020] File: /mnt/scratch/file1 PID: 73 Comm: kworker/0:2
> > [So Sep 27 15:40:43 2020] Page cache invalidation failure on direct
> > I/O.  Possible data corruption due to collision with buffered I/O!
> > [So Sep 27 15:40:43 2020] File: /mnt/scratch/file1 PID: 73 Comm: kworker/0:2
> > [So Sep 27 15:40:52 2020] Page cache invalidation failure on direct
> > I/O.  Possible data corruption due to collision with buffered I/O!
> > [So Sep 27 15:40:52 2020] File: /mnt/scratch/file2 PID: 73 Comm: kworker/0:2
> > [So Sep 27 15:40:56 2020] Page cache invalidation failure on direct
> > I/O.  Possible data corruption due to collision with buffered I/O!
> > [So Sep 27 15:40:56 2020] File: /mnt/scratch/file2 PID: 12 Comm: kworker/0:1
> >
> > Is that a different issue?
>
> The test is expected to emit those messages; userspace has done something
> so utterly bonkers (direct I/O to an mmaped, mlocked page) that we can't
> provide the normal guarantees of data integrity.

Thanks for the explanations.

Will run 25 iterations with your iomap patch and assertion-diff using
XFS and your MKFS_OPTIONS="-m reflink=1,rmapbt=1 -i sparse=1 -b
size=1024"...

$ sudo ./check -i 25 generic/095

...and report later.

- Sedat -
Sedat Dilek Sept. 27, 2020, 3:19 p.m. UTC | #37
On Sun, Sep 27, 2020 at 4:02 PM Sedat Dilek <sedat.dilek@gmail.com> wrote:
>
> On Sun, Sep 27, 2020 at 3:54 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Sun, Sep 27, 2020 at 03:48:39PM +0200, Sedat Dilek wrote:
> > > With your patch and assertion diff I hit the same issue like with Ext4-FS...
> > >
> > > [So Sep 27 15:40:18 2020] run fstests generic/095 at 2020-09-27 15:40:19
> > > [So Sep 27 15:40:26 2020] XFS (sdb1): Mounting V5 Filesystem
> > > [So Sep 27 15:40:26 2020] XFS (sdb1): Ending clean mount
> > > [So Sep 27 15:40:26 2020] xfs filesystem being mounted at /mnt/scratch
> > > supports timestamps until 2038 (0x7fffffff)
> > > [So Sep 27 15:40:28 2020] Page cache invalidation failure on direct
> > > I/O.  Possible data corruption due to collision with buffered I/O!
> > > [So Sep 27 15:40:28 2020] File: /mnt/scratch/file1 PID: 12 Comm: kworker/0:1
> > > [So Sep 27 15:40:29 2020] Page cache invalidation failure on direct
> > > I/O.  Possible data corruption due to collision with buffered I/O!
> > > [So Sep 27 15:40:29 2020] File: /mnt/scratch/file1 PID: 73 Comm: kworker/0:2
> > > [So Sep 27 15:40:30 2020] Page cache invalidation failure on direct
> > > I/O.  Possible data corruption due to collision with buffered I/O!
> > > [So Sep 27 15:40:30 2020] File: /mnt/scratch/file2 PID: 12 Comm: kworker/0:1
> > > [So Sep 27 15:40:30 2020] Page cache invalidation failure on direct
> > > I/O.  Possible data corruption due to collision with buffered I/O!
> > > [So Sep 27 15:40:30 2020] File: /mnt/scratch/file2 PID: 3271 Comm: fio
> > > [So Sep 27 15:40:30 2020] Page cache invalidation failure on direct
> > > I/O.  Possible data corruption due to collision with buffered I/O!
> > > [So Sep 27 15:40:30 2020] File: /mnt/scratch/file2 PID: 3273 Comm: fio
> > > [So Sep 27 15:40:31 2020] Page cache invalidation failure on direct
> > > I/O.  Possible data corruption due to collision with buffered I/O!
> > > [So Sep 27 15:40:31 2020] File: /mnt/scratch/file1 PID: 3308 Comm: fio
> > > [So Sep 27 15:40:36 2020] Page cache invalidation failure on direct
> > > I/O.  Possible data corruption due to collision with buffered I/O!
> > > [So Sep 27 15:40:36 2020] File: /mnt/scratch/file1 PID: 73 Comm: kworker/0:2
> > > [So Sep 27 15:40:43 2020] Page cache invalidation failure on direct
> > > I/O.  Possible data corruption due to collision with buffered I/O!
> > > [So Sep 27 15:40:43 2020] File: /mnt/scratch/file1 PID: 73 Comm: kworker/0:2
> > > [So Sep 27 15:40:52 2020] Page cache invalidation failure on direct
> > > I/O.  Possible data corruption due to collision with buffered I/O!
> > > [So Sep 27 15:40:52 2020] File: /mnt/scratch/file2 PID: 73 Comm: kworker/0:2
> > > [So Sep 27 15:40:56 2020] Page cache invalidation failure on direct
> > > I/O.  Possible data corruption due to collision with buffered I/O!
> > > [So Sep 27 15:40:56 2020] File: /mnt/scratch/file2 PID: 12 Comm: kworker/0:1
> > >
> > > Is that a different issue?
> >
> > The test is expected to emit those messages; userspace has done something
> > so utterly bonkers (direct I/O to an mmaped, mlocked page) that we can't
> > provide the normal guarantees of data integrity.
>
> Thanks for the explanations.
>
> Will run 25 iterations with your iomap patch and assertion-diff using
> XFS and your MKFS_OPTIONS="-m reflink=1,rmapbt=1 -i sparse=1 -b
> size=1024"...
>
> $ sudo ./check -i 25 generic/095
>
> ...and report later.
>

Hi Matthew,

After one hour all 25 iterations passed fine - no issues.

If you send a patch for Linux v5.9 feel free to add my...

     Tested-by: Sedat Dilek <sedat.dilek@gmail.com>

Thanks for your patience and all the background information.

Regards,
- Sedat -
Christoph Hellwig Sept. 28, 2020, 6:41 a.m. UTC | #38
On Thu, Sep 24, 2020 at 01:56:08PM +0100, Matthew Wilcox (Oracle) wrote:
> For filesystems with block size < page size, we need to set all the
> per-block uptodate bits if the page was already uptodate at the time
> we create the per-block metadata.  This can happen if the page is
> invalidated (eg by a write to drop_caches) but ultimately not removed
> from the page cache.
> 
> This is a data corruption issue as page writeback skips blocks which
> are marked !uptodate.
> 
> Fixes: 9dc55f1389f9 ("iomap: add support for sub-pagesize buffered I/O without buffer heads")
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Reported-by: Qian Cai <cai@redhat.com>
> Cc: Brian Foster <bfoster@redhat.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>
Sedat Dilek Oct. 3, 2020, 6:52 p.m. UTC | #39
On Fri, Sep 25, 2020 at 1:57 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Sep 24, 2020 at 10:04:40PM +0200, Sedat Dilek wrote:
> > On Thu, Sep 24, 2020 at 10:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Thu, Sep 24, 2020 at 09:54:36PM +0200, Sedat Dilek wrote:
> > > > You are named in "mm: fix misplaced unlock_page in do_wp_page()".
> > > > Is this here a different issue?
> > >
> > > Yes, completely different.  That bug is one Linus introduced in this
> > > cycle; the bug that this patch fixes was introduced a couple of years
> > > ago, and we only noticed now because I added an assertion to -next.
> > > Maybe I should add the assertion for 5.9 too.
> >
> > Can you point me to this "assertion"?
> > Thanks.
>
> Here's the version against 5.8
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 810f7dae11d9..b421e4efc4bd 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -70,11 +70,15 @@ static void
>  iomap_page_release(struct page *page)
>  {
>         struct iomap_page *iop = detach_page_private(page);
> +       unsigned int nr_blocks = PAGE_SIZE / i_blocksize(page->mapping->host);
>
>         if (!iop)
>                 return;
>         WARN_ON_ONCE(atomic_read(&iop->read_count));
>         WARN_ON_ONCE(atomic_read(&iop->write_count));
> +       WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
> +                       PageUptodate(page));
> +
>         kfree(iop);
>  }
>

Will you send this as a separate patch or fold into the original?

I have tested the original patch plus this (debug) assertion diff on
top of Linux v5.9-rc7.

- Sedat -
Matthew Wilcox (Oracle) Oct. 4, 2020, 4:13 a.m. UTC | #40
On Sat, Oct 03, 2020 at 08:52:55PM +0200, Sedat Dilek wrote:
> Will you send this as a separate patch or fold into the original?
> 
> I have tested the original patch plus this (debug) assertion diff on
> top of Linux v5.9-rc7.

I'm going to wait for the merge window to open, Darrick's tree to be
merged, then send a backport of all the accumulated fixes to Greg for
the 5.9-stable tree.  I'm also going to do a 5.4 backport.
Sedat Dilek Oct. 4, 2020, 10:35 a.m. UTC | #41
On Sun, Oct 4, 2020 at 6:13 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sat, Oct 03, 2020 at 08:52:55PM +0200, Sedat Dilek wrote:
> > Will you send this as a separate patch or fold into the original?
> >
> > I have tested the original patch plus this (debug) assertion diff on
> > top of Linux v5.9-rc7.
>
> I'm going to wait for the merge window to open, Darrick's tree to be
> merged, then send a backport of all the accumulated fixes to Greg for
> the 5.9-stable tree.  I'm also going to do a 5.4 backport.

OK.

In "iomap: Support arbitrarily many blocks per page" [1] the debugging
assertions are folded in, so I hope you will do this in the "backport
of all the accumulated fixes".

- Sedat -

[1] https://git.infradead.org/users/willy/pagecache.git/commitdiff/e799a897db4b1913a2f5279a9ee1b342bda05a98
diff mbox series

Patch

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 8b6cca7e34e4..8180061b9e16 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -60,6 +60,8 @@  iomap_page_create(struct inode *inode, struct page *page)
 	iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
 			GFP_NOFS | __GFP_NOFAIL);
 	spin_lock_init(&iop->uptodate_lock);
+	if (PageUptodate(page))
+		bitmap_fill(iop->uptodate, nr_blocks);
 	attach_page_private(page, iop);
 	return iop;
 }