mbox series

[0/2] iomap: small block problems

Message ID 20210628172727.1894503-1-agruenba@redhat.com (mailing list archive)
Headers show
Series iomap: small block problems | expand

Message

Andreas Gruenbacher June 28, 2021, 5:27 p.m. UTC
Hi Christoph and all,

the recent gfs2 switch from buffer heads to iomap for managing the
block-to-page mappings in commit 2164f9b91869 ("gfs2: use iomap for
buffered I/O in ordered and writeback mode") broke filesystems with a
block size smaller than the page size.  This haüüens when iomap_page
objects aren't attached to pages in all the right circumstances, or the
iop->upodate bitmap isn't kept in sync with the PageUptodata flag.
There are three instances of this problem:

(1) In iomap_readpage_actor, an iomap_page is attached to the page even
for inline inodes.  This is unnecessary because inline inodes don't need
iomap_page objects.  That alone wouldn't cause any real issues, but when
iomap_read_inline_data copies the inline data into the page, it sets the
PageUptodate flag without setting iop->uptodate, causing an
inconsistency between the two.  This will trigger a WARN_ON in
iomap_page_release.  The fix should be not to allocate iomap_page
objects when reading from inline inodes (patch 1).

(2) When un-inlining an inode, we must allocate a page with an attached
iomap_page object (iomap_page_create) and initialize the iop->uptodate
bitmap (iomap_set_range_uptodate).  We can't currently do that because
iomap_page_create and iomap_set_range_uptodate are not exported.  That
could be fixed by exporting those functions, or by implementing an
additional helper as in patch 2.  Which of the two would you prefer?

(3) We're not yet using iomap_page_mkwrite, so iomap_page objects don't
get created on .page_mkwrite, either.  Part of the reason is that
iomap_page_mkwrite locks the page and then calls into the filesystem for
uninlining and for allocating backing blocks.  This conflicts with the
gfs2 locking order: on gfs2, transactions must be started before locking
any pages.  We can fix that by calling iomap_page_create from
gfs2_page_mkwrite, or by doing the uninlining and allocations before
calling iomap_page_mkwrite.  I've implemented option 2 for now; see
here:

https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git/log/?h=for-next.iomap-page-mkwrite

Any thoughts?

Thanks,
Andreas

Andreas Gruenbacher (1):
  iomap: Don't create iomap_page objects for inline files

Bob Peterson (1):
  iomap: Add helper for un-inlining an inline inode

 fs/iomap/buffered-io.c | 28 +++++++++++++++++++++++++++-
 include/linux/iomap.h  |  2 ++
 2 files changed, 29 insertions(+), 1 deletion(-)

Comments

Matthew Wilcox June 28, 2021, 5:39 p.m. UTC | #1
On Mon, Jun 28, 2021 at 07:27:25PM +0200, Andreas Gruenbacher wrote:
> (1) In iomap_readpage_actor, an iomap_page is attached to the page even
> for inline inodes.  This is unnecessary because inline inodes don't need
> iomap_page objects.  That alone wouldn't cause any real issues, but when
> iomap_read_inline_data copies the inline data into the page, it sets the
> PageUptodate flag without setting iop->uptodate, causing an
> inconsistency between the two.  This will trigger a WARN_ON in
> iomap_page_release.  The fix should be not to allocate iomap_page
> objects when reading from inline inodes (patch 1).

I don't have a problem with this patch.

> (2) When un-inlining an inode, we must allocate a page with an attached
> iomap_page object (iomap_page_create) and initialize the iop->uptodate
> bitmap (iomap_set_range_uptodate).  We can't currently do that because
> iomap_page_create and iomap_set_range_uptodate are not exported.  That
> could be fixed by exporting those functions, or by implementing an
> additional helper as in patch 2.  Which of the two would you prefer?

Not hugely happy with either of these options, tbh.  I'd rather we apply
a patch akin to this one (plucked from the folio tree), so won't apply:

@@ -1305,7 +1311,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
                struct writeback_control *wbc, struct inode *inode,
                struct folio *folio, loff_t end_pos)
 {
-       struct iomap_page *iop = to_iomap_page(folio);
+       struct iomap_page *iop = iomap_page_create(inode, folio);
        struct iomap_ioend *ioend, *next;
        unsigned len = i_blocksize(inode);
        unsigned nblocks = i_blocks_per_folio(inode, folio);
@@ -1313,7 +1319,6 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
        int error = 0, count = 0, i;
        LIST_HEAD(submit_list);

-       WARN_ON_ONCE(nblocks > 1 && !iop);
        WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) != 0);

        /*

so permit pages without an iop to enter writeback and create an iop
*then*.  Would that solve your problem?

> (3) We're not yet using iomap_page_mkwrite, so iomap_page objects don't
> get created on .page_mkwrite, either.  Part of the reason is that
> iomap_page_mkwrite locks the page and then calls into the filesystem for
> uninlining and for allocating backing blocks.  This conflicts with the
> gfs2 locking order: on gfs2, transactions must be started before locking
> any pages.  We can fix that by calling iomap_page_create from
> gfs2_page_mkwrite, or by doing the uninlining and allocations before
> calling iomap_page_mkwrite.  I've implemented option 2 for now; see
> here:

I think this might also solve this problem?
Christoph Hellwig June 28, 2021, 5:47 p.m. UTC | #2
On Mon, Jun 28, 2021 at 06:39:09PM +0100, Matthew Wilcox wrote:
> Not hugely happy with either of these options, tbh.  I'd rather we apply
> a patch akin to this one (plucked from the folio tree), so won't apply:

> so permit pages without an iop to enter writeback and create an iop
> *then*.  Would that solve your problem?

It is the right thing to do, especially when combined with a feature
patch to not bother to create the iomap_page structure on small
block size file systems when the extent covers the whole page.

> 
> > (3) We're not yet using iomap_page_mkwrite, so iomap_page objects don't
> > get created on .page_mkwrite, either.  Part of the reason is that
> > iomap_page_mkwrite locks the page and then calls into the filesystem for
> > uninlining and for allocating backing blocks.  This conflicts with the
> > gfs2 locking order: on gfs2, transactions must be started before locking
> > any pages.  We can fix that by calling iomap_page_create from
> > gfs2_page_mkwrite, or by doing the uninlining and allocations before
> > calling iomap_page_mkwrite.  I've implemented option 2 for now; see
> > here:
> 
> I think this might also solve this problem?

We'll still need to create the iomap_page structure for page_mkwrite
if there is an extent boundary inside the page.
Andreas Gruenbacher June 28, 2021, 5:55 p.m. UTC | #3
On Mon, Jun 28, 2021 at 7:40 PM Matthew Wilcox <willy@infradead.org> wrote:
> On Mon, Jun 28, 2021 at 07:27:25PM +0200, Andreas Gruenbacher wrote:
> > (1) In iomap_readpage_actor, an iomap_page is attached to the page even
> > for inline inodes.  This is unnecessary because inline inodes don't need
> > iomap_page objects.  That alone wouldn't cause any real issues, but when
> > iomap_read_inline_data copies the inline data into the page, it sets the
> > PageUptodate flag without setting iop->uptodate, causing an
> > inconsistency between the two.  This will trigger a WARN_ON in
> > iomap_page_release.  The fix should be not to allocate iomap_page
> > objects when reading from inline inodes (patch 1).
>
> I don't have a problem with this patch.
>
> > (2) When un-inlining an inode, we must allocate a page with an attached
> > iomap_page object (iomap_page_create) and initialize the iop->uptodate
> > bitmap (iomap_set_range_uptodate).  We can't currently do that because
> > iomap_page_create and iomap_set_range_uptodate are not exported.  That
> > could be fixed by exporting those functions, or by implementing an
> > additional helper as in patch 2.  Which of the two would you prefer?
>
> Not hugely happy with either of these options, tbh.  I'd rather we apply
> a patch akin to this one (plucked from the folio tree), so won't apply:
>
> @@ -1305,7 +1311,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>                 struct writeback_control *wbc, struct inode *inode,
>                 struct folio *folio, loff_t end_pos)
>  {
> -       struct iomap_page *iop = to_iomap_page(folio);
> +       struct iomap_page *iop = iomap_page_create(inode, folio);
>         struct iomap_ioend *ioend, *next;
>         unsigned len = i_blocksize(inode);
>         unsigned nblocks = i_blocks_per_folio(inode, folio);
> @@ -1313,7 +1319,6 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>         int error = 0, count = 0, i;
>         LIST_HEAD(submit_list);
>
> -       WARN_ON_ONCE(nblocks > 1 && !iop);
>         WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) != 0);
>
>         /*
>
> so permit pages without an iop to enter writeback and create an iop
> *then*.  Would that solve your problem?

It probably would. Let me do some testing based on that.

> > (3) We're not yet using iomap_page_mkwrite, so iomap_page objects don't
> > get created on .page_mkwrite, either.  Part of the reason is that
> > iomap_page_mkwrite locks the page and then calls into the filesystem for
> > uninlining and for allocating backing blocks.  This conflicts with the
> > gfs2 locking order: on gfs2, transactions must be started before locking
> > any pages.  We can fix that by calling iomap_page_create from
> > gfs2_page_mkwrite, or by doing the uninlining and allocations before
> > calling iomap_page_mkwrite.  I've implemented option 2 for now; see
> > here:
>
> I think this might also solve this problem?

Probably yes.

Thanks,
Andreas
Andreas Gruenbacher June 28, 2021, 9:28 p.m. UTC | #4
On Mon, Jun 28, 2021 at 8:07 PM Christoph Hellwig <hch@infradead.org> wrote:
> On Mon, Jun 28, 2021 at 06:39:09PM +0100, Matthew Wilcox wrote:
> > Not hugely happy with either of these options, tbh.  I'd rather we apply
> > a patch akin to this one (plucked from the folio tree), so won't apply:
>
> > so permit pages without an iop to enter writeback and create an iop
> > *then*.  Would that solve your problem?
>
> It is the right thing to do, especially when combined with a feature
> patch to not bother to create the iomap_page structure on small
> block size file systems when the extent covers the whole page.
>
> >
> > > (3) We're not yet using iomap_page_mkwrite, so iomap_page objects don't
> > > get created on .page_mkwrite, either.  Part of the reason is that
> > > iomap_page_mkwrite locks the page and then calls into the filesystem for
> > > uninlining and for allocating backing blocks.  This conflicts with the
> > > gfs2 locking order: on gfs2, transactions must be started before locking
> > > any pages.  We can fix that by calling iomap_page_create from
> > > gfs2_page_mkwrite, or by doing the uninlining and allocations before
> > > calling iomap_page_mkwrite.  I've implemented option 2 for now; see
> > > here:
> >
> > I think this might also solve this problem?
>
> We'll still need to create the iomap_page structure for page_mkwrite
> if there is an extent boundary inside the page.

Yes, but the iop wouldn't need to be allocated in page_mkwrite; that
would be taken care of by iomap_writepage / iomap_writepages as in the
patch suggested by Matthew, right?

Thanks,
Andreas
Matthew Wilcox June 28, 2021, 9:59 p.m. UTC | #5
On Mon, Jun 28, 2021 at 06:47:58PM +0100, Christoph Hellwig wrote:
> On Mon, Jun 28, 2021 at 06:39:09PM +0100, Matthew Wilcox wrote:
> > Not hugely happy with either of these options, tbh.  I'd rather we apply
> > a patch akin to this one (plucked from the folio tree), so won't apply:
> 
> > so permit pages without an iop to enter writeback and create an iop
> > *then*.  Would that solve your problem?
> 
> It is the right thing to do, especially when combined with a feature
> patch to not bother to create the iomap_page structure on small
> block size file systems when the extent covers the whole page.

We don't know the extent layout at the point where *this* patch creates
iomap_pages during writeback.  I imagine we can delay creating one until
we find out what our destination layout will be?
Christoph Hellwig June 29, 2021, 5:29 a.m. UTC | #6
On Mon, Jun 28, 2021 at 10:59:55PM +0100, Matthew Wilcox wrote:
> > > so permit pages without an iop to enter writeback and create an iop
> > > *then*.  Would that solve your problem?
> > 
> > It is the right thing to do, especially when combined with a feature
> > patch to not bother to create the iomap_page structure on small
> > block size file systems when the extent covers the whole page.
> 
> We don't know the extent layout at the point where *this* patch creates
> iomap_pages during writeback.  I imagine we can delay creating one until
> we find out what our destination layout will be?

Hmm.  Actually ->page_mkwrite is always is always called on an uptodate
page and we even assert that.  I should have remembered the whole page
fault path better.

So yeah, I think we should take patch 1 from Andreas, then a non-folio
version of your patch as a start.  The next steps then would be in
approximate order:

 1. remove the iomap_page_create in iomap_page_mkwrite_actor as it
    clearly is not needed at that point
 2. don't bother to create an iomap_page in iomap_readpage_actor when
    the iomap spans the whole page
 3. don't create the iomap_page in __iomap_write_begin when the
    page is marked uptodate or the write covers the whole page 

delaying the creation further in iomap_writepage_map will be harder
as the loop around iomap_add_to_ioend is still fundamentally block
based.
Christoph Hellwig June 29, 2021, 5:42 a.m. UTC | #7
On Tue, Jun 29, 2021 at 06:29:48AM +0100, Christoph Hellwig wrote:
> Hmm.  Actually ->page_mkwrite is always is always called on an uptodate
> page and we even assert that.  I should have remembered the whole page
> fault path better.
> 
> So yeah, I think we should take patch 1 from Andreas, then a non-folio
> version of your patch as a start.  The next steps then would be in
> approximate order:
> 
>  1. remove the iomap_page_create in iomap_page_mkwrite_actor as it
>     clearly is not needed at that point
>  2. don't bother to create an iomap_page in iomap_readpage_actor when
>     the iomap spans the whole page
>  3. don't create the iomap_page in __iomap_write_begin when the
>     page is marked uptodate or the write covers the whole page 

Further thoughts for a better series:

 1. create iomap_page if needed in iomap_writepage_map
 2. do not create the iomap_page at all in iomap_readpage_actor.
    ->readahead is always called on newly allocated pages, and
    ->readpage either on a clean !uptodate page or on one that
    has seen a write leading to a partial uptodate state.  That
    is for the case that cares about the iomap_page it is present
    already
 3. don't create the iomap_page in iomap_page_mkwrite_actor

I think this is the simple initial series that should solve Andreas'
problem.  Then we can look into optimizing __iomap_write_begin
and iomap_writepage_map further as needed.
Andreas Gruenbacher June 29, 2021, 9:12 a.m. UTC | #8
Below is a version of your patch on top of v5.13 which has passed some
local testing here.

Thanks,
Andreas

--

iomap: Permit pages without an iop to enter writeback

Permit pages without an iop to enter writeback and create an iop *then*.  This
allows filesystems to mark pages dirty without having to worry about how the
iop block tracking is implemented.

Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/iomap/buffered-io.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 03537ecb2a94..6330dabc451e 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1336,14 +1336,13 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 		struct writeback_control *wbc, struct inode *inode,
 		struct page *page, u64 end_offset)
 {
-	struct iomap_page *iop = to_iomap_page(page);
+	struct iomap_page *iop = iomap_page_create(inode, page);
 	struct iomap_ioend *ioend, *next;
 	unsigned len = i_blocksize(inode);
 	u64 file_offset; /* file offset of page */
 	int error = 0, count = 0, i;
 	LIST_HEAD(submit_list);
 
-	WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
 	WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) != 0);
 
 	/*
Andreas Gruenbacher June 30, 2021, 12:29 p.m. UTC | #9
Darrick,

On Tue, Jun 29, 2021 at 7:47 AM Christoph Hellwig <hch@infradead.org> wrote:
> On Mon, Jun 28, 2021 at 10:59:55PM +0100, Matthew Wilcox wrote:
> > > > so permit pages without an iop to enter writeback and create an iop
> > > > *then*.  Would that solve your problem?
> > >
> > > It is the right thing to do, especially when combined with a feature
> > > patch to not bother to create the iomap_page structure on small
> > > block size file systems when the extent covers the whole page.
> >
> > We don't know the extent layout at the point where *this* patch creates
> > iomap_pages during writeback.  I imagine we can delay creating one until
> > we find out what our destination layout will be?
>
> Hmm.  Actually ->page_mkwrite is always is always called on an uptodate
> page and we even assert that.  I should have remembered the whole page
> fault path better.
>
> So yeah, I think we should take patch 1 from Andreas, then a non-folio
> version of your patch as a start.

will you pick up those two patches and push them to Linus? They both
seem pretty safe.

Thanks,
Andreas
Matthew Wilcox June 30, 2021, 2:08 p.m. UTC | #10
On Tue, Jun 29, 2021 at 11:12:39AM +0200, Andreas Gruenbacher wrote:
> Below is a version of your patch on top of v5.13 which has passed some
> local testing here.
> 
> Thanks,
> Andreas
> 
> --
> 
> iomap: Permit pages without an iop to enter writeback
> 
> Permit pages without an iop to enter writeback and create an iop *then*.  This
> allows filesystems to mark pages dirty without having to worry about how the
> iop block tracking is implemented.

How about ...

Create an iop in the writeback path if one doesn't exist.  This allows
us to avoid creating the iop in some cases.  The only current case we
do that for is pages with inline data, but it can be extended to pages
which are entirely within an extent.  It also allows for an iop to be
removed from pages in the future (eg page split).

> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>

Co-developed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Andreas Gruenbacher June 30, 2021, 2:45 p.m. UTC | #11
On Wed, Jun 30, 2021 at 4:09 PM Matthew Wilcox <willy@infradead.org> wrote:
> On Tue, Jun 29, 2021 at 11:12:39AM +0200, Andreas Gruenbacher wrote:
> > Below is a version of your patch on top of v5.13 which has passed some
> > local testing here.
> >
> > Thanks,
> > Andreas
> >
> > --
> >
> > iomap: Permit pages without an iop to enter writeback
> >
> > Permit pages without an iop to enter writeback and create an iop *then*.  This
> > allows filesystems to mark pages dirty without having to worry about how the
> > iop block tracking is implemented.
>
> How about ...
>
> Create an iop in the writeback path if one doesn't exist.  This allows
> us to avoid creating the iop in some cases.  The only current case we
> do that for is pages with inline data, but it can be extended to pages
> which are entirely within an extent.  It also allows for an iop to be
> removed from pages in the future (eg page split).
>
> > Suggested-by: Matthew Wilcox <willy@infradead.org>
> > Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
>
> Co-developed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Sure, that works for me.

Thanks,
Andreas
Andreas Gruenbacher July 5, 2021, 3:51 p.m. UTC | #12
On Wed, Jun 30, 2021 at 2:29 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> Darrick,
>
> will you pick up those two patches and push them to Linus? They both
> seem pretty safe.

Hello, is there anybody out there?

I've put the two patches here with the sign-offs they've received:

https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git/log/?h=for-next.iomap

Thanks,
Andreas
Christoph Hellwig July 5, 2021, 4:55 p.m. UTC | #13
On Mon, Jul 05, 2021 at 05:51:22PM +0200, Andreas Gruenbacher wrote:
> On Wed, Jun 30, 2021 at 2:29 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > Darrick,
> >
> > will you pick up those two patches and push them to Linus? They both
> > seem pretty safe.
> 
> Hello, is there anybody out there?
> 
> I've put the two patches here with the sign-offs they've received:

Please send out an updated series likey everyone else.