
iomap: Return zero in case of unsuccessful pagecache invalidation before DIO

Message ID 20200528192103.xm45qoxqmkw7i5yl@fiona (mailing list archive)
State New, archived
Series iomap: Return zero in case of unsuccessful pagecache invalidation before DIO

Commit Message

Goldwyn Rodrigues May 28, 2020, 7:21 p.m. UTC
Filesystems such as btrfs are unable to guarantee page invalidation
because pages could be locked as a part of the extent. Return zero
in case a page cache invalidation is unsuccessful so filesystems can
fallback to buffered I/O. This is similar to
generic_file_direct_write().

This takes care of the following invalidation warning during btrfs
mixed buffered and direct I/O using iomap_dio_rw():

Page cache invalidation failure on direct I/O.  Possible data
corruption due to collision with buffered I/O!

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

Comments

Darrick J. Wong May 29, 2020, 12:23 a.m. UTC | #1
On Thu, May 28, 2020 at 02:21:03PM -0500, Goldwyn Rodrigues wrote:
> 
> Filesystems such as btrfs are unable to guarantee page invalidation
> because pages could be locked as a part of the extent. Return zero

Locked for what?  filemap_write_and_wait_range should have just cleaned
them off.

> in case a page cache invalidation is unsuccessful so filesystems can
> fallback to buffered I/O. This is similar to
> generic_file_direct_write().
> 
> This takes care of the following invalidation warning during btrfs
> mixed buffered and direct I/O using iomap_dio_rw():
> 
> Page cache invalidation failure on direct I/O.  Possible data
> corruption due to collision with buffered I/O!
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index e4addfc58107..215315be6233 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -483,9 +483,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	 */
>  	ret = invalidate_inode_pages2_range(mapping,
>  			pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> -	if (ret)
> -		dio_warn_stale_pagecache(iocb->ki_filp);
> -	ret = 0;
> +	/*
> +	 * If a page can not be invalidated, return 0 to fall back
> +	 * to buffered write.
> +	 */
> +	if (ret) {
> +		if (ret == -EBUSY)
> +			ret = 0;
> +		goto out_free_dio;

XFS doesn't fall back to buffered io when directio fails, which means
this will cause a regression there.

Granted mixing write types is bogus...

--D

> +	}
>  
>  	if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
>  	    !inode->i_sb->s_dio_done_wq) {
> 
> -- 
> Goldwyn
Filipe Manana May 29, 2020, 10:55 a.m. UTC | #2
On Fri, May 29, 2020 at 1:23 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
>
> On Thu, May 28, 2020 at 02:21:03PM -0500, Goldwyn Rodrigues wrote:
> >
> > Filesystems such as btrfs are unable to guarantee page invalidation
> > because pages could be locked as a part of the extent. Return zero
>
> Locked for what?  filemap_write_and_wait_range should have just cleaned
> them off.

Yes, it will be confusing even for someone more familiar with btrfs.
The changelog could be more detailed to make it clear what's happening and why.

So what happens:

1) iomap_dio_rw() calls filemap_write_and_wait_range().
    That starts delalloc for all dirty pages in the range and then
waits for writeback to complete.
    This is enough for most filesystems at least (if not all except btrfs).

2) However, in btrfs once writeback finishes, a job is queued to run
on a dedicated workqueue, to execute the function
btrfs_finish_ordered_io().
    So that job will be run after filemap_write_and_wait_range() returns.
    That function locks the file range (using a btrfs specific data
structure), does a bunch of things (updating several btrees), and then
unlocks the file range.

3) While iomap calls invalidate_inode_pages2_range(), which ends up
calling the btrfs callback btrfs_releasepage(),
    btrfs_finish_ordered_io() is running and has the file range locked
(this is what Goldwyn means by "pages could be locked", which is
confusing because it's not about any locked struct page).

4) Because the file range is locked, btrfs_releasepage() returns 0
(page can't be released), this happens in the helper function
try_release_extent_state().
    Any page in that range is not dirty nor under writeback anymore
and, in fact, btrfs_finish_ordered_io() doesn't do anything with the
pages, it's only updating metadata.

    btrfs_releasepage() in this case could release the pages, but
there are other contexts where the file range is locked, the pages
are still not dirty and not under writeback, where this would not be
safe to do.

5) So because of that invalidate_inode_pages2_range() returns
non-zero, the iomap code prints that warning message and then proceeds
with doing a direct IO write anyway.
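
To make steps 3) and 4) more concrete, the check that makes the release
fail is roughly this (a sketch from memory, not verbatim kernel code):

/*
 * Rough sketch of btrfs' ->releasepage() helper: refuse to release a
 * page whose file range is still locked in the extent io tree, which
 * is the state btrfs_finish_ordered_io() keeps it in while it updates
 * the metadata.
 */
static int try_release_extent_state(struct extent_io_tree *tree,
				    struct page *page, gfp_t mask)
{
	u64 start = page_offset(page);
	u64 end = start + PAGE_SIZE - 1;

	/* Range still locked (EXTENT_LOCKED) -> page can't be released */
	if (test_range_bit(tree, start, end, EXTENT_LOCKED, 0, NULL))
		return 0;

	/* ... otherwise clear the remaining state bits and release ... */
	return 1;
}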

What happens currently in btrfs, before Goldwyn's patchset:

1) generic_file_direct_write() also calls filemap_write_and_wait_range().
2) After that it calls invalidate_inode_pages2_range() too, but if
that returns non-zero, it doesn't print any warning and falls back to
a buffered write.

So Goldwyn here is effectively adding that behaviour from
generic_file_direct_write() to iomap.
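
For reference, the relevant part of generic_file_direct_write() looks
roughly like this (abridged and from memory, so not a verbatim copy):

	written = filemap_write_and_wait_range(mapping, pos,
						pos + write_len - 1);
	if (written)
		goto out;

	written = invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
					(pos + write_len - 1) >> PAGE_SHIFT);
	/*
	 * If a page can not be invalidated, return 0 to fall back
	 * to buffered write.
	 */
	if (written) {
		if (written == -EBUSY)
			return 0;
		goto out;
	}
	/* ... only then issue the direct write ... */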

Thanks.

>
> > in case a page cache invalidation is unsuccessful so filesystems can
> > fallback to buffered I/O. This is similar to
> > generic_file_direct_write().
> >
> > This takes care of the following invalidation warning during btrfs
> > mixed buffered and direct I/O using iomap_dio_rw():
> >
> > Page cache invalidation failure on direct I/O.  Possible data
> > corruption due to collision with buffered I/O!
> >
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> >
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index e4addfc58107..215315be6233 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -483,9 +483,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> >        */
> >       ret = invalidate_inode_pages2_range(mapping,
> >                       pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > -     if (ret)
> > -             dio_warn_stale_pagecache(iocb->ki_filp);
> > -     ret = 0;
> > +     /*
> > +      * If a page can not be invalidated, return 0 to fall back
> > +      * to buffered write.
> > +      */
> > +     if (ret) {
> > +             if (ret == -EBUSY)
> > +                     ret = 0;
> > +             goto out_free_dio;
>
> XFS doesn't fall back to buffered io when directio fails, which means
> this will cause a regression there.
>
> Granted mixing write types is bogus...
>
> --D
>
> > +     }
> >
> >       if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
> >           !inode->i_sb->s_dio_done_wq) {
> >
> > --
> > Goldwyn
Matthew Wilcox (Oracle) May 29, 2020, 11:31 a.m. UTC | #3
On Fri, May 29, 2020 at 11:55:33AM +0100, Filipe Manana wrote:
> On Fri, May 29, 2020 at 1:23 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> >
> > On Thu, May 28, 2020 at 02:21:03PM -0500, Goldwyn Rodrigues wrote:
> > >
> > > Filesystems such as btrfs are unable to guarantee page invalidation
> > > because pages could be locked as a part of the extent. Return zero
> >
> > Locked for what?  filemap_write_and_wait_range should have just cleaned
> > them off.
> 
> Yes, it will be confusing even for someone more familiar with btrfs.
> The changelog could be more detailed to make it clear what's happening and why.
> 
> So what happens:
> 
> 1) iomap_dio_rw() calls filemap_write_and_wait_range().
>     That starts delalloc for all dirty pages in the range and then
> waits for writeback to complete.
>     This is enough for most filesystems at least (if not all except btrfs).
> 
> 2) However, in btrfs once writeback finishes, a job is queued to run
> on a dedicated workqueue, to execute the function
> btrfs_finish_ordered_io().
>     So that job will be run after filemap_write_and_wait_range() returns.
>     That function locks the file range (using a btrfs specific data
> structure), does a bunch of things (updating several btrees), and then
> unlocks the file range.
> 
> 3) While iomap calls invalidate_inode_pages2_range(), which ends up
> calling the btrfs callback btrfs_releasepage(),
>     btrfs_finish_ordered_io() is running and has the file range locked
> (this is what Goldwyn means by "pages could be locked", which is
> confusing because it's not about any locked struct page).
> 
> 4) Because the file range is locked, btrfs_releasepage() returns 0
> (page can't be released), this happens in the helper function
> try_release_extent_state().
>     Any page in that range is not dirty nor under writeback anymore
> and, in fact, btrfs_finish_ordered_io() doesn't do anything with the
> pages, it's only updating metadata.
> 
>     btrfs_releasepage() in this case could release the pages, but
> there are other contexts where the file range is locked, the pages
> are still not dirty and not under writeback, where this would not be
> safe to do.

Isn't this the bug, though?  Rather than returning "page can't be
released", shouldn't ->releasepage sleep on the extent state, at least
if the GFP mask indicates you can sleep?
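
Something along these lines, perhaps (purely illustrative and untested,
using the extent io tree helpers as I remember them):

	if (test_range_bit(tree, start, end, EXTENT_LOCKED, 0, NULL)) {
		if (!gfpflags_allow_blocking(mask))
			return 0;
		/* wait for btrfs_finish_ordered_io() to unlock the range */
		lock_extent(tree, start, end);
		unlock_extent(tree, start, end);
	}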

> 5) So because of that invalidate_inode_pages2_range() returns
> non-zero, the iomap code prints that warning message and then proceeds
> with doing a direct IO write anyway.
> 
> What happens currently in btrfs, before Goldwyn's patchset:
> 
> 1) generic_file_direct_write() also calls filemap_write_and_wait_range().
> 2) After that it calls invalidate_inode_pages2_range() too, but if
> that returns non-zero, it doesn't print any warning and falls back to
> a buffered write.
> 
> So Goldwyn here is effectively adding that behaviour from
> generic_file_direct_write() to iomap.
> 
> Thanks.
> 
> >
> > > in case a page cache invalidation is unsuccessful so filesystems can
> > > fallback to buffered I/O. This is similar to
> > > generic_file_direct_write().
> > >
> > > This takes care of the following invalidation warning during btrfs
> > > mixed buffered and direct I/O using iomap_dio_rw():
> > >
> > > Page cache invalidation failure on direct I/O.  Possible data
> > > corruption due to collision with buffered I/O!
> > >
> > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > >
> > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > index e4addfc58107..215315be6233 100644
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -483,9 +483,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > >        */
> > >       ret = invalidate_inode_pages2_range(mapping,
> > >                       pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > > -     if (ret)
> > > -             dio_warn_stale_pagecache(iocb->ki_filp);
> > > -     ret = 0;
> > > +     /*
> > > +      * If a page can not be invalidated, return 0 to fall back
> > > +      * to buffered write.
> > > +      */
> > > +     if (ret) {
> > > +             if (ret == -EBUSY)
> > > +                     ret = 0;
> > > +             goto out_free_dio;
> >
> > XFS doesn't fall back to buffered io when directio fails, which means
> > this will cause a regression there.
> >
> > Granted mixing write types is bogus...
> >
> > --D
> >
> > > +     }
> > >
> > >       if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
> > >           !inode->i_sb->s_dio_done_wq) {
> > >
> > > --
> > > Goldwyn
> 
> 
> 
> -- 
> Filipe David Manana,
> 
> “Whether you think you can, or you think you can't — you're right.”
Filipe Manana May 29, 2020, 11:50 a.m. UTC | #4
On Fri, May 29, 2020 at 12:31 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, May 29, 2020 at 11:55:33AM +0100, Filipe Manana wrote:
> > On Fri, May 29, 2020 at 1:23 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > >
> > > On Thu, May 28, 2020 at 02:21:03PM -0500, Goldwyn Rodrigues wrote:
> > > >
> > > > Filesystems such as btrfs are unable to guarantee page invalidation
> > > > because pages could be locked as a part of the extent. Return zero
> > >
> > > Locked for what?  filemap_write_and_wait_range should have just cleaned
> > > them off.
> >
> > Yes, it will be confusing even for someone more familiar with btrfs.
> > The changelog could be more detailed to make it clear what's happening and why.
> >
> > So what happens:
> >
> > 1) iomap_dio_rw() calls filemap_write_and_wait_range().
> >     That starts delalloc for all dirty pages in the range and then
> > waits for writeback to complete.
> >     This is enough for most filesystems at least (if not all except btrfs).
> >
> > 2) However, in btrfs once writeback finishes, a job is queued to run
> > on a dedicated workqueue, to execute the function
> > btrfs_finish_ordered_io().
> >     So that job will be run after filemap_write_and_wait_range() returns.
> >     That function locks the file range (using a btrfs specific data
> > structure), does a bunch of things (updating several btrees), and then
> > unlocks the file range.
> >
> > 3) While iomap calls invalidate_inode_pages2_range(), which ends up
> > calling the btrfs callback btrfs_releasepage(),
> >     btrfs_finish_ordered_io() is running and has the file range locked
> > (this is what Goldwyn means by "pages could be locked", which is
> > confusing because it's not about any locked struct page).
> >
> > 4) Because the file range is locked, btrfs_releasepage() returns 0
> > (page can't be released), this happens in the helper function
> > try_release_extent_state().
> >     Any page in that range is not dirty nor under writeback anymore
> > and, in fact, btrfs_finish_ordered_io() doesn't do anything with the
> > pages, it's only updating metadata.
> >
> >     btrfs_releasepage() in this case could release the pages, but
> > there are other contexts where the file range is locked, the pages
> > are still not dirty and not under writeback, where this would not be
> > safe to do.
>
> Isn't this the bug, though?  Rather than returning "page can't be
> released", shouldn't ->releasepage sleep on the extent state, at least
> if the GFP mask indicates you can sleep?

Goldwyn mentioned in another thread that he had tried that with the
following patch:

https://patchwork.kernel.org/patch/11275063/

But he mentioned it didn't work, as it caused some locking problems.
I don't know the details and I haven't tried the patchset yet.
Goldwyn?

>
> > 5) So because of that invalidate_inode_pages2_range() returns
> > non-zero, the iomap code prints that warning message and then proceeds
> > with doing a direct IO write anyway.
> >
> > What happens currently in btrfs, before Goldwyn's patchset:
> >
> > 1) generic_file_direct_write() also calls filemap_write_and_wait_range().
> > 2) After that it calls invalidate_inode_pages2_range() too, but if
> > that returns non-zero, it doesn't print any warning and falls back to
> > a buffered write.
> >
> > So Goldwyn here is effectively adding that behaviour from
> > generic_file_direct_write() to iomap.
> >
> > Thanks.
> >
> > >
> > > > in case a page cache invalidation is unsuccessful so filesystems can
> > > > fallback to buffered I/O. This is similar to
> > > > generic_file_direct_write().
> > > >
> > > > This takes care of the following invalidation warning during btrfs
> > > > mixed buffered and direct I/O using iomap_dio_rw():
> > > >
> > > > Page cache invalidation failure on direct I/O.  Possible data
> > > > corruption due to collision with buffered I/O!
> > > >
> > > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > >
> > > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > > index e4addfc58107..215315be6233 100644
> > > > --- a/fs/iomap/direct-io.c
> > > > +++ b/fs/iomap/direct-io.c
> > > > @@ -483,9 +483,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > > >        */
> > > >       ret = invalidate_inode_pages2_range(mapping,
> > > >                       pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > > > -     if (ret)
> > > > -             dio_warn_stale_pagecache(iocb->ki_filp);
> > > > -     ret = 0;
> > > > +     /*
> > > > +      * If a page can not be invalidated, return 0 to fall back
> > > > +      * to buffered write.
> > > > +      */
> > > > +     if (ret) {
> > > > +             if (ret == -EBUSY)
> > > > +                     ret = 0;
> > > > +             goto out_free_dio;
> > >
> > > XFS doesn't fall back to buffered io when directio fails, which means
> > > this will cause a regression there.
> > >
> > > Granted mixing write types is bogus...
> > >
> > > --D
> > >
> > > > +     }
> > > >
> > > >       if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
> > > >           !inode->i_sb->s_dio_done_wq) {
> > > >
> > > > --
> > > > Goldwyn
> >
> >
> >
> > --
> > Filipe David Manana,
> >
> > “Whether you think you can, or you think you can't — you're right.”
Goldwyn Rodrigues May 29, 2020, 12:45 p.m. UTC | #5
On 12:50 29/05, Filipe Manana wrote:
> On Fri, May 29, 2020 at 12:31 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, May 29, 2020 at 11:55:33AM +0100, Filipe Manana wrote:
> > > On Fri, May 29, 2020 at 1:23 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > > >
> > > > On Thu, May 28, 2020 at 02:21:03PM -0500, Goldwyn Rodrigues wrote:
> > > > >
> > > > > Filesystems such as btrfs are unable to guarantee page invalidation
> > > > > because pages could be locked as a part of the extent. Return zero
> > > >
> > > > Locked for what?  filemap_write_and_wait_range should have just cleaned
> > > > them off.
> > >
> > > Yes, it will be confusing even for someone more familiar with btrfs.
> > > The changelog could be more detailed to make it clear what's happening and why.
> > >
> > > So what happens:
> > >
> > > 1) iomap_dio_rw() calls filemap_write_and_wait_range().
> > >     That starts delalloc for all dirty pages in the range and then
> > > waits for writeback to complete.
> > >     This is enough for most filesystems at least (if not all except btrfs).
> > >
> > > 2) However, in btrfs once writeback finishes, a job is queued to run
> > > on a dedicated workqueue, to execute the function
> > > btrfs_finish_ordered_io().
> > >     So that job will be run after filemap_write_and_wait_range() returns.
> > >     That function locks the file range (using a btrfs specific data
> > > structure), does a bunch of things (updating several btrees), and then
> > > unlocks the file range.
> > >
> > > 3) While iomap calls invalidate_inode_pages2_range(), which ends up
> > > calling the btrfs callback btrfs_releasepage(),
> > >     btrfs_finish_ordered_io() is running and has the file range locked
> > > (this is what Goldwyn means by "pages could be locked", which is
> > > confusing because it's not about any locked struct page).
> > >
> > > 4) Because the file range is locked, btrfs_releasepage() returns 0
> > > (page can't be released), this happens in the helper function
> > > try_release_extent_state().
> > >     Any page in that range is not dirty nor under writeback anymore
> > > and, in fact, btrfs_finish_ordered_io() doesn't do anything with the
> > > pages, it's only updating metadata.
> > >
> > >     btrfs_releasepage() in this case could release the pages, but
> > > there are other contexts where the file range is locked, the pages
> > > are still not dirty and not under writeback, where this would not be
> > > safe to do.
> >
> > Isn't this the bug, though?  Rather than returning "page can't be
> > released", shouldn't ->releasepage sleep on the extent state, at least
> > if the GFP mask indicates you can sleep?
> 
> Goldwyn mentioned in another thread that he had tried that with the
> following patch:
> 
> https://patchwork.kernel.org/patch/11275063/
> 
> But he mentioned it didn't work, as it caused some locking problems.
> I don't know the details and I haven't tried the patchset yet.
> Goldwyn?
> 

Yes, direct I/O would wait for extent bits to be unlocked forever and hang.
I think it was against an fsync call, but I don't remember. In the light
of new developments, I would pursue this further. This should be valid
even in the current (before iomap patches) source.
Goldwyn Rodrigues June 1, 2020, 3:16 p.m. UTC | #6
On 17:23 28/05, Darrick J. Wong wrote:
> On Thu, May 28, 2020 at 02:21:03PM -0500, Goldwyn Rodrigues wrote:
> > 
> > Filesystems such as btrfs are unable to guarantee page invalidation
> > because pages could be locked as a part of the extent. Return zero
> 
> Locked for what?  filemap_write_and_wait_range should have just cleaned
> them off.
> 
> > in case a page cache invalidation is unsuccessful so filesystems can
> > fallback to buffered I/O. This is similar to
> > generic_file_direct_write().
> > 
> > This takes care of the following invalidation warning during btrfs
> > mixed buffered and direct I/O using iomap_dio_rw():
> > 
> > Page cache invalidation failure on direct I/O.  Possible data
> > corruption due to collision with buffered I/O!
> > 
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > 
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index e4addfc58107..215315be6233 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -483,9 +483,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> >  	 */
> >  	ret = invalidate_inode_pages2_range(mapping,
> >  			pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > -	if (ret)
> > -		dio_warn_stale_pagecache(iocb->ki_filp);
> > -	ret = 0;
> > +	/*
> > +	 * If a page can not be invalidated, return 0 to fall back
> > +	 * to buffered write.
> > +	 */
> > +	if (ret) {
> > +		if (ret == -EBUSY)
> > +			ret = 0;
> > +		goto out_free_dio;
> 
> XFS doesn't fall back to buffered io when directio fails, which means
> this will cause a regression there.
> 
> Granted mixing write types is bogus...
> 

I have not seen page invalidation failure errors on XFS, but what should
happen hypothetically if they do occur? Carry on with the direct I/O?
Would an error return like -ENOTBLK be better?
Filipe Manana June 3, 2020, 11:23 a.m. UTC | #7
On Mon, Jun 1, 2020 at 4:16 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
>
> On 17:23 28/05, Darrick J. Wong wrote:
> > On Thu, May 28, 2020 at 02:21:03PM -0500, Goldwyn Rodrigues wrote:
> > >
> > > Filesystems such as btrfs are unable to guarantee page invalidation
> > > because pages could be locked as a part of the extent. Return zero
> >
> > Locked for what?  filemap_write_and_wait_range should have just cleaned
> > them off.
> >
> > > in case a page cache invalidation is unsuccessful so filesystems can
> > > fallback to buffered I/O. This is similar to
> > > generic_file_direct_write().
> > >
> > > This takes care of the following invalidation warning during btrfs
> > > mixed buffered and direct I/O using iomap_dio_rw():
> > >
> > > Page cache invalidation failure on direct I/O.  Possible data
> > > corruption due to collision with buffered I/O!
> > >
> > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > >
> > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > index e4addfc58107..215315be6233 100644
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -483,9 +483,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > >      */
> > >     ret = invalidate_inode_pages2_range(mapping,
> > >                     pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > > -   if (ret)
> > > -           dio_warn_stale_pagecache(iocb->ki_filp);
> > > -   ret = 0;
> > > +   /*
> > > +    * If a page can not be invalidated, return 0 to fall back
> > > +    * to buffered write.
> > > +    */
> > > +   if (ret) {
> > > +           if (ret == -EBUSY)
> > > +                   ret = 0;
> > > +           goto out_free_dio;
> >
> > XFS doesn't fall back to buffered io when directio fails, which means
> > this will cause a regression there.
> >
> > Granted mixing write types is bogus...
> >
>
> I have not seen page invalidation failure errors on XFS, but what should
> happen hypothetically if they do occur? Carry on with the direct I/O?
> Would an error return like -ENOTBLK be better?

It doesn't make much sense to me to emit the warning and then proceed to the
direct IO write path anyway, as if nothing happened.
If we are concerned about possible corruption, we should either return
an error or fall back to buffered IO just like
generic_file_direct_write() did, and not allow the possibility of
corruption.

Btw, this is causing a regression in Btrfs now. The problem is that
dio_warn_stale_pagecache() sets an EIO error in the inode's mapping:

errseq_set(&inode->i_mapping->wb_err, -EIO);

So the next fsync on the file will return that error, despite the
fsync having completed successfully without any errors.
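
The chain is roughly this (sketch from memory, not verbatim code):

	/* fs/direct-io.c: the warning helper also taints the mapping */
	void dio_warn_stale_pagecache(struct file *filp)
	{
		errseq_set(&file_inode(filp)->i_mapping->wb_err, -EIO);
		/* ... rate-limited pr_crit() warning ... */
	}

	/*
	 * Any later fsync that checks the mapping's wb_err (e.g. via
	 * file_write_and_wait_range()/file_check_and_advance_wb_err())
	 * then reports -EIO even though no writeback actually failed.
	 */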

Since patchset to make btrfs direct IO use iomap is already in Linus'
tree, we need to fix this somehow.
This makes generic/547 fail often for example - buffered write against
file + direct IO write + fsync - the later returns -EIO.

Thanks.

>
> --
> Goldwyn
Filipe Manana June 3, 2020, 11:32 a.m. UTC | #8
On Wed, Jun 3, 2020 at 12:23 PM Filipe Manana <fdmanana@gmail.com> wrote:
>
> On Mon, Jun 1, 2020 at 4:16 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> >
> > On 17:23 28/05, Darrick J. Wong wrote:
> > > On Thu, May 28, 2020 at 02:21:03PM -0500, Goldwyn Rodrigues wrote:
> > > >
> > > > Filesystems such as btrfs are unable to guarantee page invalidation
> > > > because pages could be locked as a part of the extent. Return zero
> > >
> > > Locked for what?  filemap_write_and_wait_range should have just cleaned
> > > them off.
> > >
> > > > in case a page cache invalidation is unsuccessful so filesystems can
> > > > fallback to buffered I/O. This is similar to
> > > > generic_file_direct_write().
> > > >
> > > > This takes care of the following invalidation warning during btrfs
> > > > mixed buffered and direct I/O using iomap_dio_rw():
> > > >
> > > > Page cache invalidation failure on direct I/O.  Possible data
> > > > corruption due to collision with buffered I/O!
> > > >
> > > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > >
> > > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > > index e4addfc58107..215315be6233 100644
> > > > --- a/fs/iomap/direct-io.c
> > > > +++ b/fs/iomap/direct-io.c
> > > > @@ -483,9 +483,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > > >      */
> > > >     ret = invalidate_inode_pages2_range(mapping,
> > > >                     pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > > > -   if (ret)
> > > > -           dio_warn_stale_pagecache(iocb->ki_filp);
> > > > -   ret = 0;
> > > > +   /*
> > > > +    * If a page can not be invalidated, return 0 to fall back
> > > > +    * to buffered write.
> > > > +    */
> > > > +   if (ret) {
> > > > +           if (ret == -EBUSY)
> > > > +                   ret = 0;
> > > > +           goto out_free_dio;
> > >
> > > XFS doesn't fall back to buffered io when directio fails, which means
> > > this will cause a regression there.
> > >
> > > Granted mixing write types is bogus...
> > >
> >
> > I have not seen page invalidation failure errors on XFS, but what should
> > happen hypothetically if they do occur? Carry on with the direct I/O?
> > Would an error return like -ENOTBLK be better?
>
> It doesn't make much sense to me to emit the warning and then proceed to the
> direct IO write path anyway, as if nothing happened.
> If we are concerned about possible corruption, we should either return
> an error or fall back to buffered IO just like
> generic_file_direct_write() did, and not allow the possibility of
> corruption.
>
> Btw, this is causing a regression in Btrfs now. The problem is that
> dio_warn_stale_pagecache() sets an EIO error in the inode's mapping:
>
> errseq_set(&inode->i_mapping->wb_err, -EIO);
>
> So the next fsync on the file will return that error, despite the
> fsync having completed successfully without any errors.
>
> Since patchset to make btrfs direct IO use iomap is already in Linus'
> tree, we need to fix this somehow.
> This makes generic/547 fail often for example - buffered write against
> file + direct IO write + fsync - the later returns -EIO.

Just to make it clear, despite the -EIO error, there was actually no
data loss or corruption (generic/547 checks that),
since the direct IO write path in btrfs figures out there's a buffered
write still ongoing and waits for it to complete before proceeding
with the dio write.

Nevertheless, it's still a regression, -EIO shouldn't be returned as
everything went fine.

>
> Thanks.
>
> >
> > --
> > Goldwyn
>
>
>
> --
> Filipe David Manana,
>
> “Whether you think you can, or you think you can't — you're right.”
Darrick J. Wong June 3, 2020, 7:02 p.m. UTC | #9
On Wed, Jun 03, 2020 at 12:32:15PM +0100, Filipe Manana wrote:
> On Wed, Jun 3, 2020 at 12:23 PM Filipe Manana <fdmanana@gmail.com> wrote:
> >
> > On Mon, Jun 1, 2020 at 4:16 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> > >
> > > On 17:23 28/05, Darrick J. Wong wrote:
> > > > On Thu, May 28, 2020 at 02:21:03PM -0500, Goldwyn Rodrigues wrote:
> > > > >
> > > > > Filesystems such as btrfs are unable to guarantee page invalidation
> > > > > because pages could be locked as a part of the extent. Return zero
> > > >
> > > > Locked for what?  filemap_write_and_wait_range should have just cleaned
> > > > them off.
> > > >
> > > > > in case a page cache invalidation is unsuccessful so filesystems can
> > > > > fallback to buffered I/O. This is similar to
> > > > > generic_file_direct_write().
> > > > >
> > > > > This takes care of the following invalidation warning during btrfs
> > > > > mixed buffered and direct I/O using iomap_dio_rw():
> > > > >
> > > > > Page cache invalidation failure on direct I/O.  Possible data
> > > > > corruption due to collision with buffered I/O!
> > > > >
> > > > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > > >
> > > > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > > > index e4addfc58107..215315be6233 100644
> > > > > --- a/fs/iomap/direct-io.c
> > > > > +++ b/fs/iomap/direct-io.c
> > > > > @@ -483,9 +483,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > > > >      */
> > > > >     ret = invalidate_inode_pages2_range(mapping,
> > > > >                     pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > > > > -   if (ret)
> > > > > -           dio_warn_stale_pagecache(iocb->ki_filp);
> > > > > -   ret = 0;
> > > > > +   /*
> > > > > +    * If a page can not be invalidated, return 0 to fall back
> > > > > +    * to buffered write.
> > > > > +    */
> > > > > +   if (ret) {
> > > > > +           if (ret == -EBUSY)
> > > > > +                   ret = 0;
> > > > > +           goto out_free_dio;
> > > >
> > > > XFS doesn't fall back to buffered io when directio fails, which means
> > > > this will cause a regression there.
> > > >
> > > > Granted mixing write types is bogus...
> > > >
> > >
> > > I have not seen page invalidation failure errors on XFS, but what should

What happens if you try to dirty an mmap page at the same time as
issuing a directio?

> > > happen hypothetically if they do occur? Carry on with the direct I/O?
> > > Would an error return like -ENOTBLK be better?

In the old days, we would only WARN when we encountered collisions like
this.  How about only setting EIO in the mapping if we fail the
pagecache invalidation in directio completion?  If a buffered write
dirties the page after the direct write thread flushes the dirty pages
but before invalidation, we can argue that we didn't lose anything; the
direct write simply happened after the buffered write.
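
Something like this, i.e. keep the pre-submission invalidation
best-effort and only warn/taint the mapping on the post-write
invalidation in iomap_dio_complete() (untested sketch of the idea):

	/* before submitting the direct write: best effort, don't taint */
	ret = invalidate_inode_pages2_range(mapping,
			pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
	ret = 0;

	/*
	 * In iomap_dio_complete(), after the blocks have been written: a
	 * failed invalidation here really can leave stale cached data, so
	 * this becomes the only place that warns and sets the mapping error.
	 */
	if (!dio->error &&
	    (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) {
		err = invalidate_inode_pages2_range(inode->i_mapping,
				offset >> PAGE_SHIFT,
				(offset + dio->size - 1) >> PAGE_SHIFT);
		if (err)
			dio_warn_stale_pagecache(iocb->ki_filp);
	}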

XFS doesn't implement buffered write fallback, and it never has.  Either
the entire directio succeeds, or it returns a negative error code.  Some
of the iomap_dio_rw callers (ext4, gfs2) will notice a short direct
write and try to finish the rest with buffered io, but xfs and zonefs do
not.

The net effect of this (on xfs anyway) is that when buffered and direct
writes collide, before we'd make the buffered writer lose, now we make
the direct writer lose.

You also /could/ propose teaching xfs how to fall back to an
invalidating synchronous buffered write like ext4 does, but that's not
part of this patch set, and that's not a behavior I want to introduce
suddenly during the merge window.

> > It doesn't make much sense to me to emit the warning and then proceed to the
> > direct IO write path anyway, as if nothing happened.
> > If we are concerned about possible corruption, we should either return
> > an error or fall back to buffered IO just like
> > generic_file_direct_write() did, and not allow the possibility of
> > corruption.
> >
> > Btw, this is causing a regression in Btrfs now. The problem is that
> > dio_warn_stale_pagecache() sets an EIO error in the inode's mapping:
> >
> > errseq_set(&inode->i_mapping->wb_err, -EIO);
> >
> > So the next fsync on the file will return that error, despite the
> > fsync having completed successfully without any errors.
> >
> > Since patchset to make btrfs direct IO use iomap is already in Linus'
> > tree, we need to fix this somehow.

Y'all /just/ sent the pull request containing that conversion 2 days
ago.  Why did you move forward with that when you knew there were
unresolved fstests failures?

> > This makes generic/547 fail often for example - buffered write against
> > file + direct IO write + fsync - the later returns -EIO.
> 
> Just to make it clear, despite the -EIO error, there was actually no
> data loss or corruption (generic/547 checks that),
> since the direct IO write path in btrfs figures out there's a buffered
> write still ongoing and waits for it to complete before proceeding
> with the dio write.
> 
> Nevertheless, it's still a regression, -EIO shouldn't be returned as
> everything went fine.

Now I'm annoyed because I feel like you're trying to strong-arm me into
making last minute changes to iomap when you could have held off for
another cycle.

While I'm pretty sure your analysis is correct that we could /probably/
get away with only setting EIO if we can't invalidate the cache after
we've already written the disk blocks directly (because then we really
did lose something), I /really/ want these kinds of behavioral changes
to shared code to soak in for-next for long enough that we can work out
the kinks without exposing the upstream kernel to unnecessary risk.

This conversation would be /much/ different had you not just merged the
btrfs directio iomap conversion yesterday.

--D

> 
> >
> > Thanks.
> >
> > >
> > > --
> > > Goldwyn
> >
> >
> >
> > --
> > Filipe David Manana,
> >
> > “Whether you think you can, or you think you can't — you're right.”
> 
> 
> 
> -- 
> Filipe David Manana,
> 
> “Whether you think you can, or you think you can't — you're right.”
Filipe Manana June 3, 2020, 7:10 p.m. UTC | #10
On Wed, Jun 3, 2020 at 8:02 PM Darrick J. Wong <darrick.wong@oracle.com> wrote:
>
> On Wed, Jun 03, 2020 at 12:32:15PM +0100, Filipe Manana wrote:
> > On Wed, Jun 3, 2020 at 12:23 PM Filipe Manana <fdmanana@gmail.com> wrote:
> > >
> > > On Mon, Jun 1, 2020 at 4:16 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> > > >
> > > > On 17:23 28/05, Darrick J. Wong wrote:
> > > > > On Thu, May 28, 2020 at 02:21:03PM -0500, Goldwyn Rodrigues wrote:
> > > > > >
> > > > > > Filesystems such as btrfs are unable to guarantee page invalidation
> > > > > > because pages could be locked as a part of the extent. Return zero
> > > > >
> > > > > Locked for what?  filemap_write_and_wait_range should have just cleaned
> > > > > them off.
> > > > >
> > > > > > in case a page cache invalidation is unsuccessful so filesystems can
> > > > > > fallback to buffered I/O. This is similar to
> > > > > > generic_file_direct_write().
> > > > > >
> > > > > > This takes care of the following invalidation warning during btrfs
> > > > > > mixed buffered and direct I/O using iomap_dio_rw():
> > > > > >
> > > > > > Page cache invalidation failure on direct I/O.  Possible data
> > > > > > corruption due to collision with buffered I/O!
> > > > > >
> > > > > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > > > >
> > > > > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > > > > index e4addfc58107..215315be6233 100644
> > > > > > --- a/fs/iomap/direct-io.c
> > > > > > +++ b/fs/iomap/direct-io.c
> > > > > > @@ -483,9 +483,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > > > > >      */
> > > > > >     ret = invalidate_inode_pages2_range(mapping,
> > > > > >                     pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > > > > > -   if (ret)
> > > > > > -           dio_warn_stale_pagecache(iocb->ki_filp);
> > > > > > -   ret = 0;
> > > > > > +   /*
> > > > > > +    * If a page can not be invalidated, return 0 to fall back
> > > > > > +    * to buffered write.
> > > > > > +    */
> > > > > > +   if (ret) {
> > > > > > +           if (ret == -EBUSY)
> > > > > > +                   ret = 0;
> > > > > > +           goto out_free_dio;
> > > > >
> > > > > XFS doesn't fall back to buffered io when directio fails, which means
> > > > > this will cause a regression there.
> > > > >
> > > > > Granted mixing write types is bogus...
> > > > >
> > > >
> > > > I have not seen page invalidation failure errors on XFS, but what should
>
> What happens if you try to dirty an mmap page at the same time as
> issuing a directio?
>
> > > > happen hypothetically if they do occur? Carry on with the direct I/O?
> > > > Would an error return like -ENOTBLK be better?
>
> In the old days, we would only WARN when we encountered collisions like
> this.  How about only setting EIO in the mapping if we fail the
> pagecache invalidation in directio completion?  If a buffered write
> dirties the page after the direct write thread flushes the dirty pages
> but before invalidation, we can argue that we didn't lose anything; the
> direct write simply happened after the buffered write.
>
> XFS doesn't implement buffered write fallback, and it never has.  Either
> the entire directio succeeds, or it returns a negative error code.  Some
> of the iomap_dio_rw callers (ext4, gfs2) will notice a short direct
> write and try to finish the rest with buffered io, but xfs and zonefs do
> not.
>
> The net effect of this (on xfs anyway) is that when buffered and direct
> writes collide, before we'd make the buffered writer lose, now we make
> the direct writer lose.
>
> You also /could/ propose teaching xfs how to fall back to an
> invalidating synchronous buffered write like ext4 does, but that's not
> part of this patch set, and that's not a behavior I want to introduce
> suddenly during the merge window.
>
> > > It doesn't make much sense to me to emit the warning and then proceed to the
> > > direct IO write path anyway, as if nothing happened.
> > > If we are concerned about possible corruption, we should either return
> > > an error or fall back to buffered IO just like
> > > generic_file_direct_write() did, and not allow the possibility of
> > > corruption.
> > >
> > > Btw, this is causing a regression in Btrfs now. The problem is that
> > > dio_warn_stale_pagecache() sets an EIO error in the inode's mapping:
> > >
> > > errseq_set(&inode->i_mapping->wb_err, -EIO);
> > >
> > > So the next fsync on the file will return that error, despite the
> > > fsync having completed successfully without any errors.
> > >
> > > Since patchset to make btrfs direct IO use iomap is already in Linus'
> > > tree, we need to fix this somehow.
>
> Y'all /just/ sent the pull request containing that conversion 2 days
> ago.  Why did you move forward with that when you knew there were
> unresolved fstests failures?
>
> > > This makes generic/547 fail often for example - buffered write against
> > > file + direct IO write + fsync - the later returns -EIO.
> >
> > Just to make it clear, despite the -EIO error, there was actually no
> > data loss or corruption (generic/547 checks that),
> > since the direct IO write path in btrfs figures out there's a buffered
> > write still ongoing and waits for it to complete before proceeding
> > with the dio write.
> >
> > Nevertheless, it's still a regression, -EIO shouldn't be returned as
> > everything went fine.
>
> Now I'm annoyed because I feel like you're trying to strong-arm me into
> making last minute changes to iomap when you could have held off for
> another cycle.

If you are talking to me, I'm not trying to strong-arm anyone nor
point fingers.
I'm just reporting a problem that I found earlier today while testing
some work I was doing.

Thanks.

>
> While I'm pretty sure your analysis is correct that we could /probably/
> get away with only setting EIO if we can't invalidate the cache after
> we've already written the disk blocks directly (because then we really
> did lose something), I /really/ want these kinds of behavioral changes
> to shared code to soak in for-next for long enough that we can work out
> the kinks without exposing the upstream kernel to unnecessary risk.
>
> This conversation would be /much/ different had you not just merged the
> btrfs directio iomap conversion yesterday.
>
> --D
>
> >
> > >
> > > Thanks.
> > >
> > > >
> > > > --
> > > > Goldwyn
> > >
> > >
> > >
> > > --
> > > Filipe David Manana,
> > >
> > > “Whether you think you can, or you think you can't — you're right.”
> >
> >
> >
> > --
> > Filipe David Manana,
> >
> > “Whether you think you can, or you think you can't — you're right.”
Matthew Wilcox (Oracle) June 3, 2020, 7:18 p.m. UTC | #11
On Wed, Jun 03, 2020 at 08:10:50PM +0100, Filipe Manana wrote:
> On Wed, Jun 3, 2020 at 8:02 PM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > On Wed, Jun 03, 2020 at 12:32:15PM +0100, Filipe Manana wrote:
> > > On Wed, Jun 3, 2020 at 12:23 PM Filipe Manana <fdmanana@gmail.com> wrote:
> > > > Btw, this is causing a regression in Btrfs now. The problem is that
> > > > dio_warn_stale_pagecache() sets an EIO error in the inode's mapping:
> > > >
> > > > errseq_set(&inode->i_mapping->wb_err, -EIO);
> > > >
> > > > So the next fsync on the file will return that error, despite the
> > > > fsync having completed successfully without any errors.
> > > >
> > > > Since patchset to make btrfs direct IO use iomap is already in Linus'
> > > > tree, we need to fix this somehow.
> >
> > Y'all /just/ sent the pull request containing that conversion 2 days
> > ago.  Why did you move forward with that when you knew there were
> > unresolved fstests failures?
> >
> > Now I'm annoyed because I feel like you're trying to strong-arm me into
> > making last minute changes to iomap when you could have held off for
> > another cycle.
> 
> If you are talking to me, I'm not trying to strong-arm anyone nor
> point fingers.
> I'm just reporting a problem that I found earlier today while testing
> some work I was doing.

I think the correct response to having just found the bug is to back the
btrfs-to-iomap conversion out of Linus' tree.  I don't think changing
the iomap code at this time is the right approach.
Goldwyn Rodrigues June 3, 2020, 9:07 p.m. UTC | #12
On 12:02 03/06, Darrick J. Wong wrote:
> On Wed, Jun 03, 2020 at 12:32:15PM +0100, Filipe Manana wrote:
> > On Wed, Jun 3, 2020 at 12:23 PM Filipe Manana <fdmanana@gmail.com> wrote:
> > >
> > > On Mon, Jun 1, 2020 at 4:16 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> > > >
> > > > On 17:23 28/05, Darrick J. Wong wrote:
> > > > > On Thu, May 28, 2020 at 02:21:03PM -0500, Goldwyn Rodrigues wrote:
> > > > > >
> > > > > > Filesystems such as btrfs are unable to guarantee page invalidation
> > > > > > because pages could be locked as a part of the extent. Return zero
> > > > >
> > > > > Locked for what?  filemap_write_and_wait_range should have just cleaned
> > > > > them off.
> > > > >
> > > > > > in case a page cache invalidation is unsuccessful so filesystems can
> > > > > > fallback to buffered I/O. This is similar to
> > > > > > generic_file_direct_write().
> > > > > >
> > > > > > This takes care of the following invalidation warning during btrfs
> > > > > > mixed buffered and direct I/O using iomap_dio_rw():
> > > > > >
> > > > > > Page cache invalidation failure on direct I/O.  Possible data
> > > > > > corruption due to collision with buffered I/O!
> > > > > >
> > > > > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > > > >
> > > > > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > > > > index e4addfc58107..215315be6233 100644
> > > > > > --- a/fs/iomap/direct-io.c
> > > > > > +++ b/fs/iomap/direct-io.c
> > > > > > @@ -483,9 +483,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > > > > >      */
> > > > > >     ret = invalidate_inode_pages2_range(mapping,
> > > > > >                     pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > > > > > -   if (ret)
> > > > > > -           dio_warn_stale_pagecache(iocb->ki_filp);
> > > > > > -   ret = 0;
> > > > > > +   /*
> > > > > > +    * If a page can not be invalidated, return 0 to fall back
> > > > > > +    * to buffered write.
> > > > > > +    */
> > > > > > +   if (ret) {
> > > > > > +           if (ret == -EBUSY)
> > > > > > +                   ret = 0;
> > > > > > +           goto out_free_dio;
> > > > >
> > > > > XFS doesn't fall back to buffered io when directio fails, which means
> > > > > this will cause a regression there.
> > > > >
> > > > > Granted mixing write types is bogus...
> > > > >
> > > >
> > > > I have not seen page invalidation failure errors on XFS, but what should
> 
> What happens if you try to dirty an mmap page at the same time as
> issuing a directio?

I did not think of that scenario. But in this case, is mmap working on
stale data? And is it considered a writeback error?

> 
> > > > happen hypothetically if they do occur? Carry on with the direct I/O?
> > > > Would an error return like -ENOTBLK be better?
> 
> In the old days, we would only WARN when we encountered collisions like
> this.  How about only setting EIO in the mapping if we fail the
> pagecache invalidation in directio completion?  If a buffered write
> dirties the page after the direct write thread flushes the dirty pages
> but before invalidation, we can argue that we didn't lose anything; the
> direct write simply happened after the buffered write.

This error will ultimately be returned by iomap_dio_rw(), and -EIO would
mean there is a device error, not a transient error from which it
can recover. We could return -ENOTBLK, but that is used temporarily for
buffered write fallbacks such as in ext4. iomap still returns zero in
case of such transient errors.
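
The caller side would then look something like this (illustrative only;
the fs_* names below are placeholders, not actual ext4 or btrfs code):

	ret = iomap_dio_rw(iocb, from, &fs_dio_iomap_ops, &fs_dio_ops,
			   is_sync_kiocb(iocb));
	if (ret == -ENOTBLK) {
		/* transient invalidation failure: redo the write buffered */
		ret = fs_buffered_write_iter(iocb, from);
	}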

> 
> XFS doesn't implement buffered write fallback, and it never has.  Either
> the entire directio succeeds, or it returns a negative error code.  Some
> of the iomap_dio_rw callers (ext4, gfs2) will notice a short direct
> write and try to finish the rest with buffered io, but xfs and zonefs do
> not.
> 
> The net effect of this (on xfs anyway) is that when buffered and direct
> writes collide, before we'd make the buffered writer lose, now we make
> the direct writer lose.
> 
> You also /could/ propose teaching xfs how to fall back to an
> invalidating synchronous buffered write like ext4 does, but that's not
> part of this patch set, and that's not a behavior I want to introduce
> suddenly during the merge window.

So does that mean XFS would be open to falling back to a buffered write?
That would make things much simpler!
David Sterba June 4, 2020, 1:55 p.m. UTC | #13
On Wed, Jun 03, 2020 at 12:02:52PM -0700, Darrick J. Wong wrote:
> > > So the next fsync on the file will return that error, despite the
> > > fsync having completed successfully with any errors.
> > >
> > > Since patchset to make btrfs direct IO use iomap is already in Linus'
> > > tree, we need to fix this somehow.
> 
> Y'all /just/ sent the pull request containing that conversion 2 days
> ago.  Why did you move forward with that when you knew there were
> unresolved fstests failures?

Because we didn't know that. And the whole mixed buffered io and dio is
considered obscure, documented as 'do not do that', so tests that
report the warning are more of an annoyance (btrfs/004).

That the test generic/547 sometimes returns EIO on fsync is a
regression, reported after the pull request had been merged, but with a
proposed fix that is not that intrusive, so this all counts as a normal
development.

There is always some risk merging code like the dio-iomap conversion, and
it was known, but with an ultimate fallback plan to revert it in case we
encounter problems that are not solvable before release. But we're not
there yet.

> > > This makes generic/547 fail often for example - buffered write against
> > > file + direct IO write + fsync - the later returns -EIO.
> > 
> > Just to make it clear, despite the -EIO error, there was actually no
> > data loss or corruption (generic/547 checks that),
> > since the direct IO write path in btrfs figures out there's a buffered
> > write still ongoing and waits for it to complete before proceeding
> > with the dio write.
> > 
> > Nevertheless, it's still a regression, -EIO shouldn't be returned as
> > everything went fine.
> 
> Now I'm annoyed because I feel like you're trying to strong-arm me into
> making last minute changes to iomap when you could have held off for
> another cycle.

The patchset was held off for several releases, gradually making into
state where it can be merged, assuming we will be able to fix potential
regressions. Besides the SUSE people involved in the patchset, Christoph
asked why it's not merged and how he can help to move it forward. He's
listed as an iomap maintainer, so it's not like we were pushing code
without the maintainers involved.

Regarding the last minute change, that's not something we'd ask you to
do without testing first. There are 4 filesystems using iomap for
direct io; making sure it does not regress on them is something I'd
consider necessary before asking you to merge it.

This patchset is lacking that but it started a discussion to understand
the full extent of the bug. We're not in rc5 where calling it 'last
minute' would be appropriate.

The big-hammer option to revert 4 patches is still there. If the fix
turns out to require changes beyond iomap and btrfs code, I'd consider
that as a good reason and I'm ready to do the revert (say rc2 at the
latest).

Patch

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index e4addfc58107..215315be6233 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -483,9 +483,15 @@  iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	 */
 	ret = invalidate_inode_pages2_range(mapping,
 			pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
-	if (ret)
-		dio_warn_stale_pagecache(iocb->ki_filp);
-	ret = 0;
+	/*
+	 * If a page can not be invalidated, return 0 to fall back
+	 * to buffered write.
+	 */
+	if (ret) {
+		if (ret == -EBUSY)
+			ret = 0;
+		goto out_free_dio;
+	}
 
 	if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
 	    !inode->i_sb->s_dio_done_wq) {