diff mbox series

[04/15] iomap: Call inode_dio_end() before generic_write_sync()

Message ID 20200921144353.31319-5-rgoldwyn@suse.de
State New
Headers show
Series BTRFS DIO inode locking/D_SYNC fix | expand

Commit Message

Goldwyn Rodrigues Sept. 21, 2020, 2:43 p.m. UTC
From: Goldwyn Rodrigues <rgoldwyn@suse.com>

iomap complete routine can deadlock with btrfs_fallocate because of the
call to generic_write_sync().

P0                      P1
inode_lock()            fallocate(FALLOC_FL_ZERO_RANGE)
__iomap_dio_rw()        inode_lock()
                        <block>
<submits IO>
<completes IO>
inode_unlock()
                        <gets inode_lock()>
                        inode_dio_wait()
iomap_dio_complete()
  generic_write_sync()
    btrfs_file_fsync()
      inode_lock()
      <deadlock>

inode_dio_end() is used to notify the end of DIO data in order
to synchronize with truncate. Call inode_dio_end() before calling
generic_write_sync(), so filesystems can lock i_rwsem during a sync.

---
 fs/iomap/direct-io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Johannes Thumshirn Sept. 21, 2020, 3:11 p.m. UTC | #1
On 21/09/2020 16:44, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> iomap complete routine can deadlock with btrfs_fallocate because of the
> call to generic_write_sync().
> 
> P0                      P1
> inode_lock()            fallocate(FALLOC_FL_ZERO_RANGE)
> __iomap_dio_rw()        inode_lock()
>                         <block>
> <submits IO>
> <completes IO>
> inode_unlock()
>                         <gets inode_lock()>
>                         inode_dio_wait()
> iomap_dio_complete()
>   generic_write_sync()
>     btrfs_file_fsync()
>       inode_lock()
>       <deadlock>
> 
> inode_dio_end() is used to notify the end of DIO data in order
> to synchronize with truncate. Call inode_dio_end() before calling
> generic_write_sync(), so filesystems can lock i_rwsem during a sync.
> 
> ---
>  fs/iomap/direct-io.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)


Missing S-o-b as well.
Christoph Hellwig Sept. 22, 2020, 1:21 p.m. UTC | #2
On Mon, Sep 21, 2020 at 09:43:42AM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> iomap complete routine can deadlock with btrfs_fallocate because of the
> call to generic_write_sync().
> 
> P0                      P1
> inode_lock()            fallocate(FALLOC_FL_ZERO_RANGE)
> __iomap_dio_rw()        inode_lock()
>                         <block>
> <submits IO>
> <completes IO>
> inode_unlock()
>                         <gets inode_lock()>
>                         inode_dio_wait()
> iomap_dio_complete()
>   generic_write_sync()
>     btrfs_file_fsync()
>       inode_lock()
>       <deadlock>
> 
> inode_dio_end() is used to notify the end of DIO data in order
> to synchronize with truncate. Call inode_dio_end() before calling
> generic_write_sync(), so filesystems can lock i_rwsem during a sync.

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

You might want to mention in the commit log that this matches the
sequence in the old fs/direct-io.c implementation.
Josef Bacik Sept. 22, 2020, 2:20 p.m. UTC | #3
On 9/21/20 10:43 AM, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> iomap complete routine can deadlock with btrfs_fallocate because of the
> call to generic_write_sync().
> 
> P0                      P1
> inode_lock()            fallocate(FALLOC_FL_ZERO_RANGE)
> __iomap_dio_rw()        inode_lock()
>                          <block>
> <submits IO>
> <completes IO>
> inode_unlock()
>                          <gets inode_lock()>
>                          inode_dio_wait()
> iomap_dio_complete()
>    generic_write_sync()
>      btrfs_file_fsync()
>        inode_lock()
>        <deadlock>
> 
> inode_dio_end() is used to notify the end of DIO data in order
> to synchronize with truncate. Call inode_dio_end() before calling
> generic_write_sync(), so filesystems can lock i_rwsem during a sync.
> 
> ---
>   fs/iomap/direct-io.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index d970c6bbbe11..e01f81e7b76f 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -118,6 +118,7 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
>   			dio_warn_stale_pagecache(iocb->ki_filp);
>   	}
>   
> +	inode_dio_end(file_inode(iocb->ki_filp));
>   	/*
>   	 * If this is a DSYNC write, make sure we push it to stable storage now
>   	 * that we've written data.
> @@ -125,7 +126,6 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
>   	if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
>   		ret = generic_write_sync(iocb, ret);
>   
> -	inode_dio_end(file_inode(iocb->ki_filp));
>   	kfree(dio);
>   
>   	return ret;
> 

Did you verify that xfs or ext4 don't rely on the inode_dio_end() happening 
before the generic_write_sync()?  I wouldn't expect that they would, but we've 
already run into problems making those kind of assumptions.  If it's fine you 
can add

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef
Darrick J. Wong Sept. 22, 2020, 4:31 p.m. UTC | #4
On Tue, Sep 22, 2020 at 10:20:11AM -0400, Josef Bacik wrote:
> On 9/21/20 10:43 AM, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > 
> > iomap complete routine can deadlock with btrfs_fallocate because of the
> > call to generic_write_sync().
> > 
> > P0                      P1
> > inode_lock()            fallocate(FALLOC_FL_ZERO_RANGE)
> > __iomap_dio_rw()        inode_lock()
> >                          <block>
> > <submits IO>
> > <completes IO>
> > inode_unlock()
> >                          <gets inode_lock()>
> >                          inode_dio_wait()
> > iomap_dio_complete()
> >    generic_write_sync()
> >      btrfs_file_fsync()
> >        inode_lock()
> >        <deadlock>
> > 
> > inode_dio_end() is used to notify the end of DIO data in order
> > to synchronize with truncate. Call inode_dio_end() before calling
> > generic_write_sync(), so filesystems can lock i_rwsem during a sync.
> > 
> > ---
> >   fs/iomap/direct-io.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index d970c6bbbe11..e01f81e7b76f 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -118,6 +118,7 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
> >   			dio_warn_stale_pagecache(iocb->ki_filp);
> >   	}
> > +	inode_dio_end(file_inode(iocb->ki_filp));
> >   	/*
> >   	 * If this is a DSYNC write, make sure we push it to stable storage now
> >   	 * that we've written data.
> > @@ -125,7 +126,6 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
> >   	if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
> >   		ret = generic_write_sync(iocb, ret);
> > -	inode_dio_end(file_inode(iocb->ki_filp));
> >   	kfree(dio);
> >   	return ret;
> > 
> 
> Did you verify that xfs or ext4 don't rely on the inode_dio_end() happening
> before the generic_write_sync()?  I wouldn't expect that they would, but
> we've already run into problems making those kind of assumptions.  If it's
> fine you can add

I was gonna ask the same question, but as there's no SoB on this patch I
hadn't really looked at it yet. ;)

Operations that rely on inode_dio_wait to have blocked until all the
directios are complete could get tripped up by iomap not having done the
generic_write_sync to stabilise the metadata, but I /think/ most
operations that do that also themselves flush the file.  But I don't
really know if there's a subtlety there if the inode_dio_wait thread
manages to grab the ILOCK before the generic_write_sync thread does.

--D

> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> 
> Thanks,
> 
> Josef
Goldwyn Rodrigues Sept. 22, 2020, 5:25 p.m. UTC | #5
On  9:31 22/09, Darrick J. Wong wrote:
> On Tue, Sep 22, 2020 at 10:20:11AM -0400, Josef Bacik wrote:
> > On 9/21/20 10:43 AM, Goldwyn Rodrigues wrote:
> > > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > 
> > > iomap complete routine can deadlock with btrfs_fallocate because of the
> > > call to generic_write_sync().
> > > 
> > > P0                      P1
> > > inode_lock()            fallocate(FALLOC_FL_ZERO_RANGE)
> > > __iomap_dio_rw()        inode_lock()
> > >                          <block>
> > > <submits IO>
> > > <completes IO>
> > > inode_unlock()
> > >                          <gets inode_lock()>
> > >                          inode_dio_wait()
> > > iomap_dio_complete()
> > >    generic_write_sync()
> > >      btrfs_file_fsync()
> > >        inode_lock()
> > >        <deadlock>
> > > 
> > > inode_dio_end() is used to notify the end of DIO data in order
> > > to synchronize with truncate. Call inode_dio_end() before calling
> > > generic_write_sync(), so filesystems can lock i_rwsem during a sync.
> > > 
> > > ---
> > >   fs/iomap/direct-io.c | 2 +-
> > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > index d970c6bbbe11..e01f81e7b76f 100644
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -118,6 +118,7 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
> > >   			dio_warn_stale_pagecache(iocb->ki_filp);
> > >   	}
> > > +	inode_dio_end(file_inode(iocb->ki_filp));
> > >   	/*
> > >   	 * If this is a DSYNC write, make sure we push it to stable storage now
> > >   	 * that we've written data.
> > > @@ -125,7 +126,6 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
> > >   	if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
> > >   		ret = generic_write_sync(iocb, ret);
> > > -	inode_dio_end(file_inode(iocb->ki_filp));
> > >   	kfree(dio);
> > >   	return ret;
> > > 
> > 
> > Did you verify that xfs or ext4 don't rely on the inode_dio_end() happening
> > before the generic_write_sync()?  I wouldn't expect that they would, but
> > we've already run into problems making those kind of assumptions.  If it's
> > fine you can add
> 
> I was gonna ask the same question, but as there's no SoB on this patch I
> hadn't really looked at it yet. ;)
> 
> Operations that rely on inode_dio_wait to have blocked until all the
> directios are complete could get tripped up by iomap not having done the
> generic_write_sync to stabilise the metadata, but I /think/ most
> operations that do that also themselves flush the file.  But I don't
> really know if there's a subtlety there if the inode_dio_wait thread
> manages to grab the ILOCK before the generic_write_sync thread does.
> 

I ran xfstests and it was successful. I am testing ext4 now.
Dave Chinner Sept. 22, 2020, 9:49 p.m. UTC | #6
On Tue, Sep 22, 2020 at 09:31:56AM -0700, Darrick J. Wong wrote:
> On Tue, Sep 22, 2020 at 10:20:11AM -0400, Josef Bacik wrote:
> > On 9/21/20 10:43 AM, Goldwyn Rodrigues wrote:
> > > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > 
> > > iomap complete routine can deadlock with btrfs_fallocate because of the
> > > call to generic_write_sync().
> > > 
> > > P0                      P1
> > > inode_lock()            fallocate(FALLOC_FL_ZERO_RANGE)
> > > __iomap_dio_rw()        inode_lock()
> > >                          <block>
> > > <submits IO>
> > > <completes IO>
> > > inode_unlock()
> > >                          <gets inode_lock()>
> > >                          inode_dio_wait()
> > > iomap_dio_complete()
> > >    generic_write_sync()
> > >      btrfs_file_fsync()
> > >        inode_lock()
> > >        <deadlock>
> > > 
> > > inode_dio_end() is used to notify the end of DIO data in order
> > > to synchronize with truncate. Call inode_dio_end() before calling
> > > generic_write_sync(), so filesystems can lock i_rwsem during a sync.
> > > 
> > > ---
> > >   fs/iomap/direct-io.c | 2 +-
> > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > index d970c6bbbe11..e01f81e7b76f 100644
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -118,6 +118,7 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
> > >   			dio_warn_stale_pagecache(iocb->ki_filp);
> > >   	}
> > > +	inode_dio_end(file_inode(iocb->ki_filp));
> > >   	/*
> > >   	 * If this is a DSYNC write, make sure we push it to stable storage now
> > >   	 * that we've written data.
> > > @@ -125,7 +126,6 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
> > >   	if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
> > >   		ret = generic_write_sync(iocb, ret);
> > > -	inode_dio_end(file_inode(iocb->ki_filp));
> > >   	kfree(dio);
> > >   	return ret;
> > > 
> > 
> > Did you verify that xfs or ext4 don't rely on the inode_dio_end() happening
> > before the generic_write_sync()?  I wouldn't expect that they would, but
> > we've already run into problems making those kind of assumptions.  If it's
> > fine you can add
> 
> I was gonna ask the same question, but as there's no SoB on this patch I
> hadn't really looked at it yet. ;)
> 
> Operations that rely on inode_dio_wait to have blocked until all the
> directios are complete could get tripped up by iomap not having done the
> generic_write_sync to stabilise the metadata, but I /think/ most
> operations that do that also themselves flush the file.  But I don't
> really know if there's a subtlety there if the inode_dio_wait thread
> manages to grab the ILOCK before the generic_write_sync thread does.

I did point out in the previous thread that this actually means that
inode_dio_wait() now has inconsistent wait semantics for O_DSYNC
writes. If it's a pure overwrite and we hit the FUA path, the
O_DSYNC write will be complete and guaranteed to be on stable storage
before the IO completes. If the inode is metadata dirty, then the IO
will now be signalled complete *before* the data and metadata are
flushed to stable storage.

Hence, from the perspective of writes to *stable* storage, this
makes the ordering of O_DSYNC DIO against anything waiting for it to
complete to be potentially inconsistent at the stable storage level.

That's an extremely subtle change of behaviour, and something that
would be largely impossible to test or reproduce. And, really, I
don't like having this sort of "oh, it should be fine" handwavy
justification when we are talking about data integrity operations...

Cheers,

Dave.
Christoph Hellwig Sept. 23, 2020, 5:16 a.m. UTC | #7
On Wed, Sep 23, 2020 at 07:49:34AM +1000, Dave Chinner wrote:
> I did point out in the previous thread that this actually means that
> inode_dio_wait() now has inconsistent wait semantics for O_DSYNC
> writes. If it's a pure overwrite and we hit the FUA path, the
> O_DSYNC write will be complete and guaranteed to be on stable storage
> before the IO completes. If the inode is metadata dirty, then the IO
> will now be signalled complete *before* the data and metadata are
> flushed to stable storage.
> 
> Hence, from the perspective of writes to *stable* storage, this
> makes the ordering of O_DSYNC DIO against anything waiting for it to
> complete to be potentially inconsistent at the stable storage level.
> 
> That's an extremely subtle change of behaviour, and something that
> would be largely impossible to test or reproduce. And, really, I
> don't like having this sort of "oh, it should be fine" handwavy
> justification when we are talking about data integrity operations...

... and I replied with a detailed analysis of what it is fine, and
how this just restores the behavior we historically had before
switching to the iomap direct I/O code.  Although if we want to go
into the fine details we did not have the REQ_FUA path back then,
but that does not change the analysis.
Darrick J. Wong Sept. 23, 2020, 5:31 a.m. UTC | #8
On Wed, Sep 23, 2020 at 07:16:58AM +0200, Christoph Hellwig wrote:
> On Wed, Sep 23, 2020 at 07:49:34AM +1000, Dave Chinner wrote:
> > I did point out in the previous thread that this actually means that
> > inode_dio_wait() now has inconsistent wait semantics for O_DSYNC
> > writes. If it's a pure overwrite and we hit the FUA path, the
> > O_DSYNC write will be complete and guaranteed to be on stable storage
> > before the IO completes. If the inode is metadata dirty, then the IO
> > will now be signalled complete *before* the data and metadata are
> > flushed to stable storage.
> > 
> > Hence, from the perspective of writes to *stable* storage, this
> > makes the ordering of O_DSYNC DIO against anything waiting for it to
> > complete to be potentially inconsistent at the stable storage level.
> > 
> > That's an extremely subtle change of behaviour, and something that
> > would be largely impossible to test or reproduce. And, really, I
> > don't like having this sort of "oh, it should be fine" handwavy
> > justification when we are talking about data integrity operations...
> 
> ... and I replied with a detailed analysis of what it is fine, and
> how this just restores the behavior we historically had before
> switching to the iomap direct I/O code.  Although if we want to go
> into the fine details we did not have the REQ_FUA path back then,
> but that does not change the analysis.

You did?  Got a link?  Not sure if vger/oraclemail are still delaying
messages for me.... :/

--D
Christoph Hellwig Sept. 23, 2020, 5:49 a.m. UTC | #9
On Tue, Sep 22, 2020 at 10:31:49PM -0700, Darrick J. Wong wrote:
> > ... and I replied with a detailed analysis of what it is fine, and
> > how this just restores the behavior we historically had before
> > switching to the iomap direct I/O code.  Although if we want to go
> > into the fine details we did not have the REQ_FUA path back then,
> > but that does not change the analysis.
> 
> You did?  Got a link?  Not sure if vger/oraclemail are still delaying
> messages for me.... :/

Two replies from September 17 to the
"Re: [RFC PATCH] btrfs: don't call btrfs_sync_file from iomap context"

thread.

Msg IDs:

20200917055232.GA31646@lst.de

and

20200917064238.GA32441@lst.de
Dave Chinner Sept. 23, 2020, 5:59 a.m. UTC | #10
On Wed, Sep 23, 2020 at 07:49:25AM +0200, Christoph Hellwig wrote:
> On Tue, Sep 22, 2020 at 10:31:49PM -0700, Darrick J. Wong wrote:
> > > ... and I replied with a detailed analysis of what it is fine, and
> > > how this just restores the behavior we historically had before
> > > switching to the iomap direct I/O code.  Although if we want to go
> > > into the fine details we did not have the REQ_FUA path back then,
> > > but that does not change the analysis.
> > 
> > You did?  Got a link?  Not sure if vger/oraclemail are still delaying
> > messages for me.... :/
> 
> Two replies from September 17 to the
> "Re: [RFC PATCH] btrfs: don't call btrfs_sync_file from iomap context"
> 
> thread.
> 
> Msg IDs:
> 
> 20200917055232.GA31646@lst.de
> 
> and
> 
> 20200917064238.GA32441@lst.de

<sigh>

That last one is not in my local archive - vger has been on the
blink lately, so I guess I'm not really that surprised that mail has
gone missing and not just delayed for a day or two....

Cheers,

Dave.
diff mbox series

Patch

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index d970c6bbbe11..e01f81e7b76f 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -118,6 +118,7 @@  ssize_t iomap_dio_complete(struct iomap_dio *dio)
 			dio_warn_stale_pagecache(iocb->ki_filp);
 	}
 
+	inode_dio_end(file_inode(iocb->ki_filp));
 	/*
 	 * If this is a DSYNC write, make sure we push it to stable storage now
 	 * that we've written data.
@@ -125,7 +126,6 @@  ssize_t iomap_dio_complete(struct iomap_dio *dio)
 	if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
 		ret = generic_write_sync(iocb, ret);
 
-	inode_dio_end(file_inode(iocb->ki_filp));
 	kfree(dio);
 
 	return ret;