
[01/10] iomap: handle a post-direct I/O invalidate race in iomap_write_delalloc_release

Message ID 20240827051028.1751933-2-hch@lst.de (mailing list archive)
State Superseded, archived
Series [01/10] iomap: handle a post-direct I/O invalidate race in iomap_write_delalloc_release

Commit Message

Christoph Hellwig Aug. 27, 2024, 5:09 a.m. UTC
When direct I/O completion invalidates the page cache it holds neither the
i_rwsem nor the invalidate_lock, so it can race with
iomap_write_delalloc_release.  If the search for the end of the region that
contains data returns the start offset, we have hit such a race and just
need to look for the end of the newly created hole instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap/buffered-io.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)
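
For context, here is a simplified sketch of the loop this hunk sits in,
based on the mainline mapping_seek_hole_data()-driven scan (error
handling and the punch callback plumbing are trimmed, so this is not the
verbatim source).  The detail that matters for the race: "continue" goes
back to the SEEK_DATA lookup, so if the data at start_byte has just been
invalidated, the next pass either finds data further out or terminates.

	/* inside iomap_write_delalloc_release(), simplified */
	while (start_byte < scan_end_byte) {
		loff_t data_end;

		/* find the next range that still has data in the page cache */
		start_byte = mapping_seek_hole_data(inode->i_mapping,
				start_byte, scan_end_byte, SEEK_DATA);
		if (start_byte == -ENXIO || start_byte == scan_end_byte)
			break;		/* nothing left to look at */

		/* find where that data range ends */
		data_end = mapping_seek_hole_data(inode->i_mapping,
				start_byte, scan_end_byte, SEEK_HOLE);

		/*
		 * Racing post-direct I/O invalidation turned this into a
		 * hole: go back and re-run the SEEK_DATA lookup above.
		 */
		if (data_end == start_byte)
			continue;

		/* punch out any delalloc blocks without dirty folios */
		iomap_write_delalloc_scan(inode, &punch_start_byte,
				start_byte, data_end, punch);

		/* move on past the data we just handled */
		start_byte = data_end;
	}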

Comments

Darrick J. Wong Aug. 27, 2024, 4:14 p.m. UTC | #1
On Tue, Aug 27, 2024 at 07:09:48AM +0200, Christoph Hellwig wrote:
> When direct I/O completion invalidates the page cache it holds neither the
> i_rwsem nor the invalidate_lock, so it can race with
> iomap_write_delalloc_release.  If the search for the end of the region that
> contains data returns the start offset, we have hit such a race and just
> need to look for the end of the newly created hole instead.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/iomap/buffered-io.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index f420c53d86acc5..69a931de1979b9 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1241,7 +1241,15 @@ static int iomap_write_delalloc_release(struct inode *inode,
>  			error = data_end;
>  			goto out_unlock;
>  		}
> -		WARN_ON_ONCE(data_end <= start_byte);
> +
> +		/*
> +		 * If we race with post-direct I/O invalidation of the page cache,
> +		 * there might be no data left at start_byte.
> +		 */
> +		if (data_end == start_byte)
> +			continue;

Is there any chance that we could get stuck in a loop here?  I
think it's the case that if SEEK_HOLE returns data_end == start_byte,
then the next time through the loop, the SEEK_DATA will return something
that is > start_byte.  Unless someone is very rapidly writing and
punching the page cache?

Hmm, but then if *xfs* is punching delalloc we're holding the
iolock, so who else could be doing that?

If the answers are 'no' and 'nobody' then
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> +
> +		WARN_ON_ONCE(data_end < start_byte);
>  		WARN_ON_ONCE(data_end > scan_end_byte);
>  
>  		error = iomap_write_delalloc_scan(inode, &punch_start_byte,
> -- 
> 2.43.0
> 
>
Christoph Hellwig Aug. 28, 2024, 4:48 a.m. UTC | #2
On Tue, Aug 27, 2024 at 09:14:16AM -0700, Darrick J. Wong wrote:
> Is there any chance that we could get stuck in a loop here?  I
> think it's the case that if SEEK_HOLE returns data_end == start_byte,
> then the next time through the loop, the SEEK_DATA will return something
> that is > start_byte.

Yes.

> Unless someone is very rapidly writing and
> punching the page cache?
> 
> Hmm, but then if *xfs* is punching delalloc we're holding the
> iolock, so who else could be doing that?

Yes.  It's only the async direct I/O completions that punch without
the lock.
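
For reference, the racing path Christoph refers to is the direct write
completion dropping any stale page cache over the just-written range.  A
rough sketch of that step, assuming the current mainline completion path
(the function below is written out for illustration and its name is made
up; the real helper is the post-direct-write invalidation called from
iomap_dio_complete()):

	/*
	 * Post-direct-write invalidation, sketched: drop page cache
	 * covering the range that was just written.  Neither i_rwsem nor
	 * mapping->invalidate_lock is held here, which is what allows
	 * the race with iomap_write_delalloc_release() described above.
	 */
	static void dio_post_write_invalidate(struct kiocb *iocb, size_t count)
	{
		struct address_space *mapping = iocb->ki_filp->f_mapping;

		if (mapping->nrpages)
			invalidate_inode_pages2_range(mapping,
					iocb->ki_pos >> PAGE_SHIFT,
					(iocb->ki_pos + count - 1) >> PAGE_SHIFT);
	}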
Darrick J. Wong Aug. 28, 2024, 4:13 p.m. UTC | #3
On Wed, Aug 28, 2024 at 06:48:48AM +0200, Christoph Hellwig wrote:
> On Tue, Aug 27, 2024 at 09:14:16AM -0700, Darrick J. Wong wrote:
> > Is there any chance that we could get stuck in a loop here?  I
> > think it's the case that if SEEK_HOLE returns data_end == start_byte,
> > then the next time through the loop, the SEEK_DATA will return something
> > that is > start_byte.
> 
> Yes.
> 
> > Unless someone is very rapidly writing and
> > punching the page cache?
> > 
> > Hmm, but then if *xfs* is punching delalloc we're holding the
> > iolock, so who else could be doing that?
> 
> Yes.  It's only the async direct I/O completions that punch without
> the lock.

Heh, I forgot that I'd asked three questions.  Anyway I'm satisfied with
the answers I got, so

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Though we might have to revisit this for filesystems that don't take
i_rwsem exclusively when writing -- is that a problem?  I guess if you
had two threads both writing and punching the pagecache they could get
into trouble, but that might be a case of "if it hurts don't do that".

--D
Christoph Hellwig Aug. 29, 2024, 3:46 a.m. UTC | #4
On Wed, Aug 28, 2024 at 09:13:38AM -0700, Darrick J. Wong wrote:
> Though we might have to revisit this for filesystems that don't take
> i_rwsem exclusively when writing -- is that a problem?  I guess if you
> had two threads both writing and punching the pagecache they could get
> into trouble, but that might be a case of "if it hurts don't do that".

No i_rwsem for buffered writes?  You can't really do that without hell
breaking loose.  At least not without another exclusive lock.
Darrick J. Wong Aug. 29, 2024, 2:22 p.m. UTC | #5
On Thu, Aug 29, 2024 at 05:46:26AM +0200, Christoph Hellwig wrote:
> On Wed, Aug 28, 2024 at 09:13:38AM -0700, Darrick J. Wong wrote:
> > Though we might have to revisit this for filesystems that don't take
> > i_rwsem exclusively when writing -- is that a problem?  I guess if you
> > had two threads both writing and punching the pagecache they could get
> > into trouble, but that might be a case of "if it hurts don't do that".
> 
> No i_rwsem for buffered writes?  You can't really do that without hell
> breaking loose.  At least not without another exclusive lock.

Well, i_rwsem in shared mode.  ISTR ext4 does inode_lock_shared and
serializes on the folio lock, at least for non extending writes.

--D
Christoph Hellwig Aug. 30, 2024, 3:42 a.m. UTC | #6
On Thu, Aug 29, 2024 at 07:22:19AM -0700, Darrick J. Wong wrote:
> On Thu, Aug 29, 2024 at 05:46:26AM +0200, Christoph Hellwig wrote:
> > On Wed, Aug 28, 2024 at 09:13:38AM -0700, Darrick J. Wong wrote:
> > > Though we might have to revisit this for filesystems that don't take
> > > i_rwsem exclusively when writing -- is that a problem?  I guess if you
> > > had two threads both writing and punching the pagecache they could get
> > > into trouble, but that might be a case of "if it hurts don't do that".
> > 
> > No i_rwsem for buffered writes?  You can't really do that without hell
> > breaking loose.  At least not without another exclusive lock.
> 
> Well, i_rwsem in shared mode.  ISTR ext4 does inode_lock_shared and
> serializes on the folio lock, at least for non extending writes.

ext4 uses plain inode_lock/unlock, which is an exclusive i_rwsem.
Given that POSIX requires the entire write to be synchronized vs other
writes, applications would break otherwise.
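
For the record, a minimal sketch of the locking pattern being described,
assuming the usual shape of a buffered ->write_iter implementation (this
is illustrative, not the verbatim ext4 code, and the function name is
made up):

	static ssize_t example_buffered_write_iter(struct kiocb *iocb,
						   struct iov_iter *from)
	{
		struct inode *inode = file_inode(iocb->ki_filp);
		ssize_t ret;

		/* exclusive i_rwsem for the duration of the copy-in */
		inode_lock(inode);
		ret = generic_perform_write(iocb, from);
		inode_unlock(inode);

		return ret;
	}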

Patch

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index f420c53d86acc5..69a931de1979b9 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1241,7 +1241,15 @@ static int iomap_write_delalloc_release(struct inode *inode,
 			error = data_end;
 			goto out_unlock;
 		}
-		WARN_ON_ONCE(data_end <= start_byte);
+
+		/*
+		 * If we race with post-direct I/O invalidation of the page cache,
+		 * there might be no data left at start_byte.
+		 */
+		if (data_end == start_byte)
+			continue;
+
+		WARN_ON_ONCE(data_end < start_byte);
 		WARN_ON_ONCE(data_end > scan_end_byte);
 
 		error = iomap_write_delalloc_scan(inode, &punch_start_byte,