[v2] iomap: return partial I/O count on error in iomap_dio_bio_actor

Message ID 20200319150805.uaggnfue5xgaougx@fiona (mailing list archive)
State New, archived

Commit Message

Goldwyn Rodrigues March 19, 2020, 3:08 p.m. UTC
Currently, I/Os that complete with an error indicate this by passing
written == 0 to the iomap_end function.  However, btrfs needs to know how
many bytes were written for its own accounting.  Change the convention
to pass the number of bytes which were actually written, and change the
only user (ext4) to check for a short write instead of a zero length
write.

For filesystems that do not define ->iomap_end(), check for
dio->error again after the iomap_apply() call to diagnose the error.

Changes since v1:
 - Take the iov_iter rollback functions into account
 - Double check errors for filesystems not implementing iomap_end()

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
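
As a rough illustration of the convention change described in the commit
message, the following userspace sketch contrasts the old check
(written == 0) with the proposed one (written < length).  The names
sketch_end_old(), sketch_end_new() and the SKETCH_* flags are hypothetical
stand-ins, not kernel symbols; only the shape of the condition is taken from
the ext4 hunk of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#ifndef ENOTBLK
#define ENOTBLK 15	/* fallback for non-Linux hosts; assumption */
#endif

/* Hypothetical stand-ins for IOMAP_WRITE and IOMAP_DIRECT. */
#define SKETCH_WRITE	(1u << 0)
#define SKETCH_DIRECT	(1u << 1)

/* Old convention: a failed I/O is reported to ->iomap_end as written == 0. */
static int sketch_end_old(unsigned int flags, int64_t length, int64_t written)
{
	assert(written >= 0 && written <= length);
	if ((flags & (SKETCH_WRITE | SKETCH_DIRECT)) && written == 0)
		return -ENOTBLK;
	return 0;
}

/*
 * Proposed convention: written carries the bytes actually written, so a
 * failure shows up as a short write rather than a zero-length one.
 */
static int sketch_end_new(unsigned int flags, int64_t length, int64_t written)
{
	assert(written >= 0 && written <= length);
	if ((flags & (SKETCH_WRITE | SKETCH_DIRECT)) && written < length)
		return -ENOTBLK;
	return 0;
}
```

With a 4096-byte range of which only 1024 bytes were written, the old check
does not trigger while the new one does; a complete write passes both.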

Comments

Christoph Hellwig March 20, 2020, 2:05 p.m. UTC | #1
I spent a fair amount of time looking over this change, and I am
starting to feel very bad about it.  iomap_apply() has pretty clear
semantics of either return an error, or return the bytes processed,
and in general these semantics work just fine.

The thing that breaks this concept is the btrfs submit_bio hook,
which allows the file system to keep state for each bio actually
submitted.  But I think you can simply keep the length internally
in btrfs - use the space in iomap->private as a counter of how
much was allocated, pass the iomap to the submit_io hook, and
update it there, and then deal with the rest in ->iomap_end.

That assumes ->iomap_end actually is the right place - can someone
explain what the expected call site for __endio_write_update_ordered
is?  It kinda sorta looks to me like something that would want to
be called after I/O completion, not after I/O submission, but maybe
I misunderstand the code.
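
The suggestion above (count in iomap->private, update it from the submit
hook, settle the remainder in ->iomap_end) could look roughly like the
userspace model below.  Everything here is a hypothetical sketch: struct
sketch_iomap only mimics the private pointer of struct iomap, and the
sketch_* functions are stand-ins for the filesystem's begin, submit and end
hooks, not the real btrfs or iomap code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Minimal stand-in for struct iomap; only the private pointer is modelled. */
struct sketch_iomap {
	void *private;
};

/* Per-mapping state the filesystem keeps for itself. */
struct sketch_dio_data {
	uint64_t reserved;	/* bytes set up in the begin hook */
	uint64_t submitted;	/* bytes handed to the submit hook so far */
};

/* Begin hook: allocate the counter and record how much was reserved. */
static int sketch_iomap_begin(struct sketch_iomap *iomap, uint64_t length)
{
	struct sketch_dio_data *d = calloc(1, sizeof(*d));

	if (!d)
		return -1;
	d->reserved = length;
	iomap->private = d;
	return 0;
}

/* Submit hook: account for each bio as it is submitted. */
static void sketch_submit_io(struct sketch_iomap *iomap, uint64_t bio_len)
{
	struct sketch_dio_data *d = iomap->private;

	assert(d->submitted + bio_len <= d->reserved);
	d->submitted += bio_len;
}

/*
 * End hook: return the bytes that were reserved but never submitted, which
 * is the part the filesystem still has to clean up, then drop the state.
 */
static uint64_t sketch_iomap_end(struct sketch_iomap *iomap)
{
	struct sketch_dio_data *d = iomap->private;
	uint64_t unsubmitted = d->reserved - d->submitted;

	free(d);
	iomap->private = NULL;
	return unsubmitted;
}
```

The point of the model is that the generic code never needs to report a
partial count: the filesystem derives it from its own counter.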
Josef Bacik March 20, 2020, 2:23 p.m. UTC | #2
On 3/20/20 10:05 AM, Christoph Hellwig wrote:
> I spent a fair amount of time looking over this change, and I am
> starting to feel very bad about it.  iomap_apply() has pretty clear
> semantics of either return an error, or return the bytes processed,
> and in general these semantics work just fine.
> 
> The thing that breaks this concept is the btrfs submit_bio hook,
> which allows the file system to keep state for each bio actually
> submitted.  But I think you can simply keep the length internally
> in btrfs - use the space in iomap->private as a counter of how
> much was allocated, pass the iomap to the submit_io hook, and
> update it there, and then deal with the rest in ->iomap_end.
> 
> That assumes ->iomap_end actually is the right place - can someone
> explain what the expected call site for __endio_write_update_ordered
> is?  It kinda sorta looks to me like something that would want to
> be called after I/O completion, not after I/O submission, but maybe
> I misunderstand the code.
> 

I'm not sure what you're looking at specifically wrt error handling, but I can 
explain __endio_write_update_ordered.

Btrfs has ordered extents to keep track of an extent that currently has IO being
done on it.  Generally that IO takes multiple bios, so we keep track of the
outstanding size of the IO being done, and each bio completes and thus removes
its size from the pending size.  If any one of those bios has an error we need
to make sure we discard the whole ordered extent, as part of it won't be valid.
From a cursory look at the current code, I assume that's what's confusing you: we
call this when we have an error in the O_DIRECT code.  This is just so we get
the proper cleanup for the ordered extent.  People will wait on the ordered
extent to be completed, so if we've started an ordered extent and aren't able to
complete the range we need to call __endio_write_update_ordered() so that the
ordered extent is finished and we wake up any waiters.

Does this help?  If I need to I can context switch into whatever you're looking 
at, but I'm going to avoid looking and hope I can just shout useful information 
in your direction ;).  Thanks,

Josef
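
The ordered extent accounting described above can be modelled in a few lines
of userspace C.  This is only a sketch under stated assumptions: struct
sketch_ordered_extent and the sketch_ordered_* helpers are hypothetical names,
and the finished flag stands in for waking up waiters; the real btrfs
structures carry far more state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of an ordered extent's pending-size accounting. */
struct sketch_ordered_extent {
	uint64_t bytes_left;	/* size not yet accounted as completed */
	bool io_error;		/* set once any part of the range failed */
	bool finished;		/* stands in for "waiters were woken" */
};

static void sketch_ordered_init(struct sketch_ordered_extent *oe, uint64_t len)
{
	oe->bytes_left = len;
	oe->io_error = false;
	oe->finished = (len == 0);
}

/*
 * Account len bytes as done.  Called from bio completion, and also for a
 * range that could not be submitted after an error, so that the extent
 * still reaches bytes_left == 0.  Returns true once the extent is finished.
 */
static bool sketch_ordered_endio(struct sketch_ordered_extent *oe,
				 uint64_t len, bool uptodate)
{
	assert(len <= oe->bytes_left);
	if (!uptodate)
		oe->io_error = true;	/* the whole extent must be discarded */
	oe->bytes_left -= len;
	if (oe->bytes_left == 0)
		oe->finished = true;
	return oe->finished;
}
```

If the last call for the unsubmitted range were skipped, bytes_left would
never reach zero and anything waiting on the extent would wait forever, which
is the cleanup problem discussed in this thread.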
Christoph Hellwig March 20, 2020, 2:35 p.m. UTC | #3
On Fri, Mar 20, 2020 at 10:23:43AM -0400, Josef Bacik wrote:
> I'm not sure what you're looking at specifically wrt error handling, but I
> can explain __endio_write_update_ordered.
> 
> Btrfs has ordered extents to keep track of an extent that currently has IO
> being done on it.  Generally that IO takes multiple bio's, so we keep track
> of the outstanding size of the IO being done, and each bio completes and
> thus removes its size from the pending size.  If any one of those bios has
> an error we need to make sure we discard the whole ordered extent, as part
> of it won't be valid. Just a cursory look at the current code I assume
> that's what's confusing you, we call this when we have an error in the
> O_DIRECT code.  This is just so we get the proper cleanup for the ordered
> extent.  People will wait on the ordered extent to be completed, so if we've
> started an ordered extent and aren't able to complete the range we need to
> do __endio_write_update_ordered() so that the ordered extent is finished and
> we wakeup any waiters.
> 
> Does this help?  If I need to I can context switch into whatever you're
> looking at, but I'm going to avoid looking and hope I can just shout useful
> information in your direction ;).  Thanks,

Yes, this helps a lot.  This is about the patches from Goldwyn to
convert btrfs to use the iomap direct I/O code.  In that series
he currently calls __endio_write_update_ordered from the ->iomap_end
method, which for direct I/O is called after all bios are submitted,
to complete ordered extents for a range after an I/O error, that
is, one that no I/O has been submitted to.  The accounting for that
is a little complicated.
Goldwyn Rodrigues March 20, 2020, 3:35 p.m. UTC | #4
On  7:35 20/03, Christoph Hellwig wrote:
> On Fri, Mar 20, 2020 at 10:23:43AM -0400, Josef Bacik wrote:
> > I'm not sure what you're looking at specifically wrt error handling, but I
> > can explain __endio_write_update_ordered.
> > 
> > Btrfs has ordered extents to keep track of an extent that currently has IO
> > being done on it.  Generally that IO takes multiple bio's, so we keep track
> > of the outstanding size of the IO being done, and each bio completes and
> > thus removes its size from the pending size.  If any one of those bios has
> > an error we need to make sure we discard the whole ordered extent, as part
> > of it won't be valid. Just a cursory look at the current code I assume
> > that's what's confusing you, we call this when we have an error in the
> > O_DIRECT code.  This is just so we get the proper cleanup for the ordered
> > extent.  People will wait on the ordered extent to be completed, so if we've
> > started an ordered extent and aren't able to complete the range we need to
> > do __endio_write_update_ordered() so that the ordered extent is finished and
> > we wakeup any waiters.
> > 
> > Does this help?  If I need to I can context switch into whatever you're
> > looking at, but I'm going to avoid looking and hope I can just shout useful
> > information in your direction ;).  Thanks,
> 
> Yes, this helps a lot.  This is about the patches from Goldwyn to
> convert btrfs to use the iomap direct I/O code.  And in that series
> he currently calls __endio_write_update_ordered from the ->iomap_end
> method, which for direct I/O is called after all bios are submitted
> to complete ordered extents for a range after an I/O error, that
> is one that no I/O has been submitted to, and the accounting for that
> is a little complicated..

I think you meant "some" instead of "no".

Yes, keeping the information in iomap->private and setting it in
btrfs_submit_direct() would be better. I will modify the code and
re-test. Thanks!
Patch

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index fa0ff78..d52c70f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3475,7 +3475,7 @@  static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
 	 * the I/O. Any blocks that may have been allocated in preparation for
 	 * the direct I/O will be reused during buffered I/O.
 	 */
-	if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written == 0)
+	if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written < length)
 		return -ENOTBLK;
 
 	return 0;
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 41c1e7c..b5f4d4a 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -264,7 +264,7 @@  iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		size_t n;
 		if (dio->error) {
 			iov_iter_revert(dio->submit.iter, copied);
-			copied = ret = 0;
+			ret = dio->error;
 			goto out;
 		}
 
@@ -325,8 +325,17 @@  iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 			iomap_dio_zero(dio, iomap, pos, fs_block_size - pad);
 	}
 out:
-	/* Undo iter limitation to current extent */
-	iov_iter_reexpand(dio->submit.iter, orig_count - copied);
+	/*
+	 * Undo iter limitation to current extent
+	 * If there is an error, undo the entire extent. However, return the
+	 * bytes copied so far for filesystems such as btrfs to account for
+	 * submitted I/O.
+	 */
+	if (ret < 0)
+		iov_iter_reexpand(dio->submit.iter, orig_count);
+	else
+		iov_iter_reexpand(dio->submit.iter, orig_count - copied);
+
 	if (copied)
 		return copied;
 	return ret;
@@ -499,6 +508,10 @@  iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	do {
 		ret = iomap_apply(inode, pos, count, flags, ops, dio,
 				iomap_dio_actor);
+
+		if (ret >= 0 && dio->error)
+			ret = dio->error;
+
 		if (ret <= 0) {
 			/* magic error code to fall back to buffered I/O */
 			if (ret == -ENOTBLK) {