diff mbox series

[1/3] btrfs: return -EAGAIN for NOWAIT dio reads/writes on compressed and inline extents

Message ID b3864441547e49a69d45c7771aa8cc5e595d18fc.1656934419.git.fdmanana@suse.com (mailing list archive)
State New, archived
Headers show
Series btrfs: a few direct IO fixes/improvements | expand

Commit Message

Filipe Manana July 4, 2022, 11:42 a.m. UTC
From: Filipe Manana <fdmanana@suse.com>

When doing a direct IO read or write, we always return -ENOTBLK when we
find a compressed extent (or an inline extent) so that we fallback to
buffered IO. This however is not ideal in case we are in a NOWAIT context
(io_uring for example), because buffered IO can block and we currently
have no support for NOWAIT semantics for buffered IO, so if we need to
fallback to buffered IO we should first signal the caller that we may
need to block by returning -EAGAIN instead.

This behaviour can also result in short reads being returned to user
space, which although it's not incorrect and user space should be able
to deal with partial reads, it's somewhat surprising and even some popular
applications like QEMU (Link tag #1) and MariaDB (Link tag #2) don't
deal with short reads properly (or at all).

The short read case happens when we try to read from a range that has a
non-compressed and non-inline extent followed by a compressed extent.
After having read the first extent, when we find the compressed extent we
return -ENOTBLK from btrfs_dio_iomap_begin(), which results in iomap to
treat the request as a short read, returning 0 (success) and waiting for
previously submitted bios to complete (this happens at
fs/iomap/direct-io.c:__iomap_dio_rw()). After that, and while at
btrfs_file_read_iter(), we call filemap_read() to use buffered IO to
read the remaining data, and pass it the number of bytes we were able to
read with direct IO. Than at filemap_read() if we get a page fault error
when accessing the read buffer, we return a partial read instead of an
-EFAULT error, because the number of bytes previously read is greater
than zero.

So fix this by returning -EAGAIN for NOWAIT direct IO when we find a
compressed or an inline extent.

Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
Link: https://lore.kernel.org/linux-btrfs/YrrFGO4A1jS0GI0G@atmark-techno.com/
Link: https://jira.mariadb.org/browse/MDEV-27900?focusedCommentId=216582&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-216582
Tested-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/inode.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

Comments

Christoph Hellwig July 4, 2022, 11:57 a.m. UTC | #1
On Mon, Jul 04, 2022 at 12:42:03PM +0100, fdmanana@kernel.org wrote:
> +		 * filemap_read(), we fail to fault in pages for the read buffer,
> +		 * in which case filemap_read() returns a short read (the number
> +		 * of bytes previously read is > 0, so it does not return -EFAULT).

Two overly long lines here, which are especially annoying in block
comments as they completely break the layout.

> +		ret = (flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOTBLK;

Nit: I'd find this a lot more readable with a good old if / else.

Either way the technical change looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
David Sterba July 7, 2022, 4:47 p.m. UTC | #2
On Mon, Jul 04, 2022 at 04:57:41AM -0700, Christoph Hellwig wrote:
> On Mon, Jul 04, 2022 at 12:42:03PM +0100, fdmanana@kernel.org wrote:
> > +		 * filemap_read(), we fail to fault in pages for the read buffer,
> > +		 * in which case filemap_read() returns a short read (the number
> > +		 * of bytes previously read is > 0, so it does not return -EFAULT).
> 
> Two overly long lines here, which are especially annoying in block
> comments as they completely break the layout.

The only line that is not under 81 is the last one and what does not if
is "T).". This is within the acceptable overflow and I adjust many lines
in patches based on how the code looks (ie. avoiding some line breaks)
and if the potentially overflown text does not obscure the meaning.

Keeping the lines under 80 makes sense for me personally when resolving
conflicts in the 3+1 vimdiff view, but the limit is not strict and the
criteria is if the code follows the common patterns we've settled on and
what people in the btrfs group are used to, either reading or writing.
The kernel coding style does not cover everything and is a good starting
point. The rest is at
https://btrfs.wiki.kernel.org/index.php/Development_notes#Coding_style_preferences
diff mbox series

Patch

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7a54f964ff37..b86be4c3513d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7684,7 +7684,19 @@  static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
 	if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags) ||
 	    em->block_start == EXTENT_MAP_INLINE) {
 		free_extent_map(em);
-		ret = -ENOTBLK;
+		/*
+		 * If we are in a NOWAIT context, return -EAGAIN in order to
+		 * fallback to buffered IO. This is not only because we can
+		 * block with buffered IO (no support for NOWAIT semantics at
+		 * the moment) but also to avoid returning short reads to user
+		 * space - this happens if we were able to read some data from
+		 * previous non-compressed extents and then when we fallback to
+		 * buffered IO, at btrfs_file_read_iter() by calling
+		 * filemap_read(), we fail to fault in pages for the read buffer,
+		 * in which case filemap_read() returns a short read (the number
+		 * of bytes previously read is > 0, so it does not return -EFAULT).
+		 */
+		ret = (flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOTBLK;
 		goto unlock_err;
 	}