From patchwork Mon Jul 4 11:42:03 2022
X-Patchwork-Submitter: Filipe Manana
X-Patchwork-Id: 12905159
From: fdmanana@kernel.org
To: linux-btrfs@vger.kernel.org
Subject: [PATCH 1/3] btrfs: return -EAGAIN for NOWAIT dio reads/writes on compressed and inline extents
Date: Mon, 4 Jul 2022 12:42:03 +0100

From: Filipe Manana

When doing a direct IO read or write, we always return -ENOTBLK when we
find a compressed extent (or an inline extent), so that we fall back to
buffered IO. This however is not ideal if we are in a NOWAIT context
(io_uring for example), because buffered IO can block and we currently
have no support for NOWAIT semantics in buffered IO, so if we need to
fall back to buffered IO we should first signal the caller that we may
need to block by returning -EAGAIN instead.

This behaviour can also result in short reads being returned to user
space, which, although not incorrect (user space should be able to deal
with partial reads), is somewhat surprising, and even some popular
applications like QEMU (Link tag #1) and MariaDB (Link tag #2) don't
deal with short reads properly (or at all).

The short read case happens when we try to read from a range that has a
non-compressed and non-inline extent followed by a compressed extent.
After having read the first extent, when we find the compressed extent
we return -ENOTBLK from btrfs_dio_iomap_begin(), which causes iomap to
treat the request as a short read, returning 0 (success) and waiting for
previously submitted bios to complete (this happens at
fs/iomap/direct-io.c:__iomap_dio_rw()). After that, and while at
btrfs_file_read_iter(), we call filemap_read() to use buffered IO to
read the remaining data, and pass it the number of bytes we were able to
read with direct IO. Then, at filemap_read(), if we get a page fault
error when accessing the read buffer, we return a partial read instead
of an -EFAULT error, because the number of bytes previously read is
greater than zero.
So fix this by returning -EAGAIN for NOWAIT direct IO when we find a
compressed or an inline extent.

Reported-by: Dominique MARTINET
Link: https://lore.kernel.org/linux-btrfs/YrrFGO4A1jS0GI0G@atmark-techno.com/
Link: https://jira.mariadb.org/browse/MDEV-27900?focusedCommentId=216582&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-216582
Tested-by: Dominique MARTINET
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Filipe Manana
Reviewed-by: Christoph Hellwig
---
 fs/btrfs/inode.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7a54f964ff37..b86be4c3513d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7684,7 +7684,19 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
 	if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags) ||
 	    em->block_start == EXTENT_MAP_INLINE) {
 		free_extent_map(em);
-		ret = -ENOTBLK;
+		/*
+		 * If we are in a NOWAIT context, return -EAGAIN in order to
+		 * fallback to buffered IO. This is not only because we can
+		 * block with buffered IO (no support for NOWAIT semantics at
+		 * the moment) but also to avoid returning short reads to user
+		 * space - this happens if we were able to read some data from
+		 * previous non-compressed extents and then when we fallback to
+		 * buffered IO, at btrfs_file_read_iter() by calling
+		 * filemap_read(), we fail to fault in pages for the read buffer,
+		 * in which case filemap_read() returns a short read (the number
+		 * of bytes previously read is > 0, so it does not return -EFAULT).
+		 */
+		ret = (flags & IOMAP_NOWAIT) ?
			-EAGAIN : -ENOTBLK;
 		goto unlock_err;
 	}

From patchwork Mon Jul 4 11:42:04 2022
X-Patchwork-Submitter: Filipe Manana
X-Patchwork-Id: 12905157
From: fdmanana@kernel.org
To: linux-btrfs@vger.kernel.org
Subject: [PATCH 2/3] btrfs: don't fallback to buffered IO for NOWAIT direct IO writes
Date: Mon, 4 Jul 2022 12:42:04 +0100
Message-Id: <1a7080444b73f8ca1481a7786e52bdf405193be9.1656934419.git.fdmanana@suse.com>

From: Filipe Manana

Currently, for a direct IO write, if we need to fall back to buffered
IO, either to satisfy the whole write operation or just a part of it, we
do it in the current context even if it's a NOWAIT context. This is not
ideal, because we currently don't have support for NOWAIT semantics in
the buffered IO path (we can block for several reasons), so we should
instead return -EAGAIN to the caller, so that it knows it should retry
(the whole operation or what's left of it) in a context where blocking
is acceptable.

Signed-off-by: Filipe Manana
---
 fs/btrfs/file.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index da41a0c371bc..9c8e3a668d70 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1971,11 +1971,25 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	if (is_sync_write)
 		iocb->ki_flags |= IOCB_DSYNC;
 
-	/* If 'err' is -ENOTBLK then it means we must fallback to buffered IO. */
+	/*
+	 * If 'err' is -ENOTBLK or we have not written all data, then it means
+	 * we must fallback to buffered IO.
+	 */
 	if ((err < 0 && err != -ENOTBLK) || !iov_iter_count(from))
 		goto out;
 
 buffered:
+	/*
+	 * If we are in a NOWAIT context, then return -EAGAIN to signal the caller
+	 * it must retry the operation in a context where blocking is acceptable,
+	 * since we currently don't have NOWAIT semantics support for buffered IO
+	 * and may block there for many reasons (reserving space for example).
+	 */
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		err = -EAGAIN;
+		goto out;
+	}
+
 	pos = iocb->ki_pos;
 	written_buffered = btrfs_buffered_write(iocb, from);
 	if (written_buffered < 0) {

From patchwork Mon Jul 4 11:42:05 2022
X-Patchwork-Submitter: Filipe Manana
X-Patchwork-Id: 12905156
From: fdmanana@kernel.org
To: linux-btrfs@vger.kernel.org
Subject: [PATCH 3/3] btrfs: fault in pages for dio reads/writes in a more controlled way
Date: Mon, 4 Jul 2022 12:42:05 +0100

From: Filipe Manana

When we need to fault in the pages of the iovector of a direct IO read
or write, we try to fault in all the remaining pages. While this works,
it's not ideal if there's a large number of remaining pages and the
system is under memory pressure. By the time we fault in some pages,
some of the previously faulted in pages may have been evicted due to
memory pressure, resulting in slower progress or falling back to
buffered IO (which is fine but it's not ideal).

So limit the number of pages we fault in. The amount is decided based on
what's left of the iovector and the threshold for dirty page rate
limiting (the nr_dirtied and nr_dirtied_pause fields of struct
task_struct), and it's borrowed from gfs2
(fs/gfs2/file.c:should_fault_in_pages()).

Signed-off-by: Filipe Manana
---
 fs/btrfs/file.c | 56 +++++++++++++++++++++++++++++++++++--------------
 1 file changed, 40 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 9c8e3a668d70..1528b8edc7a9 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1846,6 +1846,33 @@ static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
 	return 0;
 }
 
+static size_t dio_fault_in_size(const struct kiocb *iocb,
+				const struct iov_iter *iov,
+				size_t prev_left)
+{
+	const size_t left = iov_iter_count(iov);
+	size_t size = PAGE_SIZE;
+
+	/*
+	 * If there's no progress since the last time we had to fault in pages,
+	 * then we fault in at most 1 page.
Faulting in more than that may
+	 * result in making very slow progress or falling back to buffered IO,
+	 * because by the time we retry the DIO operation some of the first
+	 * remaining pages may have been evicted in order to fault in other
+	 * pages that follow them. That can happen when we are under memory
+	 * pressure and the iov represents a large buffer.
+	 */
+	if (left != prev_left) {
+		int dirty_tresh = current->nr_dirtied_pause - current->nr_dirtied;
+
+		size = max(dirty_tresh, 8) << PAGE_SHIFT;
+		size = min_t(size_t, SZ_1M, size);
+	}
+	size -= offset_in_page(iocb->ki_pos);
+
+	return min(left, size);
+}
+
 static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 {
 	const bool is_sync_write = (iocb->ki_flags & IOCB_DSYNC);
@@ -1956,7 +1983,9 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 		if (left == prev_left) {
 			err = -ENOTBLK;
 		} else {
-			fault_in_iov_iter_readable(from, left);
+			const size_t size = dio_fault_in_size(iocb, from, prev_left);
+
+			fault_in_iov_iter_readable(from, size);
 			prev_left = left;
 			goto again;
 		}
@@ -3737,25 +3766,20 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
 	read = ret;
 	if (iov_iter_count(to) > 0 && (ret == -EFAULT || ret > 0)) {
-		const size_t left = iov_iter_count(to);
+		if (iter_is_iovec(to)) {
+			const size_t left = iov_iter_count(to);
+			const size_t size = dio_fault_in_size(iocb, to, prev_left);
 
-		if (left == prev_left) {
-			/*
-			 * We didn't make any progress since the last attempt,
-			 * fallback to a buffered read for the remainder of the
-			 * range. This is just to avoid any possibility of looping
-			 * for too long.
-			 */
-			ret = read;
+			fault_in_iov_iter_writeable(to, size);
+			prev_left = left;
+			goto again;
 		} else {
 			/*
-			 * We made some progress since the last retry or this is
-			 * the first time we are retrying. Fault in as many pages
-			 * as possible and retry.
+			 * fault_in_iov_iter_writeable() only works for iovecs,
+			 * return with a partial read and fallback to buffered
+			 * IO for the rest of the range.
 			 */
-			fault_in_iov_iter_writeable(to, left);
-			prev_left = left;
-			goto again;
+			ret = read;
 		}
 	}
 
 	btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);