From patchwork Mon Jul 4 11:42:03 2022
X-Patchwork-Submitter: Filipe Manana
X-Patchwork-Id: 12905159
From: fdmanana@kernel.org
To: linux-btrfs@vger.kernel.org
Subject: [PATCH 1/3] btrfs: return -EAGAIN for NOWAIT dio reads/writes on compressed and inline extents
Date: Mon, 4 Jul 2022 12:42:03 +0100

From: Filipe Manana

When doing a direct IO read or write, we always return -ENOTBLK when we
find a compressed extent (or an inline extent), so that we fall back to
buffered IO. This however is not ideal if we are in a NOWAIT context
(io_uring for example), because buffered IO can block and we currently
have no support for NOWAIT semantics in buffered IO, so if we need to
fall back to buffered IO we should first signal the caller that we may
need to block by returning -EAGAIN instead.

This behaviour can also result in short reads being returned to user
space, which, although not incorrect (user space should be able to deal
with partial reads), is somewhat surprising, and even some popular
applications like QEMU (Link tag #1) and MariaDB (Link tag #2) don't
deal with short reads properly (or at all).

The short read case happens when we try to read from a range that has a
non-compressed and non-inline extent followed by a compressed extent.
After having read the first extent, when we find the compressed extent
we return -ENOTBLK from btrfs_dio_iomap_begin(), which causes iomap to
treat the request as a short read, returning 0 (success) and waiting for
previously submitted bios to complete (this happens at
fs/iomap/direct-io.c:__iomap_dio_rw()). After that, and while at
btrfs_file_read_iter(), we call filemap_read() to use buffered IO to
read the remaining data, and pass it the number of bytes we were able to
read with direct IO. Then, at filemap_read(), if we get a page fault
error when accessing the read buffer, we return a partial read instead
of an -EFAULT error, because the number of bytes previously read is
greater than zero.
So fix this by returning -EAGAIN for NOWAIT direct IO when we find a
compressed or an inline extent.

Reported-by: Dominique MARTINET
Link: https://lore.kernel.org/linux-btrfs/YrrFGO4A1jS0GI0G@atmark-techno.com/
Link: https://jira.mariadb.org/browse/MDEV-27900?focusedCommentId=216582&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-216582
Tested-by: Dominique MARTINET
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Filipe Manana
Reviewed-by: Christoph Hellwig
---
 fs/btrfs/inode.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7a54f964ff37..b86be4c3513d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7684,7 +7684,19 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
 	if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags) ||
 	    em->block_start == EXTENT_MAP_INLINE) {
 		free_extent_map(em);
-		ret = -ENOTBLK;
+		/*
+		 * If we are in a NOWAIT context, return -EAGAIN in order to
+		 * fallback to buffered IO. This is not only because we can
+		 * block with buffered IO (no support for NOWAIT semantics at
+		 * the moment) but also to avoid returning short reads to user
+		 * space - this happens if we were able to read some data from
+		 * previous non-compressed extents and then when we fallback to
+		 * buffered IO, at btrfs_file_read_iter() by calling
+		 * filemap_read(), we fail to fault in pages for the read buffer,
+		 * in which case filemap_read() returns a short read (the number
+		 * of bytes previously read is > 0, so it does not return -EFAULT).
+		 */
+		ret = (flags & IOMAP_NOWAIT) ?
			-EAGAIN : -ENOTBLK;
 		goto unlock_err;
 	}

From patchwork Mon Jul 4 11:42:04 2022
X-Patchwork-Submitter: Filipe Manana
X-Patchwork-Id: 12905157
From: fdmanana@kernel.org
To: linux-btrfs@vger.kernel.org
Subject: [PATCH 2/3] btrfs: don't fallback to buffered IO for NOWAIT direct IO writes
Date: Mon, 4 Jul 2022 12:42:04 +0100
Message-Id: <1a7080444b73f8ca1481a7786e52bdf405193be9.1656934419.git.fdmanana@suse.com>

From: Filipe Manana

Currently, for a direct IO write, if we need to fall back to buffered
IO, either to satisfy the whole write operation or just a part of it, we
do it in the current context even if it's a NOWAIT context. This is not
ideal, because we currently don't have support for NOWAIT semantics in
the buffered IO path (we can block for several reasons), so we should
instead return -EAGAIN to the caller, so that it knows it should retry
(the whole operation or what's left of it) in a context where blocking
is acceptable.

Signed-off-by: Filipe Manana
---
 fs/btrfs/file.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index da41a0c371bc..9c8e3a668d70 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1971,11 +1971,25 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	if (is_sync_write)
 		iocb->ki_flags |= IOCB_DSYNC;
 
-	/* If 'err' is -ENOTBLK then it means we must fallback to buffered IO. */
+	/*
+	 * If 'err' is -ENOTBLK or we have not written all data, then it means
+	 * we must fallback to buffered IO.
+	 */
 	if ((err < 0 && err != -ENOTBLK) || !iov_iter_count(from))
 		goto out;
 
 buffered:
+	/*
+	 * If we are in a NOWAIT context, then return -EAGAIN to signal the caller
+	 * it must retry the operation in a context where blocking is acceptable,
+	 * since we currently don't have NOWAIT semantics support for buffered IO
+	 * and may block there for many reasons (reserving space for example).
+	 */
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		err = -EAGAIN;
+		goto out;
+	}
+
 	pos = iocb->ki_pos;
 	written_buffered = btrfs_buffered_write(iocb, from);
 	if (written_buffered < 0) {

From patchwork Mon Jul 4 11:42:05 2022
X-Patchwork-Submitter: Filipe Manana
X-Patchwork-Id: 12905156
From: fdmanana@kernel.org
To: linux-btrfs@vger.kernel.org
Subject: [PATCH 3/3] btrfs: fault in pages for dio reads/writes in a more controlled way
Date: Mon, 4 Jul 2022 12:42:05 +0100

From: Filipe Manana

When we need to fault in the pages of the iovector of a direct IO read
or write, we try to fault in all the remaining pages. While this works,
it's not ideal if there's a large number of remaining pages and the
system is under memory pressure. By the time we fault in some pages,
some of the previously faulted in pages may have been evicted due to
memory pressure, resulting in slower progress or falling back to
buffered IO (which is fine but it's not ideal).

So limit the number of pages we fault in. The amount is decided based on
what's left of the iovector and the threshold for dirty page rate
limiting (the nr_dirtied and nr_dirtied_pause fields of struct
task_struct), and it's borrowed from gfs2
(fs/gfs2/file.c:should_fault_in_pages()).

Signed-off-by: Filipe Manana
---
 fs/btrfs/file.c | 56 +++++++++++++++++++++++++++++++++++--------------
 1 file changed, 40 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 9c8e3a668d70..1528b8edc7a9 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1846,6 +1846,33 @@ static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
 	return 0;
 }
 
+static size_t dio_fault_in_size(const struct kiocb *iocb,
+				const struct iov_iter *iov,
+				size_t prev_left)
+{
+	const size_t left = iov_iter_count(iov);
+	size_t size = PAGE_SIZE;
+
+	/*
+	 * If there's no progress since the last time we had to fault in pages,
+	 * then we fault in at most 1 page.
Faulting in more than that may
+	 * result in making very slow progress or falling back to buffered IO,
+	 * because by the time we retry the DIO operation some of the first
+	 * remaining pages may have been evicted in order to fault in other
+	 * pages that follow them. That can happen when we are under memory
+	 * pressure and the iov represents a large buffer.
+	 */
+	if (left != prev_left) {
+		int dirty_tresh = current->nr_dirtied_pause - current->nr_dirtied;
+
+		size = max(dirty_tresh, 8) << PAGE_SHIFT;
+		size = min_t(size_t, SZ_1M, size);
+	}
+	size -= offset_in_page(iocb->ki_pos);
+
+	return min(left, size);
+}
+
 static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 {
 	const bool is_sync_write = (iocb->ki_flags & IOCB_DSYNC);
@@ -1956,7 +1983,9 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 		if (left == prev_left) {
 			err = -ENOTBLK;
 		} else {
-			fault_in_iov_iter_readable(from, left);
+			const size_t size = dio_fault_in_size(iocb, from, prev_left);
+
+			fault_in_iov_iter_readable(from, size);
 			prev_left = left;
 			goto again;
 		}
@@ -3737,25 +3766,20 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
 	read = ret;
 	if (iov_iter_count(to) > 0 && (ret == -EFAULT || ret > 0)) {
-		const size_t left = iov_iter_count(to);
+		if (iter_is_iovec(to)) {
+			const size_t left = iov_iter_count(to);
+			const size_t size = dio_fault_in_size(iocb, to, prev_left);
 
-		if (left == prev_left) {
-			/*
-			 * We didn't make any progress since the last attempt,
-			 * fallback to a buffered read for the remainder of the
-			 * range. This is just to avoid any possibility of looping
-			 * for too long.
-			 */
-			ret = read;
+			fault_in_iov_iter_writeable(to, size);
+			prev_left = left;
+			goto again;
 		} else {
 			/*
-			 * We made some progress since the last retry or this is
-			 * the first time we are retrying. Fault in as many pages
-			 * as possible and retry.
+			 * fault_in_iov_iter_writeable() only works for iovecs,
+			 * return with a partial read and fallback to buffered
+			 * IO for the rest of the range.
 			 */
-			fault_in_iov_iter_writeable(to, left);
-			prev_left = left;
-			goto again;
+			ret = read;
 		}
 	}
 
 	btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);