From patchwork Wed Oct 30 06:33:57 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Qu Wenruo <wqu@suse.com>
X-Patchwork-Id: 13855967
Received: from smtp-out2.suse.de (smtp-out2.suse.de [195.135.223.131])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id B754419DF95
	for <linux-btrfs@vger.kernel.org>; Wed, 30 Oct 2024 06:34:38 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=195.135.223.131
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1730270081; cv=none;
 b=ro0WsjmGOww/YY6JAb8ZCQmOij3/sMgSehiMGzNWSUyiuzIbmetRsTyRvxvxK2Vmam+zThfryn+Y77n2LAGVmXJqgf0kElU72vRZ4htXO3eUXmYxnIFwaQtvgmxP/iE3TadvA02rYy7UNakXJ8jacpdFdfqzHanT0uX/Sk1X67g=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1730270081; c=relaxed/simple;
	bh=h1J3GgOA2jCNdbp1PGrtNt1SUrfy9iElE41Mb1XoEUg=;
	h=From:To:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=MkvL/fHRErXUGZIYLRDecUlel8ee2RBiZHFUXZaqjjJ/3r8g47thg1yRFMVbj1GmiruepgaKX3KWLCWXsD/p8Ee7iroQJ0crse9w0AEkcy7CNIpUknQNJGlI4bETLhkNir6wKJezZTqKUahNHJoow0GS749+4n2jc3Xgj4hmoiU=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=suse.com;
 spf=pass smtp.mailfrom=suse.com;
 dkim=pass (1024-bit key) header.d=suse.com header.i=@suse.com
 header.b=dvYpr1yg;
 dkim=pass (1024-bit key) header.d=suse.com header.i=@suse.com
 header.b=dvYpr1yg; arc=none smtp.client-ip=195.135.223.131
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=suse.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=suse.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=suse.com header.i=@suse.com
 header.b="dvYpr1yg";
	dkim=pass (1024-bit key) header.d=suse.com header.i=@suse.com
 header.b="dvYpr1yg"
Received: from imap1.dmz-prg2.suse.org (unknown [10.150.64.97])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest
 SHA256)
	(No client certificate requested)
	by smtp-out2.suse.de (Postfix) with ESMTPS id D03A31F7D9
	for <linux-btrfs@vger.kernel.org>; Wed, 30 Oct 2024 06:34:36 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1;
	t=1730270076; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:
	 mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=jmQVfuqidDBvATraHd+tbzzmEJJJW4E7ih4Hrt/Qw+M=;
	b=dvYpr1ygnP32zUMkjk40u0QlZx1PT2s12SGwnHq6e95VO+HmTPvEXhcmN/eIXUs/WzX0jn
	fX24h8PndmHyY1YcmoQoaycWvMmlFyG18JKnfKCd3L5OqFfjiMk2VO5L1gbnwjgusk6Zyg
	9pF1ryt8hdd03ABiyskZaUz1v47Pk2A=
Authentication-Results: smtp-out2.suse.de;
	none
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1;
	t=1730270076; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:
	 mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=jmQVfuqidDBvATraHd+tbzzmEJJJW4E7ih4Hrt/Qw+M=;
	b=dvYpr1ygnP32zUMkjk40u0QlZx1PT2s12SGwnHq6e95VO+HmTPvEXhcmN/eIXUs/WzX0jn
	fX24h8PndmHyY1YcmoQoaycWvMmlFyG18JKnfKCd3L5OqFfjiMk2VO5L1gbnwjgusk6Zyg
	9pF1ryt8hdd03ABiyskZaUz1v47Pk2A=
Received: from imap1.dmz-prg2.suse.org (localhost [127.0.0.1])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest
 SHA256)
	(No client certificate requested)
	by imap1.dmz-prg2.suse.org (Postfix) with ESMTPS id 19C34136A5
	for <linux-btrfs@vger.kernel.org>; Wed, 30 Oct 2024 06:34:35 +0000 (UTC)
Received: from dovecot-director2.suse.de ([2a07:de40:b281:106:10:150:64:167])
	by imap1.dmz-prg2.suse.org with ESMTPSA
	id QAtyM3vTIWcFcwAAD6G6ig
	(envelope-from <wqu@suse.com>)
	for <linux-btrfs@vger.kernel.org>; Wed, 30 Oct 2024 06:34:35 +0000
From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Subject: [PATCH 1/6] btrfs: fix double accounting of ordered extents during
 errors
Date: Wed, 30 Oct 2024 17:03:57 +1030
Message-ID: 
 <2faab8a96c6dd2a414a96e4cebae97ecbddf021d.1730269807.git.wqu@suse.com>
X-Mailer: git-send-email 2.47.0
In-Reply-To: <cover.1730269807.git.wqu@suse.com>
References: <cover.1730269807.git.wqu@suse.com>
Precedence: bulk
X-Mailing-List: linux-btrfs@vger.kernel.org
List-Id: <linux-btrfs.vger.kernel.org>
List-Subscribe: <mailto:linux-btrfs+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-btrfs+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Spam-Level: 
X-Spamd-Result: default: False [-2.80 / 50.00];
	BAYES_HAM(-3.00)[100.00%];
	MID_CONTAINS_FROM(1.00)[];
	NEURAL_HAM_LONG(-1.00)[-1.000];
	R_MISSING_CHARSET(0.50)[];
	NEURAL_HAM_SHORT(-0.20)[-1.000];
	MIME_GOOD(-0.10)[text/plain];
	FUZZY_BLOCKED(0.00)[rspamd.com];
	RCVD_VIA_SMTP_AUTH(0.00)[];
	RCPT_COUNT_ONE(0.00)[1];
	ARC_NA(0.00)[];
	DKIM_SIGNED(0.00)[suse.com:s=susede1];
	DBL_BLOCKED_OPENRESOLVER(0.00)[suse.com:email,suse.com:mid];
	FROM_EQ_ENVFROM(0.00)[];
	FROM_HAS_DN(0.00)[];
	MIME_TRACE(0.00)[0:+];
	RCVD_COUNT_TWO(0.00)[2];
	TO_MATCH_ENVRCPT_ALL(0.00)[];
	TO_DN_NONE(0.00)[];
	PREVIOUSLY_DELIVERED(0.00)[linux-btrfs@vger.kernel.org];
	RCVD_TLS_ALL(0.00)[]
X-Spam-Score: -2.80
X-Spam-Flag: NO

[BUG]
Btrfs will fail generic/750 randomly if its sector size is smaller than
page size.

One of the warning looks like this:

 ------------[ cut here ]------------
 WARNING: CPU: 1 PID: 90263 at fs/btrfs/ordered-data.c:360 can_finish_ordered_extent+0x33c/0x390 [btrfs]
 CPU: 1 UID: 0 PID: 90263 Comm: kworker/u18:1 Tainted: G           OE      6.12.0-rc3-custom+ #79
 Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
 pc : can_finish_ordered_extent+0x33c/0x390 [btrfs]
 lr : can_finish_ordered_extent+0xdc/0x390 [btrfs]
 Call trace:
  can_finish_ordered_extent+0x33c/0x390 [btrfs]
  btrfs_mark_ordered_io_finished+0x130/0x2b8 [btrfs]
  extent_writepage+0xfc/0x338 [btrfs]
  extent_write_cache_pages+0x1d4/0x4b8 [btrfs]
  btrfs_writepages+0x94/0x158 [btrfs]
  do_writepages+0x74/0x190
  filemap_fdatawrite_wbc+0x88/0xc8
  start_delalloc_inodes+0x180/0x3b0 [btrfs]
  btrfs_start_delalloc_roots+0x17c/0x288 [btrfs]
  shrink_delalloc+0x11c/0x280 [btrfs]
  flush_space+0x27c/0x310 [btrfs]
  btrfs_async_reclaim_metadata_space+0xcc/0x208 [btrfs]
  process_one_work+0x228/0x670
  worker_thread+0x1bc/0x360
  kthread+0x100/0x118
  ret_from_fork+0x10/0x20
 irq event stamp: 9784200
 hardirqs last  enabled at (9784199): [<ffffd21ec54dc01c>] _raw_spin_unlock_irqrestore+0x74/0x80
 hardirqs last disabled at (9784200): [<ffffd21ec54db374>] _raw_spin_lock_irqsave+0x8c/0xa0
 softirqs last  enabled at (9784148): [<ffffd21ec472ff44>] handle_softirqs+0x45c/0x4b0
 softirqs last disabled at (9784141): [<ffffd21ec46d01e4>] __do_softirq+0x1c/0x28
 ---[ end trace 0000000000000000 ]---
 BTRFS critical (device dm-2): bad ordered extent accounting, root=5 ino=1492 OE offset=1654784 OE len=57344 to_dec=49152 left=0

[CAUSE]
The function btrfs_mark_ordered_io_finished() is called for marking all
ordered extents in the page range as finished, for error handling.

But for sector size < page size cases, we can have multiple ordered
extents in one page.

If extent_writepage_io() failed (the only possible case is
submit_one_sector() failed to grab an extent map), then the call site
inside extent_writepage() will call btrfs_mark_ordered_io_finished() to
finish the created ordered extents.

However some range of the ordered extent may have been submitted already,
then btrfs_mark_ordered_io_finished() is called on the same range, causing
double accounting.

[FIX]
- Introduce a new member btrfs_bio_ctrl::last_submitted
  This will trace the last sector submitted through
  extent_writepage_io().

  So for the above extent_writepage() case, we will know exactly which
  sectors are submitted and should not do the ordered extent accounting.

- Introduce a helper cleanup_ordered_extents()
  This will do a sector-by-sector cleanup with
  btrfs_bio_ctrl::last_submitted and btrfs_bio_ctrl::submit_bitmap into
  consideartion.

  Using @last_submitted is to avoid double accounting on the submitted
  ranges.
  Meanwhile using @submit_bitmap is to avoid touching ranges going
  through compression.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 41 +++++++++++++++++++++++++++++++++++++----
 1 file changed, 37 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index e629d2ee152a..427bfbe737f2 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -108,6 +108,14 @@ struct btrfs_bio_ctrl {
 	 * This is to avoid touching ranges covered by compression/inline.
 	 */
 	unsigned long submit_bitmap;
+
+	/*
+	 * The end (exclusive) of the last submitted range in the folio.
+	 *
+	 * This is for sector size < page size case where we may hit error
+	 * half way.
+	 */
+	u64 last_submitted;
 };
 
 static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
@@ -1435,6 +1443,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 		ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
 		if (ret < 0)
 			goto out;
+		bio_ctrl->last_submitted = cur + fs_info->sectorsize;
 		submitted_io = true;
 	}
 out:
@@ -1453,6 +1462,24 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 	return ret;
 }
 
+static void cleanup_ordered_extents(struct btrfs_inode *inode,
+				    struct folio *folio, u64 file_pos,
+				    u64 num_bytes, unsigned long *bitmap)
+{
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	unsigned int cur_bit = (file_pos - folio_pos(folio)) >> fs_info->sectorsize_bits;
+
+	for_each_set_bit_from(cur_bit, bitmap, fs_info->sectors_per_page) {
+		u64 cur_pos = folio_pos(folio) + (cur_bit << fs_info->sectorsize_bits);
+
+		if (cur_pos >= file_pos + num_bytes)
+			break;
+
+		btrfs_mark_ordered_io_finished(inode, folio, cur_pos,
+					       fs_info->sectorsize, false);
+	}
+}
+
 /*
  * the writepage semantics are similar to regular writepage.  extent
  * records are inserted to lock ranges in the tree, and as dirty areas
@@ -1492,6 +1519,7 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
 	 * The proper bitmap can only be initialized until writepage_delalloc().
 	 */
 	bio_ctrl->submit_bitmap = (unsigned long)-1;
+	bio_ctrl->last_submitted = page_start;
 	ret = set_folio_extent_mapped(folio);
 	if (ret < 0)
 		goto done;
@@ -1511,8 +1539,10 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
 
 done:
 	if (ret) {
-		btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
-					       page_start, PAGE_SIZE, !ret);
+		cleanup_ordered_extents(BTRFS_I(inode), folio,
+				bio_ctrl->last_submitted,
+				page_start + PAGE_SIZE - bio_ctrl->last_submitted,
+				&bio_ctrl->submit_bitmap);
 		mapping_set_error(folio->mapping, ret);
 	}
 
@@ -2288,14 +2318,17 @@ void extent_write_locked_range(struct inode *inode, const struct folio *locked_f
 		 * extent_writepage_io() will do the truncation correctly.
 		 */
 		bio_ctrl.submit_bitmap = (unsigned long)-1;
+		bio_ctrl.last_submitted = cur;
 		ret = extent_writepage_io(BTRFS_I(inode), folio, cur, cur_len,
 					  &bio_ctrl, i_size);
 		if (ret == 1)
 			goto next_page;
 
 		if (ret) {
-			btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
-						       cur, cur_len, !ret);
+			cleanup_ordered_extents(BTRFS_I(inode), folio,
+					bio_ctrl.last_submitted,
+					cur_end + 1 - bio_ctrl.last_submitted,
+					&bio_ctrl.submit_bitmap);
 			mapping_set_error(mapping, ret);
 		}
 		btrfs_folio_end_lock(fs_info, folio, cur, cur_len);

From patchwork Wed Oct 30 06:33:58 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Qu Wenruo <wqu@suse.com>
X-Patchwork-Id: 13855969
Received: from smtp-out2.suse.de (smtp-out2.suse.de [195.135.223.131])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id D220D1D0E33
	for <linux-btrfs@vger.kernel.org>; Wed, 30 Oct 2024 06:34:39 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=195.135.223.131
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1730270084; cv=none;
 b=cCU3Ay7r82lAoBwJRwzzRb/nnPL2sU4x3NofIia8UoAHdo1LTQyOStNYsQm0QOuZUcYi2R+FbP8Fx5Kf9vaDGC0+wttFSrszD+ysKNP0gkX3pZ8ysBjBGgg0khmKB8Wh6RBCY3joKR/n5ALzsMN/J0Ws7YllicSiFVcRBqtYQvI=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1730270084; c=relaxed/simple;
	bh=XDDKVV+zK/ZA2wHrxFxgRjlcPDo+hvmNRPfI+7xRAYI=;
	h=From:To:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=lNoygprHNePOjhXMjQbfIBXItMK4cuuncOU0Yv0TtYGHRDpZTP4YS7F8qCepjH/v2fDuBLJp03eWTTuoux6BJDJAdRD68FLFUfasJzf9I0bEdrm8MISAucxClvIeYq88F2wkeca5d+eaVUu28pAOqN4GUBfPMtTg0TSaU8VBr5Y=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=suse.com;
 spf=pass smtp.mailfrom=suse.com;
 dkim=pass (1024-bit key) header.d=suse.com header.i=@suse.com
 header.b=KYLTXYBg;
 dkim=pass (1024-bit key) header.d=suse.com header.i=@suse.com
 header.b=KYLTXYBg; arc=none smtp.client-ip=195.135.223.131
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=suse.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=suse.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=suse.com header.i=@suse.com
 header.b="KYLTXYBg";
	dkim=pass (1024-bit key) header.d=suse.com header.i=@suse.com
 header.b="KYLTXYBg"
Received: from imap1.dmz-prg2.suse.org (unknown [10.150.64.97])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest
 SHA256)
	(No client certificate requested)
	by smtp-out2.suse.de (Postfix) with ESMTPS id 0EE191FDCB
	for <linux-btrfs@vger.kernel.org>; Wed, 30 Oct 2024 06:34:38 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1;
	t=1730270078; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:
	 mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=UXxsvq9dKa/KOxuNZn7JZVBgEoOcVCG4E2WiF45nTl4=;
	b=KYLTXYBgHXi+CvfxgmA7kfZ3ZPOu8OoPV/i3vQdvbsdjsESkqUyTllrz2uH7k/LRp/Zapv
	uAOsEjnyVvP9hOZFFzLDUtGqwxKNXmZYc4luOCeotSLJYLCSdLmMAs6JeUk9OAdXw/QaMQ
	Ky3PR8lZbrAru5vFXsuuLEkgshtFusQ=
Authentication-Results: smtp-out2.suse.de;
	none
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1;
	t=1730270078; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:
	 mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=UXxsvq9dKa/KOxuNZn7JZVBgEoOcVCG4E2WiF45nTl4=;
	b=KYLTXYBgHXi+CvfxgmA7kfZ3ZPOu8OoPV/i3vQdvbsdjsESkqUyTllrz2uH7k/LRp/Zapv
	uAOsEjnyVvP9hOZFFzLDUtGqwxKNXmZYc4luOCeotSLJYLCSdLmMAs6JeUk9OAdXw/QaMQ
	Ky3PR8lZbrAru5vFXsuuLEkgshtFusQ=
Received: from imap1.dmz-prg2.suse.org (localhost [127.0.0.1])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest
 SHA256)
	(No client certificate requested)
	by imap1.dmz-prg2.suse.org (Postfix) with ESMTPS id 4D069136A5
	for <linux-btrfs@vger.kernel.org>; Wed, 30 Oct 2024 06:34:37 +0000 (UTC)
Received: from dovecot-director2.suse.de ([2a07:de40:b281:106:10:150:64:167])
	by imap1.dmz-prg2.suse.org with ESMTPSA
	id UHVyBH3TIWcFcwAAD6G6ig
	(envelope-from <wqu@suse.com>)
	for <linux-btrfs@vger.kernel.org>; Wed, 30 Oct 2024 06:34:37 +0000
From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Subject: [PATCH 2/6] btrfs: extract the inner loop of cow_file_range() to
 enhance the error handling
Date: Wed, 30 Oct 2024 17:03:58 +1030
Message-ID: 
 <de9a521e078f65569ece1db330daebec836e5282.1730269807.git.wqu@suse.com>
X-Mailer: git-send-email 2.47.0
In-Reply-To: <cover.1730269807.git.wqu@suse.com>
References: <cover.1730269807.git.wqu@suse.com>
Precedence: bulk
X-Mailing-List: linux-btrfs@vger.kernel.org
List-Id: <linux-btrfs.vger.kernel.org>
List-Subscribe: <mailto:linux-btrfs+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-btrfs+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Spam-Score: -2.80
X-Spamd-Result: default: False [-2.80 / 50.00];
	BAYES_HAM(-3.00)[100.00%];
	MID_CONTAINS_FROM(1.00)[];
	NEURAL_HAM_LONG(-1.00)[-1.000];
	R_MISSING_CHARSET(0.50)[];
	NEURAL_HAM_SHORT(-0.20)[-1.000];
	MIME_GOOD(-0.10)[text/plain];
	FUZZY_BLOCKED(0.00)[rspamd.com];
	RCVD_VIA_SMTP_AUTH(0.00)[];
	RCPT_COUNT_ONE(0.00)[1];
	ARC_NA(0.00)[];
	DKIM_SIGNED(0.00)[suse.com:s=susede1];
	DBL_BLOCKED_OPENRESOLVER(0.00)[suse.com:mid,suse.com:email];
	FROM_EQ_ENVFROM(0.00)[];
	FROM_HAS_DN(0.00)[];
	MIME_TRACE(0.00)[0:+];
	RCVD_COUNT_TWO(0.00)[2];
	TO_MATCH_ENVRCPT_ALL(0.00)[];
	TO_DN_NONE(0.00)[];
	PREVIOUSLY_DELIVERED(0.00)[linux-btrfs@vger.kernel.org];
	RCVD_TLS_ALL(0.00)[]
X-Spam-Flag: NO
X-Spam-Level: 

[PROBLEMS]

Currently cow_file_range() has a very complex error handling, splitting
the handling into 3 different parts (with one extra hidden part)

         |-------(1)----|---(2)---|-------------(3)----------|
         `- orig_start  `- start  `- start + cur_alloc_size  `- end

Range (1) is the one we handled without any error.
Range (2) is the current range we're allocating, has data extent
reserved but failed to insert ordered extent.
Range (3) is the range we have not yet touched.

Furthermore there is a special case for range (1) that if a range is for
data reloc tree, and btrfs_reloc_clone_csums() failed, such range will
be counted as range (1), to avoid metadata space release.

Furthermore there is a very complex handling inside the while() loop.

[ENHANCEMENT]
- Extract the main part of the while() loop into run_one_cow_range()
  Which does the data extent reservation, extent map creation and
  ordered extent creation.

  With its own error handling.

  For the special case of range (1) where btrfs_reloc_clone_csums()
  failed, it will return error but still with ins->offset set, so that
  caller can skip the range for error handling.

- Shrink the error range handling to 2 types
  With the proper error handling inside run_one_cow_range(), if a range
  failed run_one_cow_range(), there will be no ordered extent or extent
  map, thus the old range (2) and range (3) can be merged into one.

- Remove some unnecessary ALIGN()
  The range [start, end] is already aligned to sector boundary.
  Just use ASSERT() to check the alignment.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 337 ++++++++++++++++++++++++-----------------------
 1 file changed, 173 insertions(+), 164 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a2128eccafb3..38b063e0ca43 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1293,6 +1293,147 @@ u64 btrfs_get_extent_allocation_hint(struct btrfs_inode *inode, u64 start,
 	return alloc_hint;
 }
 
+/*
+ * Run one cow range, which includes:
+ *
+ * - Reserve the data extent
+ * - Create the io extent map
+ * - Create the ordered extent
+ *
+ * @ins will be updated if the range should be skipped for error handling, no
+ * matter the return value.
+ *
+ * Return 0 if everything is fine.
+ * Return -EAGAIN if btrfs_reserve_extent() failed for zoned fs, caller needs
+ * some extra handling.
+ * Return <0 for other errors.
+ */
+static int run_one_cow_range(struct btrfs_inode *inode,
+			     struct folio *locked_folio,
+			     struct btrfs_key *ins,
+			     u64 start,
+			     u64 end, u64 *alloc_hint, bool keep_locked)
+{
+	struct btrfs_root *root = inode->root;
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct btrfs_ordered_extent *ordered;
+	struct btrfs_file_extent file_extent = { 0 };
+	struct extent_state *cached = NULL;
+	struct extent_map *em = NULL;
+	unsigned long page_ops;
+	const u64 len = end + 1 - start;
+	u32 min_alloc_size;
+	int ret;
+
+	ASSERT(IS_ALIGNED(start, fs_info->sectorsize));
+	ASSERT(IS_ALIGNED(end + 1, fs_info->sectorsize));
+
+	ins->offset = 0;
+	ins->objectid = 0;
+
+	/*
+	 * Relocation relies on the relocated extents to have exactly the same
+	 * size as the original extents. Normally writeback for relocation data
+	 * extents follows a NOCOW path because relocation preallocates the
+	 * extents. However, due to an operation such as scrub turning a block
+	 * group to RO mode, it may fallback to COW mode, so we must make sure
+	 * an extent allocated during COW has exactly the requested size and can
+	 * not be split into smaller extents, otherwise relocation breaks and
+	 * fails during the stage where it updates the bytenr of file extent
+	 * items.
+	 */
+	if (btrfs_is_data_reloc_root(root))
+		min_alloc_size = len;
+	else
+		min_alloc_size = fs_info->sectorsize;
+
+	ret = btrfs_reserve_extent(root, len, len, min_alloc_size, 0,
+				   *alloc_hint, ins, 1, 1);
+	if (ret < 0)
+		return ret;
+
+	file_extent.disk_bytenr = ins->objectid;
+	file_extent.disk_num_bytes = ins->offset;
+	file_extent.num_bytes = ins->offset;
+	file_extent.ram_bytes = ins->offset;
+	file_extent.offset = 0;
+	file_extent.compression = BTRFS_COMPRESS_NONE;
+
+	lock_extent(&inode->io_tree, start, start + ins->offset - 1,
+		    &cached);
+
+	em = btrfs_create_io_em(inode, start, &file_extent,
+				BTRFS_ORDERED_REGULAR);
+	if (IS_ERR(em)) {
+		unlock_extent(&inode->io_tree, start,
+			      start + ins->offset - 1, &cached);
+		ret = PTR_ERR(em);
+		goto out_free_reserved;
+	}
+	free_extent_map(em);
+
+	ordered = btrfs_alloc_ordered_extent(inode, start, &file_extent,
+					     1 << BTRFS_ORDERED_REGULAR);
+	if (IS_ERR(ordered)) {
+		unlock_extent(&inode->io_tree, start,
+			      start + ins->offset - 1, &cached);
+		ret = PTR_ERR(ordered);
+		goto out_drop_em;
+	}
+
+	if (btrfs_is_data_reloc_root(root)) {
+		ret = btrfs_reloc_clone_csums(ordered);
+
+		/*
+		 * Only drop cache here, and process as normal.
+		 *
+		 * We must not allow extent_clear_unlock_delalloc()
+		 * at error handling to free meta of this ordered
+		 * extent, as its meta should be freed by
+		 * btrfs_finish_ordered_io().
+		 *
+		 * So we must continue until @start is increased to
+		 * skip current ordered extent.
+		 */
+		if (ret < 0)
+			btrfs_drop_extent_map_range(inode, start,
+						    start + ins->offset - 1,
+						    false);
+	}
+	btrfs_put_ordered_extent(ordered);
+
+	btrfs_dec_block_group_reservations(fs_info, ins->objectid);
+
+	/*
+	 * We're not doing compressed IO, don't unlock the first page
+	 * (which the caller expects to stay locked), don't clear any
+	 * dirty bits and don't set any writeback bits
+	 *
+	 * Do set the Ordered (Private2) bit so we know this page was
+	 * properly setup for writepage.
+	 */
+	page_ops = (keep_locked ? 0 : PAGE_UNLOCK);
+	page_ops |= PAGE_SET_ORDERED;
+
+	extent_clear_unlock_delalloc(inode, start, start + ins->offset - 1,
+				     locked_folio, &cached,
+				     EXTENT_LOCKED | EXTENT_DELALLOC,
+				     page_ops);
+	*alloc_hint = ins->objectid + ins->offset;
+	return ret;
+
+out_drop_em:
+	btrfs_drop_extent_map_range(inode, start, start + ins->offset - 1, false);
+out_free_reserved:
+	btrfs_dec_block_group_reservations(fs_info, ins->objectid);
+	btrfs_free_reserved_extent(fs_info, ins->objectid, ins->offset, 1);
+	/* This is reserved for btrfs_reserve_extent() error. */
+	ASSERT(ret != -EAGAIN);
+	ins->offset = 0;
+	ins->objectid = 0;
+	return ret;
+}
+
 /*
  * when extent_io.c finds a delayed allocation range in the file,
  * the call backs end up in this code.  The basic idea is to
@@ -1328,15 +1469,10 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 {
 	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
-	struct extent_state *cached = NULL;
+	const u64 num_bytes = end + 1 - start;
 	u64 alloc_hint = 0;
 	u64 orig_start = start;
-	u64 num_bytes;
-	u64 cur_alloc_size = 0;
-	u64 min_alloc_size;
-	u64 blocksize = fs_info->sectorsize;
-	struct btrfs_key ins;
-	struct extent_map *em;
+	u64 cur = start;
 	unsigned clear_bits;
 	unsigned long page_ops;
 	int ret = 0;
@@ -1346,8 +1482,8 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 		goto out_unlock;
 	}
 
-	num_bytes = ALIGN(end - start + 1, blocksize);
-	num_bytes = max(blocksize,  num_bytes);
+	ASSERT(IS_ALIGNED(start, fs_info->sectorsize));
+	ASSERT(IS_ALIGNED(end + 1, fs_info->sectorsize));
 	ASSERT(num_bytes <= btrfs_super_total_bytes(fs_info->super_copy));
 
 	inode_should_defrag(inode, start, end, num_bytes, SZ_64K);
@@ -1372,29 +1508,16 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 
 	alloc_hint = btrfs_get_extent_allocation_hint(inode, start, num_bytes);
 
-	/*
-	 * Relocation relies on the relocated extents to have exactly the same
-	 * size as the original extents. Normally writeback for relocation data
-	 * extents follows a NOCOW path because relocation preallocates the
-	 * extents. However, due to an operation such as scrub turning a block
-	 * group to RO mode, it may fallback to COW mode, so we must make sure
-	 * an extent allocated during COW has exactly the requested size and can
-	 * not be split into smaller extents, otherwise relocation breaks and
-	 * fails during the stage where it updates the bytenr of file extent
-	 * items.
-	 */
-	if (btrfs_is_data_reloc_root(root))
-		min_alloc_size = num_bytes;
-	else
-		min_alloc_size = fs_info->sectorsize;
+	while (cur < end) {
+		struct btrfs_key ins = { 0 };
 
-	while (num_bytes > 0) {
-		struct btrfs_ordered_extent *ordered;
-		struct btrfs_file_extent file_extent;
-
-		ret = btrfs_reserve_extent(root, num_bytes, num_bytes,
-					   min_alloc_size, 0, alloc_hint,
-					   &ins, 1, 1);
+		ret = run_one_cow_range(inode, locked_folio, &ins,
+					cur, end, &alloc_hint, keep_locked);
+		/*
+		 * @cur must be advanced before error handling.
+		 * Special handling for possible btrfs_reloc_clone_csums() error.
+		 */
+		cur += ins.offset;
 		if (ret == -EAGAIN) {
 			/*
 			 * btrfs_reserve_extent only returns -EAGAIN for zoned
@@ -1408,145 +1531,51 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 			 * us, or return -ENOSPC if it can't handle retries.
 			 */
 			ASSERT(btrfs_is_zoned(fs_info));
-			if (start == orig_start) {
+			if (cur == orig_start) {
 				wait_on_bit_io(&inode->root->fs_info->flags,
 					       BTRFS_FS_NEED_ZONE_FINISH,
 					       TASK_UNINTERRUPTIBLE);
 				continue;
 			}
 			if (done_offset) {
-				*done_offset = start - 1;
+				*done_offset = cur - 1;
 				return 0;
 			}
 			ret = -ENOSPC;
 		}
 		if (ret < 0)
 			goto out_unlock;
-		cur_alloc_size = ins.offset;
-
-		file_extent.disk_bytenr = ins.objectid;
-		file_extent.disk_num_bytes = ins.offset;
-		file_extent.num_bytes = ins.offset;
-		file_extent.ram_bytes = ins.offset;
-		file_extent.offset = 0;
-		file_extent.compression = BTRFS_COMPRESS_NONE;
-
-		lock_extent(&inode->io_tree, start, start + cur_alloc_size - 1,
-			    &cached);
-
-		em = btrfs_create_io_em(inode, start, &file_extent,
-					BTRFS_ORDERED_REGULAR);
-		if (IS_ERR(em)) {
-			unlock_extent(&inode->io_tree, start,
-				      start + cur_alloc_size - 1, &cached);
-			ret = PTR_ERR(em);
-			goto out_reserve;
-		}
-		free_extent_map(em);
-
-		ordered = btrfs_alloc_ordered_extent(inode, start, &file_extent,
-						     1 << BTRFS_ORDERED_REGULAR);
-		if (IS_ERR(ordered)) {
-			unlock_extent(&inode->io_tree, start,
-				      start + cur_alloc_size - 1, &cached);
-			ret = PTR_ERR(ordered);
-			goto out_drop_extent_cache;
-		}
-
-		if (btrfs_is_data_reloc_root(root)) {
-			ret = btrfs_reloc_clone_csums(ordered);
-
-			/*
-			 * Only drop cache here, and process as normal.
-			 *
-			 * We must not allow extent_clear_unlock_delalloc()
-			 * at out_unlock label to free meta of this ordered
-			 * extent, as its meta should be freed by
-			 * btrfs_finish_ordered_io().
-			 *
-			 * So we must continue until @start is increased to
-			 * skip current ordered extent.
-			 */
-			if (ret)
-				btrfs_drop_extent_map_range(inode, start,
-							    start + cur_alloc_size - 1,
-							    false);
-		}
-		btrfs_put_ordered_extent(ordered);
-
-		btrfs_dec_block_group_reservations(fs_info, ins.objectid);
-
-		/*
-		 * We're not doing compressed IO, don't unlock the first page
-		 * (which the caller expects to stay locked), don't clear any
-		 * dirty bits and don't set any writeback bits
-		 *
-		 * Do set the Ordered (Private2) bit so we know this page was
-		 * properly setup for writepage.
-		 */
-		page_ops = (keep_locked ? 0 : PAGE_UNLOCK);
-		page_ops |= PAGE_SET_ORDERED;
-
-		extent_clear_unlock_delalloc(inode, start, start + cur_alloc_size - 1,
-					     locked_folio, &cached,
-					     EXTENT_LOCKED | EXTENT_DELALLOC,
-					     page_ops);
-		if (num_bytes < cur_alloc_size)
-			num_bytes = 0;
-		else
-			num_bytes -= cur_alloc_size;
-		alloc_hint = ins.objectid + ins.offset;
-		start += cur_alloc_size;
-		cur_alloc_size = 0;
-
-		/*
-		 * btrfs_reloc_clone_csums() error, since start is increased
-		 * extent_clear_unlock_delalloc() at out_unlock label won't
-		 * free metadata of current ordered extent, we're OK to exit.
-		 */
-		if (ret)
-			goto out_unlock;
 	}
 done:
 	if (done_offset)
 		*done_offset = end;
 	return ret;
 
-out_drop_extent_cache:
-	btrfs_drop_extent_map_range(inode, start, start + cur_alloc_size - 1, false);
-out_reserve:
-	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
-	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
 out_unlock:
 	/*
-	 * Now, we have three regions to clean up:
+	 * Now we have two regions to clean up:
 	 *
-	 * |-------(1)----|---(2)---|-------------(3)----------|
-	 * `- orig_start  `- start  `- start + cur_alloc_size  `- end
+	 * |-----(1)-------|----(2)-----|
+	 * `- start        `- cur       `- end
 	 *
-	 * We process each region below.
+	 * @cur is only updated when a successful extent reservation along
+	 * with extent maps/ordered extents created.
 	 */
-
 	clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
 		EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV;
 	page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;
 
 	/*
-	 * For the range (1). We have already instantiated the ordered extents
-	 * for this region. They are cleaned up by
-	 * btrfs_cleanup_ordered_extents() in e.g,
-	 * btrfs_run_delalloc_range(). EXTENT_LOCKED | EXTENT_DELALLOC are
-	 * already cleared in the above loop. And, EXTENT_DELALLOC_NEW |
-	 * EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV are handled by the cleanup
-	 * function.
+	 * For the range (1), we have already instantiated the ordered extents
+	 * for the region. They will be cleaned up by btrfs_cleanup_ordered_extents().
 	 *
-	 * However, in case of @keep_locked, we still need to unlock the pages
-	 * (except @locked_folio) to ensure all the pages are unlocked.
+	 * Here we only need to unlock the pages, which will be done by
+	 * the extent_clear_unlock_delalloc() with PAGE_UNLOCK.
 	 */
-	if (keep_locked && orig_start < start) {
+	if (keep_locked && start < cur) {
 		if (!locked_folio)
 			mapping_set_error(inode->vfs_inode.i_mapping, ret);
-		extent_clear_unlock_delalloc(inode, orig_start, start - 1,
+		extent_clear_unlock_delalloc(inode, orig_start, cur - 1,
 					     locked_folio, NULL, 0, page_ops);
 	}
 
@@ -1555,39 +1584,19 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 	 * clearing these flags under the extent lock, so lock the rest of the
 	 * range and clear everything up.
 	 */
-	lock_extent(&inode->io_tree, start, end, NULL);
+	lock_extent(&inode->io_tree, cur, end, NULL);
 
 	/*
-	 * For the range (2). If we reserved an extent for our delalloc range
-	 * (or a subrange) and failed to create the respective ordered extent,
-	 * then it means that when we reserved the extent we decremented the
-	 * extent's size from the data space_info's bytes_may_use counter and
-	 * incremented the space_info's bytes_reserved counter by the same
-	 * amount. We must make sure extent_clear_unlock_delalloc() does not try
-	 * to decrement again the data space_info's bytes_may_use counter,
-	 * therefore we do not pass it the flag EXTENT_CLEAR_DATA_RESV.
-	 */
-	if (cur_alloc_size) {
-		extent_clear_unlock_delalloc(inode, start,
-					     start + cur_alloc_size - 1,
-					     locked_folio, &cached, clear_bits,
-					     page_ops);
-		btrfs_qgroup_free_data(inode, NULL, start, cur_alloc_size, NULL);
-	}
-
-	/*
-	 * For the range (3). We never touched the region. In addition to the
+	 * For the range (2). We never touched the region. In addition to the
 	 * clear_bits above, we add EXTENT_CLEAR_DATA_RESV to release the data
 	 * space_info's bytes_may_use counter, reserved in
 	 * btrfs_check_data_free_space().
 	 */
-	if (start + cur_alloc_size < end) {
+	if (cur < end) {
 		clear_bits |= EXTENT_CLEAR_DATA_RESV;
-		extent_clear_unlock_delalloc(inode, start + cur_alloc_size,
-					     end, locked_folio,
-					     &cached, clear_bits, page_ops);
-		btrfs_qgroup_free_data(inode, NULL, start + cur_alloc_size,
-				       end - start - cur_alloc_size + 1, NULL);
+		extent_clear_unlock_delalloc(inode, cur, end, locked_folio,
+					     NULL, clear_bits, page_ops);
+		btrfs_qgroup_free_data(inode, NULL, cur, end + 1 - cur, NULL);
 	}
 	return ret;
 }

From patchwork Wed Oct 30 06:33:59 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Qu Wenruo <wqu@suse.com>
X-Patchwork-Id: 13855968
Received: from smtp-out2.suse.de (smtp-out2.suse.de [195.135.223.131])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 06C051D1F6F
	for <linux-btrfs@vger.kernel.org>; Wed, 30 Oct 2024 06:34:40 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=195.135.223.131
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1730270083; cv=none;
 b=DGZaRIgvFf7E/c5Qxr1hgHKzhyMy7PMkwovv3PvDolVWW1dHG0c3MjwQ2rGf7doGBRRc7prPUhDrxq1f1G7X39Ed8DUGuJE2RMit8tcaKKI5rIfxj8XzH3ELNAzBFoyS3b6YVHCY8VRjPmQVmMQLTbM3Nga1cc1MjPeJC8Fdczc=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1730270083; c=relaxed/simple;
	bh=QvhJAtrwVksKhDrNEMW4I123t0R03/xPdLM+mhHsmVM=;
	h=From:To:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=T3YQbfZJ3BP0gC5aH1CSbM04D4w64YE7ruU04LnumFsty3y3Y1r+y1FBckM31x4/mPQV2MUDxqMYqO7NES4Y5xYcFe9GFXHqttIWWMdndIgZFfgDfc8CtaBrJDM7r2OcDvpqcPuacY/XX8ij2iYSqFZG6E3i0iYveZq41hfl3dI=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=suse.com;
 spf=pass smtp.mailfrom=suse.com;
 dkim=pass (1024-bit key) header.d=suse.com header.i=@suse.com
 header.b=JkKz1IAA;
 dkim=pass (1024-bit key) header.d=suse.com header.i=@suse.com
 header.b=JkKz1IAA; arc=none smtp.client-ip=195.135.223.131
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=suse.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=suse.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=suse.com header.i=@suse.com
 header.b="JkKz1IAA";
	dkim=pass (1024-bit key) header.d=suse.com header.i=@suse.com
 header.b="JkKz1IAA"
Received: from imap1.dmz-prg2.suse.org (imap1.dmz-prg2.suse.org
 [IPv6:2a07:de40:b281:104:10:150:64:97])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest
 SHA256)
	(No client certificate requested)
	by smtp-out2.suse.de (Postfix) with ESMTPS id 4BC211FDCE
	for <linux-btrfs@vger.kernel.org>; Wed, 30 Oct 2024 06:34:39 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1;
	t=1730270079; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:
	 mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=gKLQ2am/PdVtgXHmBfhxXl1ko5/DrVypFLdp74AoFhg=;
	b=JkKz1IAAKwr1csZZQ4Bc9INbigtF5Vx4E6jVOiEVzUXe83rt+mivt8iuE+VGhqwMniyfGh
	hRndoow3Jk+1F9O1NIklM5wpJvxoTWYayvnKlmtgXabRgQ0BkFrd89mkLeGlA6X4f9+kp1
	3IZzxwduf+BQHx1acqoNGSPrCr81UoI=
Authentication-Results: smtp-out2.suse.de;
	dkim=pass header.d=suse.com header.s=susede1 header.b=JkKz1IAA
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1;
	t=1730270079; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:
	 mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=gKLQ2am/PdVtgXHmBfhxXl1ko5/DrVypFLdp74AoFhg=;
	b=JkKz1IAAKwr1csZZQ4Bc9INbigtF5Vx4E6jVOiEVzUXe83rt+mivt8iuE+VGhqwMniyfGh
	hRndoow3Jk+1F9O1NIklM5wpJvxoTWYayvnKlmtgXabRgQ0BkFrd89mkLeGlA6X4f9+kp1
	3IZzxwduf+BQHx1acqoNGSPrCr81UoI=
Received: from imap1.dmz-prg2.suse.org (localhost [127.0.0.1])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest
 SHA256)
	(No client certificate requested)
	by imap1.dmz-prg2.suse.org (Postfix) with ESMTPS id 8054E136A5
	for <linux-btrfs@vger.kernel.org>; Wed, 30 Oct 2024 06:34:38 +0000 (UTC)
Received: from dovecot-director2.suse.de ([2a07:de40:b281:106:10:150:64:167])
	by imap1.dmz-prg2.suse.org with ESMTPSA
	id 2Jz3EH7TIWcFcwAAD6G6ig
	(envelope-from <wqu@suse.com>)
	for <linux-btrfs@vger.kernel.org>; Wed, 30 Oct 2024 06:34:38 +0000
From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Subject: [PATCH 3/6] btrfs: use FGP_STABLE to wait for folio writeback
Date: Wed, 30 Oct 2024 17:03:59 +1030
Message-ID: 
 <349d86a380010b227844da15bc48b8b6a1dda599.1730269807.git.wqu@suse.com>
X-Mailer: git-send-email 2.47.0
In-Reply-To: <cover.1730269807.git.wqu@suse.com>
References: <cover.1730269807.git.wqu@suse.com>
Precedence: bulk
X-Mailing-List: linux-btrfs@vger.kernel.org
List-Id: <linux-btrfs.vger.kernel.org>
List-Subscribe: <mailto:linux-btrfs+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-btrfs+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Rspamd-Queue-Id: 4BC211FDCE
X-Spam-Level: 
X-Spamd-Result: default: False [-3.01 / 50.00];
	BAYES_HAM(-3.00)[100.00%];
	NEURAL_HAM_LONG(-1.00)[-1.000];
	MID_CONTAINS_FROM(1.00)[];
	R_MISSING_CHARSET(0.50)[];
	R_DKIM_ALLOW(-0.20)[suse.com:s=susede1];
	NEURAL_HAM_SHORT(-0.20)[-1.000];
	MIME_GOOD(-0.10)[text/plain];
	MX_GOOD(-0.01)[];
	TO_MATCH_ENVRCPT_ALL(0.00)[];
	RCVD_VIA_SMTP_AUTH(0.00)[];
	FROM_EQ_ENVFROM(0.00)[];
	ARC_NA(0.00)[];
	FROM_HAS_DN(0.00)[];
	RCPT_COUNT_ONE(0.00)[1];
	PREVIOUSLY_DELIVERED(0.00)[linux-btrfs@vger.kernel.org];
	RCVD_TLS_ALL(0.00)[];
	DBL_BLOCKED_OPENRESOLVER(0.00)[suse.com:email,suse.com:dkim,suse.com:mid];
	DKIM_SIGNED(0.00)[suse.com:s=susede1];
	FUZZY_BLOCKED(0.00)[rspamd.com];
	TO_DN_NONE(0.00)[];
	RCVD_COUNT_TWO(0.00)[2];
	MIME_TRACE(0.00)[0:+];
	DKIM_TRACE(0.00)[suse.com:+]
X-Rspamd-Server: rspamd2.dmz-prg2.suse.org
X-Rspamd-Action: no action
X-Spam-Score: -3.01
X-Spam-Flag: NO

__filemap_get_folio() provides the flag FGP_STABLE to wait for
writeback.

There are two call sites doing __filemap_get_folio() then
folio_wait_writeback():

- btrfs_truncate_block()
- defrag_prepare_one_folio()

We can directly utilize that flag instead of manually calling
folio_wait_writeback().

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/defrag.c | 4 +---
 fs/btrfs/inode.c  | 4 +---
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index b95ef44c326b..1644470b9df7 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -865,7 +865,7 @@ static struct folio *defrag_prepare_one_folio(struct btrfs_inode *inode, pgoff_t
 
 again:
 	folio = __filemap_get_folio(mapping, index,
-				    FGP_LOCK | FGP_ACCESSED | FGP_CREAT, mask);
+			FGP_LOCK | FGP_ACCESSED | FGP_CREAT | FGP_STABLE, mask);
 	if (IS_ERR(folio))
 		return folio;
 
@@ -1222,8 +1222,6 @@ static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
 			goto free_folios;
 		}
 	}
-	for (i = 0; i < nr_pages; i++)
-		folio_wait_writeback(folios[i]);
 
 	/* Lock the pages range */
 	lock_extent(&inode->io_tree, start_index << PAGE_SHIFT,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 38b063e0ca43..2a39699ebbc9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4775,7 +4775,7 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
 	}
 again:
 	folio = __filemap_get_folio(mapping, index,
-				    FGP_LOCK | FGP_ACCESSED | FGP_CREAT, mask);
+			FGP_LOCK | FGP_ACCESSED | FGP_CREAT | FGP_STABLE, mask);
 	if (IS_ERR(folio)) {
 		btrfs_delalloc_release_space(inode, data_reserved, block_start,
 					     blocksize, true);
@@ -4808,8 +4808,6 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
 	if (ret < 0)
 		goto out_unlock;
 
-	folio_wait_writeback(folio);
-
 	lock_extent(io_tree, block_start, block_end, &cached_state);
 
 	ordered = btrfs_lookup_ordered_extent(inode, block_start);

From patchwork Wed Oct 30 06:34:00 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Qu Wenruo <wqu@suse.com>
X-Patchwork-Id: 13855970
Received: from smtp-out2.suse.de (smtp-out2.suse.de [195.135.223.131])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 24BF61D278D
	for <linux-btrfs@vger.kernel.org>; Wed, 30 Oct 2024 06:34:41 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=195.135.223.131
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1730270084; cv=none;
 b=mQB61f1DawajvNvC4Dz26M68ghDIxuLejfwjdAERSq5alStpAzs/FvUE3IhGtMImbVPG32lqmY4A0VhCGP8Jq7WKIwlHnGZBRPkWf66JU3GuKRQDD5r0VYhhyu3aaPBeTNgrJu6qA0NB5Bq2ykfmYOLLY7N+rR9bc0hL/VeXi30=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1730270084; c=relaxed/simple;
	bh=t75TGCjuFaMNxzcIX8oc2nlHJGf3XL5NaifOIcxQqeA=;
	h=From:To:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=ItBei4TIZtFDmWDrRg8p62CZzMcU83znvrMM4cgW36FZJSMU91FlFYhSEYXM2U1aQ0NRn6MM79IoOfDp1bHmWuCwFK+Q3MD7oSVbCZ00W9p/f7jMKVPlCHW2SnWtobPVdtzRhBlo2EaC7+8f2Oc9/9O+jpscybEMhnzI9uqi8k0=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=suse.com;
 spf=pass smtp.mailfrom=suse.com;
 dkim=pass (1024-bit key) header.d=suse.com header.i=@suse.com
 header.b=tBEa9LiV;
 dkim=pass (1024-bit key) header.d=suse.com header.i=@suse.com
 header.b=tBEa9LiV; arc=none smtp.client-ip=195.135.223.131
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=suse.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=suse.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=suse.com header.i=@suse.com
 header.b="tBEa9LiV";
	dkim=pass (1024-bit key) header.d=suse.com header.i=@suse.com
 header.b="tBEa9LiV"
Received: from imap1.dmz-prg2.suse.org (imap1.dmz-prg2.suse.org
 [IPv6:2a07:de40:b281:104:10:150:64:97])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest
 SHA256)
	(No client certificate requested)
	by smtp-out2.suse.de (Postfix) with ESMTPS id 754851FDD0
	for <linux-btrfs@vger.kernel.org>; Wed, 30 Oct 2024 06:34:40 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1;
	t=1730270080; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:
	 mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=OFWjHK1O5acXduukbp1QDhX/EahRD30um1Ycb0Z93gM=;
	b=tBEa9LiVsjWpTWjP0hJHpxIbH4wmD0NMCJN4y3huOVIcKU7hyzYOqcuDai5+OCUrnQ9nTP
	oQ3BSCrmr+w42S9T8S2rrZ6orhBOyEBxaVG1k0DUbq09Gk0ksRdJ9+dEtv+930WyEPY2rz
	D1KgnB+/Cvvho2lI1hUJABMAmbgHpG4=
Authentication-Results: smtp-out2.suse.de;
	dkim=pass header.d=suse.com header.s=susede1 header.b=tBEa9LiV
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1;
	t=1730270080; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:
	 mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=OFWjHK1O5acXduukbp1QDhX/EahRD30um1Ycb0Z93gM=;
	b=tBEa9LiVsjWpTWjP0hJHpxIbH4wmD0NMCJN4y3huOVIcKU7hyzYOqcuDai5+OCUrnQ9nTP
	oQ3BSCrmr+w42S9T8S2rrZ6orhBOyEBxaVG1k0DUbq09Gk0ksRdJ9+dEtv+930WyEPY2rz
	D1KgnB+/Cvvho2lI1hUJABMAmbgHpG4=
Received: from imap1.dmz-prg2.suse.org (localhost [127.0.0.1])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest
 SHA256)
	(No client certificate requested)
	by imap1.dmz-prg2.suse.org (Postfix) with ESMTPS id B29F6136A5
	for <linux-btrfs@vger.kernel.org>; Wed, 30 Oct 2024 06:34:39 +0000 (UTC)
Received: from dovecot-director2.suse.de ([2a07:de40:b281:106:10:150:64:167])
	by imap1.dmz-prg2.suse.org with ESMTPSA
	id GAJLHX/TIWcFcwAAD6G6ig
	(envelope-from <wqu@suse.com>)
	for <linux-btrfs@vger.kernel.org>; Wed, 30 Oct 2024 06:34:39 +0000
From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Subject: [PATCH 4/6] btrfs: make btrfs_do_readpage() to do block-by-block read
Date: Wed, 30 Oct 2024 17:04:00 +1030
Message-ID: 
 <380e75a203c864915ccb59ee42f0bb659396ba77.1730269807.git.wqu@suse.com>
X-Mailer: git-send-email 2.47.0
In-Reply-To: <cover.1730269807.git.wqu@suse.com>
References: <cover.1730269807.git.wqu@suse.com>
Precedence: bulk
X-Mailing-List: linux-btrfs@vger.kernel.org
List-Id: <linux-btrfs.vger.kernel.org>
List-Subscribe: <mailto:linux-btrfs+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-btrfs+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Rspamd-Queue-Id: 754851FDD0
X-Spam-Score: -3.01
X-Rspamd-Action: no action
X-Spamd-Result: default: False [-3.01 / 50.00];
	BAYES_HAM(-3.00)[100.00%];
	NEURAL_HAM_LONG(-1.00)[-1.000];
	MID_CONTAINS_FROM(1.00)[];
	R_MISSING_CHARSET(0.50)[];
	R_DKIM_ALLOW(-0.20)[suse.com:s=susede1];
	NEURAL_HAM_SHORT(-0.20)[-1.000];
	MIME_GOOD(-0.10)[text/plain];
	MX_GOOD(-0.01)[];
	TO_MATCH_ENVRCPT_ALL(0.00)[];
	RCVD_VIA_SMTP_AUTH(0.00)[];
	FROM_EQ_ENVFROM(0.00)[];
	ARC_NA(0.00)[];
	FROM_HAS_DN(0.00)[];
	RCPT_COUNT_ONE(0.00)[1];
	PREVIOUSLY_DELIVERED(0.00)[linux-btrfs@vger.kernel.org];
	RCVD_TLS_ALL(0.00)[];
	DBL_BLOCKED_OPENRESOLVER(0.00)[suse.com:dkim,suse.com:mid,suse.com:email];
	DKIM_SIGNED(0.00)[suse.com:s=susede1];
	FUZZY_BLOCKED(0.00)[rspamd.com];
	TO_DN_NONE(0.00)[];
	RCVD_COUNT_TWO(0.00)[2];
	MIME_TRACE(0.00)[0:+];
	DKIM_TRACE(0.00)[suse.com:+]
X-Rspamd-Server: rspamd1.dmz-prg2.suse.org
X-Spam-Flag: NO
X-Spam-Level: 

Currently if a btrfs has sector size < page size, btrfs_do_readpage()
will handle the range extent by extent, this is good for performance as
it doesn't need to re-lookup the same extent map again and again
(although __get_extent_map() already does extra cached em check).

This is totally fine and is a valid optimization, but it has an
assumption that, there is no partial uptodate range in the page.

Meanwhile there is an incoming feature, requiring btrfs to skip the full
page read if a buffered write range covers a full sector but not a full
page.

In that case, we can have a page that is partially uptodate, and the
current per-extent lookup can not handle such case.

So here we change btrfs_do_readpage() to do block-by-block read, this
simplifies the following things:

- Remove the need for @iosize variable
  Because we just use sectorsize as our increment.

- Remove @pg_offset, and calculate it inside the loop when needed
  It's just offset_in_folio().

- Use a for() loop instead of a while() loop

This will slightly reduce the read performance for
sector size < page size cases, but for the future where we can skip a
full page read for a lot of cases, it should still be worthy.

For sector size == page size, this brings no performance change.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 37 ++++++++++++-------------------------
 1 file changed, 12 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 427bfbe737f2..9518a5c5192b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -961,9 +961,7 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 	u64 block_start;
 	struct extent_map *em;
 	int ret = 0;
-	size_t pg_offset = 0;
-	size_t iosize;
-	size_t blocksize = fs_info->sectorsize;
+	const size_t blocksize = fs_info->sectorsize;
 
 	ret = set_folio_extent_mapped(folio);
 	if (ret < 0) {
@@ -974,23 +972,22 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 	if (folio->index == last_byte >> folio_shift(folio)) {
 		size_t zero_offset = offset_in_folio(folio, last_byte);
 
-		if (zero_offset) {
-			iosize = folio_size(folio) - zero_offset;
-			folio_zero_range(folio, zero_offset, iosize);
-		}
+		if (zero_offset)
+			folio_zero_range(folio, zero_offset,
+					 folio_size(folio) - zero_offset);
 	}
 	bio_ctrl->end_io_func = end_bbio_data_read;
 	begin_folio_read(fs_info, folio);
-	while (cur <= end) {
+	for (cur = start; cur <= end; cur += blocksize) {
 		enum btrfs_compression_type compress_type = BTRFS_COMPRESS_NONE;
+		unsigned long pg_offset = offset_in_folio(folio, cur);
 		bool force_bio_submit = false;
 		u64 disk_bytenr;
 
 		ASSERT(IS_ALIGNED(cur, fs_info->sectorsize));
 		if (cur >= last_byte) {
-			iosize = folio_size(folio) - pg_offset;
-			folio_zero_range(folio, pg_offset, iosize);
-			end_folio_read(folio, true, cur, iosize);
+			folio_zero_range(folio, pg_offset, end - cur + 1);
+			end_folio_read(folio, true, cur, end - cur + 1);
 			break;
 		}
 		em = __get_extent_map(inode, folio, cur, end - cur + 1,
@@ -1005,8 +1002,6 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 
 		compress_type = extent_map_compression(em);
 
-		iosize = min(extent_map_end(em) - cur, end - cur + 1);
-		iosize = ALIGN(iosize, blocksize);
 		if (compress_type != BTRFS_COMPRESS_NONE)
 			disk_bytenr = em->disk_bytenr;
 		else
@@ -1062,18 +1057,13 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 
 		/* we've found a hole, just zero and go on */
 		if (block_start == EXTENT_MAP_HOLE) {
-			folio_zero_range(folio, pg_offset, iosize);
-
-			end_folio_read(folio, true, cur, iosize);
-			cur = cur + iosize;
-			pg_offset += iosize;
+			folio_zero_range(folio, pg_offset, blocksize);
+			end_folio_read(folio, true, cur, blocksize);
 			continue;
 		}
 		/* the get_extent function already copied into the folio */
 		if (block_start == EXTENT_MAP_INLINE) {
-			end_folio_read(folio, true, cur, iosize);
-			cur = cur + iosize;
-			pg_offset += iosize;
+			end_folio_read(folio, true, cur, blocksize);
 			continue;
 		}
 
@@ -1084,12 +1074,9 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 
 		if (force_bio_submit)
 			submit_one_bio(bio_ctrl);
-		submit_extent_folio(bio_ctrl, disk_bytenr, folio, iosize,
+		submit_extent_folio(bio_ctrl, disk_bytenr, folio, blocksize,
 				    pg_offset);
-		cur = cur + iosize;
-		pg_offset += iosize;
 	}
-
 	return 0;
 }
 

From patchwork Wed Oct 30 06:34:01 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Qu Wenruo <wqu@suse.com>
X-Patchwork-Id: 13855972
Received: from smtp-out2.suse.de (smtp-out2.suse.de [195.135.223.131])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 753EA1D433B
	for <linux-btrfs@vger.kernel.org>; Wed, 30 Oct 2024 06:34:43 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=195.135.223.131
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1730270087; cv=none;
 b=r7OiT3LnmH3JObVRGTdsvtKixccTdelVAkaTaC2t5h/fl5e6g4DgO7joxHWphglGpF2jeclbNSIpNkJLBfIEm7TKPlK0UOZPiHFfi7NVaxA2/ldstrxq2joaVwEFno75qoL4J72CbUp1uHtSYJrVCVYb4IysAz9ZyUURxyEsykM=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1730270087; c=relaxed/simple;
	bh=u9LFesd0GRkiXacob0dstz0j6F9kKiAljjA16rMEW2U=;
	h=From:To:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=TAeFQ7rKc/dzxhHzO5oKj6yN6RFp7fG0MMvuzTYJg5n5CNAzisebrGKmABAzmlMkO37nSa/maBGvja8PJlU2tJ2ooOLpz+MaXM5/fICX+jO8lWK1jfHBsmRmUg9LzoKjRYpXBH7zAS1XiS+oth4pZ7vkwTiRmDsr7gwJhx5UdBY=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=suse.com;
 spf=pass smtp.mailfrom=suse.com;
 dkim=pass (1024-bit key) header.d=suse.com header.i=@suse.com
 header.b=bcci9lMm;
 dkim=pass (1024-bit key) header.d=suse.com header.i=@suse.com
 header.b=bcci9lMm; arc=none smtp.client-ip=195.135.223.131
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=suse.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=suse.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=suse.com header.i=@suse.com
 header.b="bcci9lMm";
	dkim=pass (1024-bit key) header.d=suse.com header.i=@suse.com
 header.b="bcci9lMm"
Received: from imap1.dmz-prg2.suse.org (unknown [10.150.64.97])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest
 SHA256)
	(No client certificate requested)
	by smtp-out2.suse.de (Postfix) with ESMTPS id A738B1F7D9
	for <linux-btrfs@vger.kernel.org>; Wed, 30 Oct 2024 06:34:41 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1;
	t=1730270081; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:
	 mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=bLFyAatZdlBaYEjAmLRRdTsWIn+98sVPOwcWs3H5b/E=;
	b=bcci9lMmUpFBytrSYrfvM7crntCbJDObPLHLz5idTSrVPO0s95bo8se+mHUm4I7XDqc4cL
	nE1UNurqILKX6sjm8VOfal3H24NODUKh63XnbuOqkpukrfN0pEBljgHD44T25P9roI7Ppv
	ZWEfvXb9YDYrdG3CHO3Xw3Mwc2CSqP4=
Authentication-Results: smtp-out2.suse.de;
	none
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1;
	t=1730270081; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:
	 mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=bLFyAatZdlBaYEjAmLRRdTsWIn+98sVPOwcWs3H5b/E=;
	b=bcci9lMmUpFBytrSYrfvM7crntCbJDObPLHLz5idTSrVPO0s95bo8se+mHUm4I7XDqc4cL
	nE1UNurqILKX6sjm8VOfal3H24NODUKh63XnbuOqkpukrfN0pEBljgHD44T25P9roI7Ppv
	ZWEfvXb9YDYrdG3CHO3Xw3Mwc2CSqP4=
Received: from imap1.dmz-prg2.suse.org (localhost [127.0.0.1])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest
 SHA256)
	(No client certificate requested)
	by imap1.dmz-prg2.suse.org (Postfix) with ESMTPS id E5A3A136A5
	for <linux-btrfs@vger.kernel.org>; Wed, 30 Oct 2024 06:34:40 +0000 (UTC)
Received: from dovecot-director2.suse.de ([2a07:de40:b281:106:10:150:64:167])
	by imap1.dmz-prg2.suse.org with ESMTPSA
	id cICuKYDTIWcFcwAAD6G6ig
	(envelope-from <wqu@suse.com>)
	for <linux-btrfs@vger.kernel.org>; Wed, 30 Oct 2024 06:34:40 +0000
From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Subject: [PATCH 5/6] btrfs: avoid deadlock when reading a partial uptodate
 folio
Date: Wed, 30 Oct 2024 17:04:01 +1030
Message-ID: 
 <0f3566744f3d7ea7d8220f53b16d17f53f8046d0.1730269807.git.wqu@suse.com>
X-Mailer: git-send-email 2.47.0
In-Reply-To: <cover.1730269807.git.wqu@suse.com>
References: <cover.1730269807.git.wqu@suse.com>
Precedence: bulk
X-Mailing-List: linux-btrfs@vger.kernel.org
List-Id: <linux-btrfs.vger.kernel.org>
List-Subscribe: <mailto:linux-btrfs+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-btrfs+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Spam-Level: 
X-Spamd-Result: default: False [-2.80 / 50.00];
	BAYES_HAM(-3.00)[100.00%];
	MID_CONTAINS_FROM(1.00)[];
	NEURAL_HAM_LONG(-1.00)[-1.000];
	R_MISSING_CHARSET(0.50)[];
	NEURAL_HAM_SHORT(-0.20)[-1.000];
	MIME_GOOD(-0.10)[text/plain];
	FUZZY_BLOCKED(0.00)[rspamd.com];
	RCVD_VIA_SMTP_AUTH(0.00)[];
	RCPT_COUNT_ONE(0.00)[1];
	ARC_NA(0.00)[];
	DKIM_SIGNED(0.00)[suse.com:s=susede1];
	DBL_BLOCKED_OPENRESOLVER(0.00)[suse.com:email,suse.com:mid];
	FROM_EQ_ENVFROM(0.00)[];
	FROM_HAS_DN(0.00)[];
	MIME_TRACE(0.00)[0:+];
	RCVD_COUNT_TWO(0.00)[2];
	TO_MATCH_ENVRCPT_ALL(0.00)[];
	TO_DN_NONE(0.00)[];
	PREVIOUSLY_DELIVERED(0.00)[linux-btrfs@vger.kernel.org];
	RCVD_TLS_ALL(0.00)[]
X-Spam-Score: -2.80
X-Spam-Flag: NO

[BUG]
This is for a deadlock only possible after the out-of-tree patch
"btrfs: allow buffered write to skip full page if it's sector aligned".

For now it's impossible to hit the deadlock, the reason will be
explained in [CAUSE] section.

If the sector size is smaller than page size, and we allow btrfs to
avoid reading the full page because the buffered write range is
sector aligned, we can hit a hang with generic/095 runs:

  __switch_to+0xf8/0x168
  __schedule+0x328/0x8a8
  schedule+0x54/0x140
  io_schedule+0x44/0x68
  folio_wait_bit_common+0x198/0x3f8
  __folio_lock+0x24/0x40
  extent_write_cache_pages+0x2e0/0x4c0 [btrfs]
  btrfs_writepages+0x94/0x158 [btrfs]
  do_writepages+0x74/0x190
  filemap_fdatawrite_wbc+0x88/0xc8
  __filemap_fdatawrite_range+0x6c/0xa8
  filemap_fdatawrite_range+0x1c/0x30
  btrfs_start_ordered_extent+0x264/0x2e0 [btrfs]
  btrfs_lock_and_flush_ordered_range+0x8c/0x160 [btrfs]
  __get_extent_map+0xa0/0x220 [btrfs]
  btrfs_do_readpage+0x1bc/0x5d8 [btrfs]
  btrfs_read_folio+0x50/0xa0 [btrfs]
  filemap_read_folio+0x54/0x110
  filemap_update_page+0x2e0/0x3b8
  filemap_get_pages+0x228/0x4d8
  filemap_read+0x11c/0x3b8
  btrfs_file_read_iter+0x74/0x90 [btrfs]
  new_sync_read+0xd0/0x1d0
  vfs_read+0x1a0/0x1f0

There is also the minimal fio reproducer extracted from that test case
to reproduce the deadlock:

  [global]
  bs=8k
  iodepth=1
  randrepeat=1
  size=256k
  directory=$mnt
  numjobs=1
  [job1]
  ioengine=sync
  bs=512
  direct=1
  rw=randread
  filename=file1
  [job2]
  ioengine=libaio
  rw=randwrite
  direct=1
  filename=file1
  [job3]
  ioengine=posixaio
  rw=randwrite
  filename=file1

[CAUSE]
The above call trace shows that, during the folio read a writeback is
triggered on the same folio.
And since during btrfs_do_readpage(), the folio is locked, the writeback
will never be able to lock the folio, thus it is waiting on itself thus
causing the deadlock.

The root cause is a little complex, the system is 64K page sized, with
4K sector size:

1) The folio has its range [48K, 64K) marked dirty by buffered write

   0          16K         32K          48K         64K
   |                                   |///////////|
                                             \- sector Uptodate|Dirty

2) Writeback finished for [48K, 64K), but ordered extent not yet finished

   0          16K         32K          48K         64K
   |                                   |///////////|
                                             \- sector Uptodate
					        extent map PINNED
						OE still here

3) The folio is released from page cache
   This can be triggered by direct IO through the following call chain:

   iomap_dio_rw()
   \- kiocb_invalidate_pages()
    \- filemap_invalidate_pages()
     \- invalidate_inode_pages2_range()
      \- invalidate_complete_folio2()
       \- filemap_release_folio()
        \- btrfs_release_folio()
	 \- __btrfs_release_folio()
	  \- try_release_extent_mapping()

   Since there is no extent state with EXTENT_LOCKED flag in the folio
   range, btrfs allows the folio to be released.
   Now there is no folio->private to record which block is uptodate.
   But extent map and OE are still here.

   0          16K         32K          48K         64K
   |                                   |///////////|
                                             \- extent map PINNED
						OE still here

4) Buffered write dirtied range [0, 16K)
   Since it's sector aligned, btrfs didn't read the full folio from disk.

   0          16K         32K          48K         64K
   |//////////|                        |///////////|
       \- sector Uptodate|Dirty              \- extent map PINNED
						OE still here

5) Read on the folio is triggered
   For the range [0, 16K), since it's already uptodate, btrfs skips this
   range.
   For the range [16K, 48K), btrfs submit the read from disk.

   The problem comes to the range [48K, 64K), the following call chain
   happens:

   btrfs_do_readpage()
   \- __get_extent_map()
    \- btrfs_lock_and_flush_ordered_range()
     \- btrfs_start_ordered_extent()
      \- filemap_fdatawrite_range()
       \- btrfs_writepages()
        \- extent_write_cache_pages()
	 \- folio_lock()

   Since the folio indeed has dirty sectors in range [0, 16K), the range
   will be written back.

   But the folio is already locked by the folio read, the writeback
   will never be able to lock the folio, thus lead to the deadlock.

This sequence can only happen if all the following conditions are met:

- The sector size is smaller than page size.
  Or we won't have mixed dirty blocks in the same folio we're reading.

- We allow the buffered write to skip the folio read if it's sector
  aligned.
  This is done by the incoming patch
  "btrfs: allow buffered write to skip full page if it's sector aligned".

  The ultimate goal of that patch is to reduce unnecessary read for sector
  size < page size cases, and to pass generic/563.

  Otherwise the folio will be read from the disk during buffered write,
  before marking it dirty.
  Thus will not trigger the deadlock.

[FIX]
Break the step 5) of the above case.

By passing an optional @locked_folio into btrfs_start_ordered_extent()
and btrfs_lock_and_flush_ordered_range().
If we got such locked folio skip the writeback for ranges of that folio.

Here we also do extra asserts to make sure the target
range is already not dirty, or the ordered extent we wait will never be
able to finish, since part of the ordered extent is never submitted.

So far only the call site inside __get_extent_map() is passing the new
parameter.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/defrag.c       |  2 +-
 fs/btrfs/direct-io.c    |  2 +-
 fs/btrfs/extent_io.c    |  3 +-
 fs/btrfs/file.c         |  8 ++---
 fs/btrfs/inode.c        |  6 ++--
 fs/btrfs/ordered-data.c | 67 ++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/ordered-data.h |  8 +++--
 7 files changed, 75 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index 1644470b9df7..2467990d6ac7 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -902,7 +902,7 @@ static struct folio *defrag_prepare_one_folio(struct btrfs_inode *inode, pgoff_t
 			break;
 
 		folio_unlock(folio);
-		btrfs_start_ordered_extent(ordered);
+		btrfs_start_ordered_extent(ordered, NULL);
 		btrfs_put_ordered_extent(ordered);
 		folio_lock(folio);
 		/*
diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index a7c3e221378d..2fb02aa19be0 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -103,7 +103,7 @@ static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
 			 */
 			if (writing ||
 			    test_bit(BTRFS_ORDERED_DIRECT, &ordered->flags))
-				btrfs_start_ordered_extent(ordered);
+				btrfs_start_ordered_extent(ordered, NULL);
 			else
 				ret = nowait ? -EAGAIN : -ENOTBLK;
 			btrfs_put_ordered_extent(ordered);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 9518a5c5192b..301604c3cf13 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -930,7 +930,8 @@ static struct extent_map *__get_extent_map(struct inode *inode,
 		*em_cached = NULL;
 	}
 
-	btrfs_lock_and_flush_ordered_range(BTRFS_I(inode), start, start + len - 1, &cached_state);
+	btrfs_lock_and_flush_ordered_range(BTRFS_I(inode), folio, start,
+					   start + len - 1, &cached_state);
 	em = btrfs_get_extent(BTRFS_I(inode), folio, start, len);
 	if (!IS_ERR(em)) {
 		BUG_ON(*em_cached);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index fbb753300071..e2c6165eba21 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -979,7 +979,7 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct folio *folio,
 				      cached_state);
 			folio_unlock(folio);
 			folio_put(folio);
-			btrfs_start_ordered_extent(ordered);
+			btrfs_start_ordered_extent(ordered, NULL);
 			btrfs_put_ordered_extent(ordered);
 			return -EAGAIN;
 		}
@@ -1047,8 +1047,8 @@ int btrfs_check_nocow_lock(struct btrfs_inode *inode, loff_t pos,
 			return -EAGAIN;
 		}
 	} else {
-		btrfs_lock_and_flush_ordered_range(inode, lockstart, lockend,
-						   &cached_state);
+		btrfs_lock_and_flush_ordered_range(inode, NULL, lockstart,
+						   lockend, &cached_state);
 	}
 	ret = can_nocow_extent(&inode->vfs_inode, lockstart, &num_bytes,
 			       NULL, nowait, false);
@@ -1884,7 +1884,7 @@ static vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
 		unlock_extent(io_tree, page_start, page_end, &cached_state);
 		folio_unlock(folio);
 		up_read(&BTRFS_I(inode)->i_mmap_lock);
-		btrfs_start_ordered_extent(ordered);
+		btrfs_start_ordered_extent(ordered, NULL);
 		btrfs_put_ordered_extent(ordered);
 		goto again;
 	}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2a39699ebbc9..6ec223f9987f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2771,7 +2771,7 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
 		unlock_extent(&inode->io_tree, page_start, page_end,
 			      &cached_state);
 		folio_unlock(folio);
-		btrfs_start_ordered_extent(ordered);
+		btrfs_start_ordered_extent(ordered, NULL);
 		btrfs_put_ordered_extent(ordered);
 		goto again;
 	}
@@ -4815,7 +4815,7 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
 		unlock_extent(io_tree, block_start, block_end, &cached_state);
 		folio_unlock(folio);
 		folio_put(folio);
-		btrfs_start_ordered_extent(ordered);
+		btrfs_start_ordered_extent(ordered, NULL);
 		btrfs_put_ordered_extent(ordered);
 		goto again;
 	}
@@ -4950,7 +4950,7 @@ int btrfs_cont_expand(struct btrfs_inode *inode, loff_t oldsize, loff_t size)
 	if (size <= hole_start)
 		return 0;
 
-	btrfs_lock_and_flush_ordered_range(inode, hole_start, block_end - 1,
+	btrfs_lock_and_flush_ordered_range(inode, NULL, hole_start, block_end - 1,
 					   &cached_state);
 	cur_offset = hole_start;
 	while (1) {
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 2104d60c2161..35b2c441ee6f 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -729,7 +729,7 @@ static void btrfs_run_ordered_extent_work(struct btrfs_work *work)
 	struct btrfs_ordered_extent *ordered;
 
 	ordered = container_of(work, struct btrfs_ordered_extent, flush_work);
-	btrfs_start_ordered_extent(ordered);
+	btrfs_start_ordered_extent(ordered, NULL);
 	complete(&ordered->completion);
 }
 
@@ -845,12 +845,14 @@ void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, u64 nr,
  * Wait on page writeback for all the pages in the extent and the IO completion
  * code to insert metadata into the btree corresponding to the extent.
  */
-void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry)
+void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry,
+				struct folio *locked_folio)
 {
 	u64 start = entry->file_offset;
 	u64 end = start + entry->num_bytes - 1;
 	struct btrfs_inode *inode = entry->inode;
 	bool freespace_inode;
+	bool skip_writeback = false;
 
 	trace_btrfs_ordered_extent_start(inode, entry);
 
@@ -860,13 +862,59 @@ void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry)
 	 */
 	freespace_inode = btrfs_is_free_space_inode(inode);
 
+	/*
+	 * The locked folio covers the ordered extent range and the full
+	 * folio is dirty.
+	 * We can not trigger writeback on it, as we will try to lock
+	 * the same folio we already hold.
+	 *
+	 * This only happens for sector size < page size case, and even
+	 * that happens we're still safe because this can only happen
+	 * when the range is submitted and finished, but OE is not yet
+	 * finished.
+	 */
+	if (locked_folio) {
+		const u64 skip_start = max_t(u64, folio_pos(locked_folio), start);
+		const u64 skip_end = min_t(u64,
+				folio_pos(locked_folio) + folio_size(locked_folio),
+				end + 1) - 1;
+
+		ASSERT(folio_test_locked(locked_folio));
+
+		/* The folio should intersect with the OE range. */
+		ASSERT(folio_pos(locked_folio) <= end ||
+		       folio_pos(locked_folio) + folio_size(locked_folio) > start);
+
+		/*
+		 * The range must not be dirty.
+		 *
+		 * Since we will skip writeback for the folio, if the involved range
+		 * is dirty the range will never be submitted, thus the ordered
+		 * extent we are going to wait will never finish, cause another deadlock.
+		 */
+		btrfs_folio_assert_not_dirty(inode->root->fs_info, locked_folio,
+					     skip_start, skip_end  + 1 - skip_start);
+		skip_writeback = true;
+	}
 	/*
 	 * pages in the range can be dirty, clean or writeback.  We
 	 * start IO on any dirty ones so the wait doesn't stall waiting
 	 * for the flusher thread to find them
 	 */
-	if (!test_bit(BTRFS_ORDERED_DIRECT, &entry->flags))
-		filemap_fdatawrite_range(inode->vfs_inode.i_mapping, start, end);
+	if (!test_bit(BTRFS_ORDERED_DIRECT, &entry->flags)) {
+		if (!skip_writeback) {
+			filemap_fdatawrite_range(inode->vfs_inode.i_mapping, start, end);
+		} else {
+			/* Need to skip the locked folio range. */
+			if (start < folio_pos(locked_folio))
+				filemap_fdatawrite_range(inode->vfs_inode.i_mapping,
+						start, folio_pos(locked_folio) - 1);
+			if (end + 1 > folio_pos(locked_folio) + folio_size(locked_folio))
+				filemap_fdatawrite_range(inode->vfs_inode.i_mapping,
+						folio_pos(locked_folio) + folio_size(locked_folio),
+						end);
+		}
+	}
 
 	if (!freespace_inode)
 		btrfs_might_wait_for_event(inode->root->fs_info, btrfs_ordered_extent);
@@ -921,7 +969,7 @@ int btrfs_wait_ordered_range(struct btrfs_inode *inode, u64 start, u64 len)
 			btrfs_put_ordered_extent(ordered);
 			break;
 		}
-		btrfs_start_ordered_extent(ordered);
+		btrfs_start_ordered_extent(ordered, NULL);
 		end = ordered->file_offset;
 		/*
 		 * If the ordered extent had an error save the error but don't
@@ -1141,6 +1189,8 @@ struct btrfs_ordered_extent *btrfs_lookup_first_ordered_range(
  * @inode:        Inode whose ordered tree is to be searched
  * @start:        Beginning of range to flush
  * @end:          Last byte of range to lock
+ * @locked_folio: If passed, will not start writeback of this folio, to avoid
+ *		  locking the same folio already locked by the caller.
  * @cached_state: If passed, will return the extent state responsible for the
  *                locked range. It's the caller's responsibility to free the
  *                cached state.
@@ -1148,8 +1198,9 @@ struct btrfs_ordered_extent *btrfs_lookup_first_ordered_range(
  * Always return with the given range locked, ensuring after it's called no
  * order extent can be pending.
  */
-void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
-					u64 end,
+void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode,
+					struct folio *locked_folio,
+					u64 start, u64 end,
 					struct extent_state **cached_state)
 {
 	struct btrfs_ordered_extent *ordered;
@@ -1174,7 +1225,7 @@ void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
 			break;
 		}
 		unlock_extent(&inode->io_tree, start, end, cachedp);
-		btrfs_start_ordered_extent(ordered);
+		btrfs_start_ordered_extent(ordered, locked_folio);
 		btrfs_put_ordered_extent(ordered);
 	}
 }
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 4e152736d06c..a4bb24572c73 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -191,7 +191,8 @@ void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
 			   struct btrfs_ordered_sum *sum);
 struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct btrfs_inode *inode,
 							 u64 file_offset);
-void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry);
+void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry,
+				struct folio *locked_folio);
 int btrfs_wait_ordered_range(struct btrfs_inode *inode, u64 start, u64 len);
 struct btrfs_ordered_extent *
 btrfs_lookup_first_ordered_extent(struct btrfs_inode *inode, u64 file_offset);
@@ -207,8 +208,9 @@ u64 btrfs_wait_ordered_extents(struct btrfs_root *root, u64 nr,
 			       const struct btrfs_block_group *bg);
 void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, u64 nr,
 			      const struct btrfs_block_group *bg);
-void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
-					u64 end,
+void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode,
+					struct folio *locked_folio,
+					u64 start, u64 end,
 					struct extent_state **cached_state);
 bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end,
 				  struct extent_state **cached_state);

From patchwork Wed Oct 30 06:34:02 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Qu Wenruo <wqu@suse.com>
X-Patchwork-Id: 13855971
Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.223.130])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id B033E1D5ABA
	for <linux-btrfs@vger.kernel.org>; Wed, 30 Oct 2024 06:34:44 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=195.135.223.130
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1730270087; cv=none;
 b=ejSnKu4eg7ROIdnHvf8LmyvGPpuoyfPN855ig1OcmTjJyrSD59GR9arOl+nyC9lV4sgYA3xuMtnGU0KSrpiFm3SnONZ90uOdDRMaVjgObtYXpXy3cBi4GhT2QDbYt1XYeaSwvbniDN3EHIeqi/8DxCO26ili9WvhBVhpMsrxWrA=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1730270087; c=relaxed/simple;
	bh=yKqZYZ5UIjIXGMen/bjiDHambgE1Fom9MWcHVTe20eg=;
	h=From:To:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=hBDgCR8AQc02UO0+txy0dyPm3yUUpVTACR9TxUHs0Bk8J4ykpnPznKW3Cn/PYKp0tVmyhsv4sQWvOMrtd/8ruSZ1WQLJ/rohdpn24K6VKmOXTSOyB9D5AXWkkDjCUHB6EXgkePCpj1Yebev9Rvg+AG7UUHeAdA3aXe3t4mJwUm8=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=suse.com;
 spf=pass smtp.mailfrom=suse.com;
 dkim=pass (1024-bit key) header.d=suse.com header.i=@suse.com
 header.b=ck3LCHaB;
 dkim=pass (1024-bit key) header.d=suse.com header.i=@suse.com
 header.b=ck3LCHaB; arc=none smtp.client-ip=195.135.223.130
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=suse.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=suse.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=suse.com header.i=@suse.com
 header.b="ck3LCHaB";
	dkim=pass (1024-bit key) header.d=suse.com header.i=@suse.com
 header.b="ck3LCHaB"
Received: from imap1.dmz-prg2.suse.org (imap1.dmz-prg2.suse.org
 [IPv6:2a07:de40:b281:104:10:150:64:97])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest
 SHA256)
	(No client certificate requested)
	by smtp-out1.suse.de (Postfix) with ESMTPS id DBC8521DC5
	for <linux-btrfs@vger.kernel.org>; Wed, 30 Oct 2024 06:34:42 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1;
	t=1730270082; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:
	 mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=DJTeygM7On7rSp3vXDJVZ6lcM2QSpdbOlSGvEHWoO6o=;
	b=ck3LCHaBH8ifgi1aI/DbVcQXog6owRNFif3V55J95Pm8NE1dEhPnlrq6izz5PWS2mtRq3A
	RAp/i1RlnjL1iktK9OimXaTv6U+ZNEMUHzawRzJKXsIJqINfPe19JPSiWNnBEcLn9P+BOV
	vtqcVJA+ivencM7nTr4U5MEIg/OBcGs=
Authentication-Results: smtp-out1.suse.de;
	dkim=pass header.d=suse.com header.s=susede1 header.b=ck3LCHaB
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1;
	t=1730270082; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:
	 mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=DJTeygM7On7rSp3vXDJVZ6lcM2QSpdbOlSGvEHWoO6o=;
	b=ck3LCHaBH8ifgi1aI/DbVcQXog6owRNFif3V55J95Pm8NE1dEhPnlrq6izz5PWS2mtRq3A
	RAp/i1RlnjL1iktK9OimXaTv6U+ZNEMUHzawRzJKXsIJqINfPe19JPSiWNnBEcLn9P+BOV
	vtqcVJA+ivencM7nTr4U5MEIg/OBcGs=
Received: from imap1.dmz-prg2.suse.org (localhost [127.0.0.1])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest
 SHA256)
	(No client certificate requested)
	by imap1.dmz-prg2.suse.org (Postfix) with ESMTPS id 25025136A5
	for <linux-btrfs@vger.kernel.org>; Wed, 30 Oct 2024 06:34:41 +0000 (UTC)
Received: from dovecot-director2.suse.de ([2a07:de40:b281:106:10:150:64:167])
	by imap1.dmz-prg2.suse.org with ESMTPSA
	id 8FRINoHTIWcFcwAAD6G6ig
	(envelope-from <wqu@suse.com>)
	for <linux-btrfs@vger.kernel.org>; Wed, 30 Oct 2024 06:34:41 +0000
From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Subject: [PATCH 6/6] btrfs: allow buffered write to skip full page if it's
 sector aligned
Date: Wed, 30 Oct 2024 17:04:02 +1030
Message-ID: 
 <a6c7bf333f651e7352ede4b2d3a5c60fd9196de7.1730269807.git.wqu@suse.com>
X-Mailer: git-send-email 2.47.0
In-Reply-To: <cover.1730269807.git.wqu@suse.com>
References: <cover.1730269807.git.wqu@suse.com>
Precedence: bulk
X-Mailing-List: linux-btrfs@vger.kernel.org
List-Id: <linux-btrfs.vger.kernel.org>
List-Subscribe: <mailto:linux-btrfs+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-btrfs+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Rspamd-Queue-Id: DBC8521DC5
X-Spam-Score: -3.01
X-Rspamd-Action: no action
X-Spamd-Result: default: False [-3.01 / 50.00];
	BAYES_HAM(-3.00)[100.00%];
	NEURAL_HAM_LONG(-1.00)[-1.000];
	MID_CONTAINS_FROM(1.00)[];
	R_MISSING_CHARSET(0.50)[];
	R_DKIM_ALLOW(-0.20)[suse.com:s=susede1];
	NEURAL_HAM_SHORT(-0.20)[-1.000];
	MIME_GOOD(-0.10)[text/plain];
	MX_GOOD(-0.01)[];
	TO_MATCH_ENVRCPT_ALL(0.00)[];
	RCVD_VIA_SMTP_AUTH(0.00)[];
	FROM_EQ_ENVFROM(0.00)[];
	ARC_NA(0.00)[];
	FROM_HAS_DN(0.00)[];
	RCPT_COUNT_ONE(0.00)[1];
	PREVIOUSLY_DELIVERED(0.00)[linux-btrfs@vger.kernel.org];
	RCVD_TLS_ALL(0.00)[];
	DBL_BLOCKED_OPENRESOLVER(0.00)[suse.com:dkim,suse.com:mid,suse.com:email];
	DKIM_SIGNED(0.00)[suse.com:s=susede1];
	FUZZY_BLOCKED(0.00)[rspamd.com];
	TO_DN_NONE(0.00)[];
	RCVD_COUNT_TWO(0.00)[2];
	MIME_TRACE(0.00)[0:+];
	DKIM_TRACE(0.00)[suse.com:+]
X-Rspamd-Server: rspamd1.dmz-prg2.suse.org
X-Spam-Flag: NO
X-Spam-Level: 

[BUG]
Since the support of sector size < page size for btrfs, test case
generic/563 fails with 4K sector size and 64K page size:

    --- tests/generic/563.out	2024-04-25 18:13:45.178550333 +0930
    +++ /home/adam/xfstests-dev/results//generic/563.out.bad	2024-09-30 09:09:16.155312379 +0930
    @@ -3,7 +3,8 @@
     read is in range
     write is in range
     write -> read/write
    -read is in range
    +read has value of 8388608
    +read is NOT in range -33792 .. 33792
     write is in range
    ...

[CAUSE]
The test case creates a 8MiB file, then buffered write into the 8MiB
using 4K block size, to overwrite the whole file.

On 4K page sized systems, since the write range covers the full sector and
page, btrfs will no bother reading the page, just like what XFS and EXT4
do.

But 64K page sized systems, although the write is sector aligned, it's
not page aligned, thus btrfs still goes the full page alignment check,
and read the full page out.

This causes extra data read, and fail the test case.

[FIX]
To skip the full page read, we need to do the following modification:

- Do not trigger full page read as long as the buffered write is sector
  aligned
  This is pretty simple by modifying the check inside
  prepare_uptodate_page().

- Skip already uptodate sectors during full page read
  Or we can lead to the following data corruption:

  0       32K        64K
  |///////|          |

  Where the file range [0, 32K) is dirtied by buffered write, the
  remaining range [32K, 64K) is not.

  When reading the full page, since [0,32K) is only dirtied but not
  written back, there is no data extent map for it, but a hole covering
  [0, 64k).

  If we continue reading the full page range [0, 64K), the dirtied range
  will be filled with 0 (since there is only a hole covering the whole
  range).
  This causes the dirtied range to get lost.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 4 ++++
 fs/btrfs/file.c      | 5 +++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 301604c3cf13..255d353058e8 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -991,6 +991,10 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 			end_folio_read(folio, true, cur, end - cur + 1);
 			break;
 		}
+		if (btrfs_folio_test_uptodate(fs_info, folio, cur, blocksize)) {
+			end_folio_read(folio, true, cur, blocksize);
+			continue;
+		}
 		em = __get_extent_map(inode, folio, cur, end - cur + 1,
 				      em_cached);
 		if (IS_ERR(em)) {
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index e2c6165eba21..6f485b8bda66 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -841,14 +841,15 @@ static int prepare_uptodate_folio(struct inode *inode, struct folio *folio, u64
 {
 	u64 clamp_start = max_t(u64, pos, folio_pos(folio));
 	u64 clamp_end = min_t(u64, pos + len, folio_pos(folio) + folio_size(folio));
+	const u32 sectorsize = inode_to_fs_info(inode)->sectorsize;
 	int ret = 0;
 
 	if (folio_test_uptodate(folio))
 		return 0;
 
 	if (!force_uptodate &&
-	    IS_ALIGNED(clamp_start, PAGE_SIZE) &&
-	    IS_ALIGNED(clamp_end, PAGE_SIZE))
+	    IS_ALIGNED(clamp_start, sectorsize) &&
+	    IS_ALIGNED(clamp_end, sectorsize))
 		return 0;
 
 	ret = btrfs_read_folio(NULL, folio);