From patchwork Tue Jun 27 22:44:05 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Dave Chinner <david@fromorbit.com>
X-Patchwork-Id: 13295068
Return-Path: <linux-xfs-owner@vger.kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id E3925C001B0
	for <linux-xfs@archiver.kernel.org>; Tue, 27 Jun 2023 22:44:26 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S229895AbjF0WoZ (ORCPT <rfc822;linux-xfs@archiver.kernel.org>);
        Tue, 27 Jun 2023 18:44:25 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60312 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S229501AbjF0WoX (ORCPT
        <rfc822;linux-xfs@vger.kernel.org>); Tue, 27 Jun 2023 18:44:23 -0400
Received: from mail-pl1-x62a.google.com (mail-pl1-x62a.google.com
 [IPv6:2607:f8b0:4864:20::62a])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 64B8E272D
        for <linux-xfs@vger.kernel.org>; Tue, 27 Jun 2023 15:44:19 -0700 (PDT)
Received: by mail-pl1-x62a.google.com with SMTP id
 d9443c01a7336-1b51488ad67so28100925ad.3
        for <linux-xfs@vger.kernel.org>; Tue, 27 Jun 2023 15:44:19 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=fromorbit-com.20221208.gappssmtp.com; s=20221208; t=1687905859;
 x=1690497859;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:to:from:from:to:cc:subject:date:message-id
         :reply-to;
        bh=8Pt9hTh40OmTegOekLcRzx5llx2d/nej7rlH4ks8rf4=;
        b=mom7otQF0a2HN08fs9kdMIgmJR1YOyc7Tea2SHuWXX277jsbkhuVfpETqFNatTyNEo
         AQSI2f8ASRNhYHzvceu4kph06vQ9//cIGRUPu3xNWbSnbyPHuU7XAtT2PYethdfj01EV
         N2Za+WmccMR9ygztBScc17aLFOZ2Q9TGESZ5+lwKcwXqxywleVBV3R8cFpImntM0EH8/
         SwHB0mpiChMVeHii+VLrU4xW8vdROZFJ8Y5CAzNXbai9ren1iIORRIiUreLRTT7mUAJq
         9Njj2k3vYMgVmEdQuDvjoJTHUi6Krhis4xm+DbGLlV2uKhFf6Ab9AaVwlSdWi3+NAgOR
         xMFA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20221208; t=1687905859; x=1690497859;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:to:from:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=8Pt9hTh40OmTegOekLcRzx5llx2d/nej7rlH4ks8rf4=;
        b=WavzeHeZwkwzCCNhDLpiBSbAiJXtAxU4uS83sAHukj7pEdvLmvhMpgxbmH/IUJveY7
         oIy3WS5HQCnniTZup4nF/LhkDzSMp9516Y1kUhKtS7iSIGtgBNi/cpt2MT0r6sNrwzmw
         PeVh4idqBgoaQfZecvLfH37CMsQ4iUF+NmxDdc+cwBgeXaL8tY4D5n3qZc5dhjHyjRf4
         pGiYDWTjkIjcVLpyLRSjlWSECAHLtjSJnQDBfRToxLhjY4duJDby+7xtAieSNS0yi64N
         b5ZSWvpDH9s0kjIksqK2OujAcCabRDRVbultip+V/kIWsWV5zGXmlbT5U3YXTngs2R9R
         UIXQ==
X-Gm-Message-State: AC+VfDzLcd878wxDUen8WNnYEAiNF24UdfoRp9QZqx+qPSCBYk9Vg7W8
        fSWKM+eyze7KwZ3RlFEHb+JZyOwvwnVlp2o5KYg=
X-Google-Smtp-Source: 
 ACHHUZ6haVb/ehzeHZFWdzfjctyqRHDgzI7KB0p32JOJcvehJ+yGPjuuQe+Ov6hZmK7jPUD10lmF7A==
X-Received: by 2002:a17:903:2343:b0:1b5:54cc:fcb7 with SMTP id
 c3-20020a170903234300b001b554ccfcb7mr9955372plh.34.1687905858863;
        Tue, 27 Jun 2023 15:44:18 -0700 (PDT)
Received: from dread.disaster.area (pa49-186-94-37.pa.vic.optusnet.com.au.
 [49.186.94.37])
        by smtp.gmail.com with ESMTPSA id
 jn16-20020a170903051000b001b53472353csm6403955plb.211.2023.06.27.15.44.17
        for <linux-xfs@vger.kernel.org>
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Tue, 27 Jun 2023 15:44:17 -0700 (PDT)
Received: from [192.168.253.23] (helo=devoid.disaster.area)
        by dread.disaster.area with esmtp (Exim 4.96)
        (envelope-from <dave@fromorbit.com>)
        id 1qEHPz-00GzwS-0z
        for linux-xfs@vger.kernel.org;
        Wed, 28 Jun 2023 08:44:15 +1000
Received: from dave by devoid.disaster.area with local (Exim 4.96)
        (envelope-from <dave@devoid.disaster.area>)
        id 1qEHPy-009Zm7-37
        for linux-xfs@vger.kernel.org;
        Wed, 28 Jun 2023 08:44:14 +1000
From: Dave Chinner <david@fromorbit.com>
To: linux-xfs@vger.kernel.org
Subject: [PATCH 1/8] xfs: don't reverse order of items in bulk AIL insertion
Date: Wed, 28 Jun 2023 08:44:05 +1000
Message-Id: <20230627224412.2242198-2-david@fromorbit.com>
X-Mailer: git-send-email 2.40.1
In-Reply-To: <20230627224412.2242198-1-david@fromorbit.com>
References: <20230627224412.2242198-1-david@fromorbit.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <linux-xfs.vger.kernel.org>
X-Mailing-List: linux-xfs@vger.kernel.org

From: Dave Chinner <dchinner@redhat.com>

XFS has strict metadata ordering requirements. One of the things it
does is maintain the commit order of items from transaction commit
through the CIL and into the AIL. That is, if a transaction logs
item A before item B in a modification, then they will be inserted
into the CIL in the order {A, B}. These items are then written into
the iclog during checkpointing in the order {A, B}. When the
checkpoint commits, they are supposed to be inserted into the AIL in
the order {A, B}, and when they are pushed from the AIL, they are
pushed in the order {A, B}.

If we crash, log recovery then replays the two items from the
checkpoint in the order {A, B}, resulting in the objects the items
apply to being queued for writeback at the end of the checkpoint
in the order {A, B}. This means recovery behaves the same way as the
runtime code.

In places, we have subtle dependencies on this ordering being
maintained. One of this place is performing intent recovery from the
log. It assumes that recovering an intent will result in a
non-intent object being the first thing that is modified in the
recovery transaction, and so when the transaction commits and the
journal flushes, the first object inserted into the AIL beyond the
intent recovery range will be a non-intent item.  It uses the
transistion from intent items to non-intent items to stop the
recovery pass.

A recent log recovery issue indicated that an intent was appearing
as the first item in the AIL beyond the recovery range, hence
breaking the end of recovery detection that exists.

Tracing indicated insertion of the items into the AIL was apparently
occurring in the right order (the intent was last in the commit item
list), but the intent was appearing first in the AIL. IOWs, the
order of items in the AIL was {D,C,B,A}, not {A,B,C,D}, and bulk
insertion was reversing the order of the items in the batch of items
being inserted.

Lucky for us, all the items fed to bulk insertion have the same LSN,
so the reversal of order does not affect the log head/tail tracking
that is based on the contents of the AIL. It only impacts on code
that has implicit, subtle dependencies on object order, and AFAICT
only the intent recovery loop is impacted by it.

Make sure bulk AIL insertion does not reorder items incorrectly.

Fixes: 0e57f6a36f9b ("xfs: bulk AIL insertion during transaction commit")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_trans_ail.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 7d4109af193e..1098452e7f95 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -823,7 +823,7 @@ xfs_trans_ail_update_bulk(
 			trace_xfs_ail_insert(lip, 0, lsn);
 		}
 		lip->li_lsn = lsn;
-		list_add(&lip->li_ail, &tmp);
+		list_add_tail(&lip->li_ail, &tmp);
 	}
 
 	if (!list_empty(&tmp))

From patchwork Tue Jun 27 22:44:06 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Dave Chinner <david@fromorbit.com>
X-Patchwork-Id: 13295074
Return-Path: <linux-xfs-owner@vger.kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 9496CEB64DC
	for <linux-xfs@archiver.kernel.org>; Tue, 27 Jun 2023 22:44:31 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S230004AbjF0Wo3 (ORCPT <rfc822;linux-xfs@archiver.kernel.org>);
        Tue, 27 Jun 2023 18:44:29 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60370 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S229501AbjF0Wo2 (ORCPT
        <rfc822;linux-xfs@vger.kernel.org>); Tue, 27 Jun 2023 18:44:28 -0400
Received: from mail-oi1-x231.google.com (mail-oi1-x231.google.com
 [IPv6:2607:f8b0:4864:20::231])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7B51F2945
        for <linux-xfs@vger.kernel.org>; Tue, 27 Jun 2023 15:44:22 -0700 (PDT)
Received: by mail-oi1-x231.google.com with SMTP id
 5614622812f47-3a0423ea749so4126043b6e.0
        for <linux-xfs@vger.kernel.org>; Tue, 27 Jun 2023 15:44:22 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=fromorbit-com.20221208.gappssmtp.com; s=20221208; t=1687905861;
 x=1690497861;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:to:from:from:to:cc:subject:date:message-id
         :reply-to;
        bh=HQFgV6ssEPI/mbuynXLID8AB1DFeefdovCY9d+I6BuQ=;
        b=QClWJY1CdrPrdWcb5pjZWkFt47ZfzXO6KAVf3efpVcz0hTmTmEB4kFALdPAEMjRal8
         lPkimnNmD/eKZK3jGmVuOoafB+vJCvEUeKpGC0887DfIPHAi95tJb5wYB/TdIoyZe0lY
         f2/bXcuCPKsbfSawx89zF8xUKNg1dBq8lnZ/UmfMbe6isWyHPkXLkhT4LA4i2NGm9QYI
         /ec7EfJpPqZ0N4nV8jaXhXSS8nT6aYq1Rw6lUyLu3Xgg4c2xRxMb2FAKjTelr7RdDmfO
         J0nqkrTub/7/6bfStYdDPZ5pxthkC6XA1pZdqh1JNYgGXP/oFiiVvaoPDPd4sicpW2Kd
         9Ygw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20221208; t=1687905861; x=1690497861;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:to:from:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=HQFgV6ssEPI/mbuynXLID8AB1DFeefdovCY9d+I6BuQ=;
        b=Tk8X5XAj853vTgLtCZjYzDm8yHH3u2o/MjqzJVRECPeKQ/DCNPw+nuMsnZOL6j4lDZ
         75R9MbVsFfwjtQKmsc0qtts2Fqj5EMSL6QHLSL42wQmrpOFC97+Qxzu3c0GTYKNpTDmB
         CI71jFaw596OkJAbQ+MtVxnyIvoUcjLg+GRnVeIvgPbmmDwILzFVu7YWC6SfJLsFyJLG
         rJy7sESyIYcTrLbYkHYfv4dViSQFEQB3Gk2qm250XbLO0RURRjxNQEl3jyYGWFLqh/pg
         WGHj+ErOKGbZ1wZGFur0l4oWo3ueCKb2lWNgYg/AlyIVirm0BdyS8ug81jix1pevDiJ8
         YcZA==
X-Gm-Message-State: AC+VfDxyb1BejIYWFxSeqRd8lp+sAc0J7u5Iu038um+xiWS6yHutu2on
        ApnHE/QrNBKj/Ou2S1M/Dbx9r5oVxfB9ziCUk7w=
X-Google-Smtp-Source: 
 ACHHUZ6d4ehbGkUOlK4xHgqtEh9YG6Z1t9rHGr6CbaVatR/Woe8u/fii0I+NxjmB1tZ1G2cGOpQzeg==
X-Received: by 2002:a05:6808:1983:b0:3a1:e222:97db with SMTP id
 bj3-20020a056808198300b003a1e22297dbmr6994023oib.30.1687905861650;
        Tue, 27 Jun 2023 15:44:21 -0700 (PDT)
Received: from dread.disaster.area (pa49-186-94-37.pa.vic.optusnet.com.au.
 [49.186.94.37])
        by smtp.gmail.com with ESMTPSA id
 29-20020a17090a199d00b0026334010a4fsm520032pji.35.2023.06.27.15.44.18
        for <linux-xfs@vger.kernel.org>
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Tue, 27 Jun 2023 15:44:20 -0700 (PDT)
Received: from [192.168.253.23] (helo=devoid.disaster.area)
        by dread.disaster.area with esmtp (Exim 4.96)
        (envelope-from <dave@fromorbit.com>)
        id 1qEHPz-00GzwV-1C
        for linux-xfs@vger.kernel.org;
        Wed, 28 Jun 2023 08:44:15 +1000
Received: from dave by devoid.disaster.area with local (Exim 4.96)
        (envelope-from <dave@devoid.disaster.area>)
        id 1qEHPz-009ZmB-01
        for linux-xfs@vger.kernel.org;
        Wed, 28 Jun 2023 08:44:15 +1000
From: Dave Chinner <david@fromorbit.com>
To: linux-xfs@vger.kernel.org
Subject: [PATCH 2/8] xfs: use deferred frees for btree block freeing
Date: Wed, 28 Jun 2023 08:44:06 +1000
Message-Id: <20230627224412.2242198-3-david@fromorbit.com>
X-Mailer: git-send-email 2.40.1
In-Reply-To: <20230627224412.2242198-1-david@fromorbit.com>
References: <20230627224412.2242198-1-david@fromorbit.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <linux-xfs.vger.kernel.org>
X-Mailing-List: linux-xfs@vger.kernel.org

From: Dave Chinner <dchinner@redhat.com>

Btrees that aren't freespace management trees use the normal extent
allocation and freeing routines for their blocks. Hence when a btree
block is freed, a direct call to xfs_free_extent() is made and the
extent is immediately freed. This puts the entire free space
management btrees under this path, so we are stacking btrees on
btrees in the call stack. The inobt, finobt and refcount btrees
all do this.

However, the bmap btree does not do this - it calls
xfs_free_extent_later() to defer the extent free operation via an
XEFI and hence it gets processed in deferred operation processing
during the commit of the primary transaction (i.e. via intent
chaining).

We need to change xfs_free_extent() to behave in a non-blocking
manner so that we can avoid deadlocks with busy extents near ENOSPC
in transactions that free multiple extents. Inserting or removing a
record from a btree can cause a multi-level tree merge operation and
that will free multiple blocks from the btree in a single
transaction. i.e. we can call xfs_free_extent() multiple times, and
hence the btree manipulation transaction is vulnerable to this busy
extent deadlock vector.

To fix this, convert all the remaining callers of xfs_free_extent()
to use xfs_free_extent_later() to queue XEFIs and hence defer
processing of the extent frees to a context that can be safely
restarted if a deadlock condition is detected.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
---
 fs/xfs/libxfs/xfs_ag.c             | 2 +-
 fs/xfs/libxfs/xfs_alloc.c          | 4 ++++
 fs/xfs/libxfs/xfs_alloc.h          | 8 +++++---
 fs/xfs/libxfs/xfs_bmap.c           | 8 +++++---
 fs/xfs/libxfs/xfs_bmap_btree.c     | 3 ++-
 fs/xfs/libxfs/xfs_ialloc.c         | 8 ++++----
 fs/xfs/libxfs/xfs_ialloc_btree.c   | 3 +--
 fs/xfs/libxfs/xfs_refcount.c       | 9 ++++++---
 fs/xfs/libxfs/xfs_refcount_btree.c | 8 +-------
 fs/xfs/xfs_extfree_item.c          | 3 ++-
 fs/xfs/xfs_reflink.c               | 3 ++-
 11 files changed, 33 insertions(+), 26 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
index ee84835ebc66..e9cc481b4ddf 100644
--- a/fs/xfs/libxfs/xfs_ag.c
+++ b/fs/xfs/libxfs/xfs_ag.c
@@ -985,7 +985,7 @@ xfs_ag_shrink_space(
 			goto resv_err;
 
 		err2 = __xfs_free_extent_later(*tpp, args.fsbno, delta, NULL,
-				true);
+				XFS_AG_RESV_NONE, true);
 		if (err2)
 			goto resv_err;
 
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index c20fe99405d8..cc3f7b905ea1 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2449,6 +2449,7 @@ xfs_defer_agfl_block(
 	xefi->xefi_startblock = XFS_AGB_TO_FSB(mp, agno, agbno);
 	xefi->xefi_blockcount = 1;
 	xefi->xefi_owner = oinfo->oi_owner;
+	xefi->xefi_type = XFS_AG_RESV_AGFL;
 
 	if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbno(mp, xefi->xefi_startblock)))
 		return -EFSCORRUPTED;
@@ -2470,6 +2471,7 @@ __xfs_free_extent_later(
 	xfs_fsblock_t			bno,
 	xfs_filblks_t			len,
 	const struct xfs_owner_info	*oinfo,
+	enum xfs_ag_resv_type		type,
 	bool				skip_discard)
 {
 	struct xfs_extent_free_item	*xefi;
@@ -2490,6 +2492,7 @@ __xfs_free_extent_later(
 	ASSERT(agbno + len <= mp->m_sb.sb_agblocks);
 #endif
 	ASSERT(xfs_extfree_item_cache != NULL);
+	ASSERT(type != XFS_AG_RESV_AGFL);
 
 	if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbext(mp, bno, len)))
 		return -EFSCORRUPTED;
@@ -2498,6 +2501,7 @@ __xfs_free_extent_later(
 			       GFP_KERNEL | __GFP_NOFAIL);
 	xefi->xefi_startblock = bno;
 	xefi->xefi_blockcount = (xfs_extlen_t)len;
+	xefi->xefi_type = type;
 	if (skip_discard)
 		xefi->xefi_flags |= XFS_EFI_SKIP_DISCARD;
 	if (oinfo) {
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 85ac470be0da..121faf1e11ad 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -232,7 +232,7 @@ xfs_buf_to_agfl_bno(
 
 int __xfs_free_extent_later(struct xfs_trans *tp, xfs_fsblock_t bno,
 		xfs_filblks_t len, const struct xfs_owner_info *oinfo,
-		bool skip_discard);
+		enum xfs_ag_resv_type type, bool skip_discard);
 
 /*
  * List of extents to be free "later".
@@ -245,6 +245,7 @@ struct xfs_extent_free_item {
 	xfs_extlen_t		xefi_blockcount;/* number of blocks in extent */
 	struct xfs_perag	*xefi_pag;
 	unsigned int		xefi_flags;
+	enum xfs_ag_resv_type	xefi_type;
 };
 
 void xfs_extent_free_get_group(struct xfs_mount *mp,
@@ -259,9 +260,10 @@ xfs_free_extent_later(
 	struct xfs_trans		*tp,
 	xfs_fsblock_t			bno,
 	xfs_filblks_t			len,
-	const struct xfs_owner_info	*oinfo)
+	const struct xfs_owner_info	*oinfo,
+	enum xfs_ag_resv_type		type)
 {
-	return __xfs_free_extent_later(tp, bno, len, oinfo, false);
+	return __xfs_free_extent_later(tp, bno, len, oinfo, type, false);
 }
 
 
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index fef35696adb7..30c931b38853 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -574,7 +574,8 @@ xfs_bmap_btree_to_extents(
 		return error;
 
 	xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, whichfork);
-	error = xfs_free_extent_later(cur->bc_tp, cbno, 1, &oinfo);
+	error = xfs_free_extent_later(cur->bc_tp, cbno, 1, &oinfo,
+			XFS_AG_RESV_NONE);
 	if (error)
 		return error;
 
@@ -5236,8 +5237,9 @@ xfs_bmap_del_extent_real(
 		} else {
 			error = __xfs_free_extent_later(tp, del->br_startblock,
 					del->br_blockcount, NULL,
-					(bflags & XFS_BMAPI_NODISCARD) ||
-					del->br_state == XFS_EXT_UNWRITTEN);
+					XFS_AG_RESV_NONE,
+					((bflags & XFS_BMAPI_NODISCARD) ||
+					del->br_state == XFS_EXT_UNWRITTEN));
 			if (error)
 				goto done;
 		}
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index 36564ae3084f..bf3f1b36fdd2 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -271,7 +271,8 @@ xfs_bmbt_free_block(
 	int			error;
 
 	xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, cur->bc_ino.whichfork);
-	error = xfs_free_extent_later(cur->bc_tp, fsbno, 1, &oinfo);
+	error = xfs_free_extent_later(cur->bc_tp, fsbno, 1, &oinfo,
+			XFS_AG_RESV_NONE);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 34600f94c2f4..1e5fafbc0cdb 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -1853,8 +1853,8 @@ xfs_difree_inode_chunk(
 		/* not sparse, calculate extent info directly */
 		return xfs_free_extent_later(tp,
 				XFS_AGB_TO_FSB(mp, agno, sagbno),
-				M_IGEO(mp)->ialloc_blks,
-				&XFS_RMAP_OINFO_INODES);
+				M_IGEO(mp)->ialloc_blks, &XFS_RMAP_OINFO_INODES,
+				XFS_AG_RESV_NONE);
 	}
 
 	/* holemask is only 16-bits (fits in an unsigned long) */
@@ -1899,8 +1899,8 @@ xfs_difree_inode_chunk(
 		ASSERT(agbno % mp->m_sb.sb_spino_align == 0);
 		ASSERT(contigblk % mp->m_sb.sb_spino_align == 0);
 		error = xfs_free_extent_later(tp,
-				XFS_AGB_TO_FSB(mp, agno, agbno),
-				contigblk, &XFS_RMAP_OINFO_INODES);
+				XFS_AGB_TO_FSB(mp, agno, agbno), contigblk,
+				&XFS_RMAP_OINFO_INODES, XFS_AG_RESV_NONE);
 		if (error)
 			return error;
 
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index 5a945ae21b5d..9258f01c0015 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -160,8 +160,7 @@ __xfs_inobt_free_block(
 
 	xfs_inobt_mod_blockcount(cur, -1);
 	fsbno = XFS_DADDR_TO_FSB(cur->bc_mp, xfs_buf_daddr(bp));
-	return xfs_free_extent(cur->bc_tp, cur->bc_ag.pag,
-			XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1,
+	return xfs_free_extent_later(cur->bc_tp, fsbno, 1,
 			&XFS_RMAP_OINFO_INOBT, resv);
 }
 
diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index b6e21433925c..70ab113c9cea 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -1152,7 +1152,8 @@ xfs_refcount_adjust_extents(
 						cur->bc_ag.pag->pag_agno,
 						tmp.rc_startblock);
 				error = xfs_free_extent_later(cur->bc_tp, fsbno,
-						  tmp.rc_blockcount, NULL);
+						  tmp.rc_blockcount, NULL,
+						  XFS_AG_RESV_NONE);
 				if (error)
 					goto out_error;
 			}
@@ -1213,7 +1214,8 @@ xfs_refcount_adjust_extents(
 					cur->bc_ag.pag->pag_agno,
 					ext.rc_startblock);
 			error = xfs_free_extent_later(cur->bc_tp, fsbno,
-					ext.rc_blockcount, NULL);
+					ext.rc_blockcount, NULL,
+					XFS_AG_RESV_NONE);
 			if (error)
 				goto out_error;
 		}
@@ -1981,7 +1983,8 @@ xfs_refcount_recover_cow_leftovers(
 
 		/* Free the block. */
 		error = xfs_free_extent_later(tp, fsb,
-				rr->rr_rrec.rc_blockcount, NULL);
+				rr->rr_rrec.rc_blockcount, NULL,
+				XFS_AG_RESV_NONE);
 		if (error)
 			goto out_trans;
 
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index d4afc5f4e6a5..5c3987d8dc24 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -106,19 +106,13 @@ xfs_refcountbt_free_block(
 	struct xfs_buf		*agbp = cur->bc_ag.agbp;
 	struct xfs_agf		*agf = agbp->b_addr;
 	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, xfs_buf_daddr(bp));
-	int			error;
 
 	trace_xfs_refcountbt_free_block(cur->bc_mp, cur->bc_ag.pag->pag_agno,
 			XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1);
 	be32_add_cpu(&agf->agf_refcount_blocks, -1);
 	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_REFCOUNT_BLOCKS);
-	error = xfs_free_extent(cur->bc_tp, cur->bc_ag.pag,
-			XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1,
+	return xfs_free_extent_later(cur->bc_tp, fsbno, 1,
 			&XFS_RMAP_OINFO_REFC, XFS_AG_RESV_METADATA);
-	if (error)
-		return error;
-
-	return error;
 }
 
 STATIC int
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index f9e36b810663..79e65bb6b0a2 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -365,7 +365,7 @@ xfs_trans_free_extent(
 			agbno, xefi->xefi_blockcount);
 
 	error = __xfs_free_extent(tp, xefi->xefi_pag, agbno,
-			xefi->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE,
+			xefi->xefi_blockcount, &oinfo, xefi->xefi_type,
 			xefi->xefi_flags & XFS_EFI_SKIP_DISCARD);
 
 	/*
@@ -644,6 +644,7 @@ xfs_efi_item_recover(
 	for (i = 0; i < efip->efi_format.efi_nextents; i++) {
 		struct xfs_extent_free_item	fake = {
 			.xefi_owner		= XFS_RMAP_OWN_UNKNOWN,
+			.xefi_type		= XFS_AG_RESV_NONE,
 		};
 		struct xfs_extent		*extp;
 
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index abcc559f3c64..eb9102453aff 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -617,7 +617,8 @@ xfs_reflink_cancel_cow_blocks(
 					del.br_blockcount);
 
 			error = xfs_free_extent_later(*tpp, del.br_startblock,
-					  del.br_blockcount, NULL);
+					del.br_blockcount, NULL,
+					XFS_AG_RESV_NONE);
 			if (error)
 				break;
 

From patchwork Tue Jun 27 22:44:07 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Dave Chinner <david@fromorbit.com>
X-Patchwork-Id: 13295073
Return-Path: <linux-xfs-owner@vger.kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id BBFF5EB64D9
	for <linux-xfs@archiver.kernel.org>; Tue, 27 Jun 2023 22:44:30 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S229987AbjF0Wo3 (ORCPT <rfc822;linux-xfs@archiver.kernel.org>);
        Tue, 27 Jun 2023 18:44:29 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60366 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S229995AbjF0Wo0 (ORCPT
        <rfc822;linux-xfs@vger.kernel.org>); Tue, 27 Jun 2023 18:44:26 -0400
Received: from mail-pj1-x102f.google.com (mail-pj1-x102f.google.com
 [IPv6:2607:f8b0:4864:20::102f])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C4EFB26A3
        for <linux-xfs@vger.kernel.org>; Tue, 27 Jun 2023 15:44:20 -0700 (PDT)
Received: by mail-pj1-x102f.google.com with SMTP id
 98e67ed59e1d1-2632a72f289so567226a91.2
        for <linux-xfs@vger.kernel.org>; Tue, 27 Jun 2023 15:44:20 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=fromorbit-com.20221208.gappssmtp.com; s=20221208; t=1687905860;
 x=1690497860;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:to:from:from:to:cc:subject:date:message-id
         :reply-to;
        bh=aWhBapU75WqXp3rBQnncpYPP9KljAfbysUq2fskWm48=;
        b=pIvizUS4zLEGKzIiqdgB/83Lact5bQ5Fyge++x94J8tX2sD7ggX+vjaEJmG9ixMA0s
         vABHdUaTo7yY91IRYcwgQSrwa6tGKgucaU16fNxrRNGnk/0vLGrbmSJZagbLlXjHz5Lo
         qFcOdU7+23KTT43z9v6Q/UT4NKvtwYhc9RpIGrFT2wLzK9JAgDCwNWeh3uq1GX1WCEfb
         tbHRf/aiyQJjokoszosK7o5U20Otcwf8OffzahbK1NEpF6KOZwJnAmMsmp+69J/PfDG8
         0quTAY9mmJdol04BNsoW1YOYICWyUhxc/MgURtgyVswN1G7RhbFqxtqxHWaMnd1zEh3I
         tLNg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20221208; t=1687905860; x=1690497860;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:to:from:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=aWhBapU75WqXp3rBQnncpYPP9KljAfbysUq2fskWm48=;
        b=XgQ9ABRSMafadhaTZYm0J1LJS9C13DnuTCjY8eNKPb9EhH17kkITxwiK0M1VXbQj0e
         C4lmY51ZXa0Lh4zBwQUq4P05n+jGtAJpmg6xmGz8nFSIbkdHr9f55PE2v4NKJVMCRVVC
         flWtCKvfrSQMFeOd9WN/b+u3h1+AWoPPvZgI3EgB34TF/beLxRl4IQq2j77GHG7WRXkT
         PgwjEelBW5w21yp2eyOMyWrHIosE3q12JmkXg1HogOc18QFCcGFvkTZiVaqHWvQ7zxGq
         XUTyCIaA35Sh0N9XE9BN7VK9NXuBYStSeTy5bq5azIxA+uuCp7GOh40I/ZQwzMb81v/Y
         5HWQ==
X-Gm-Message-State: AC+VfDwfCp0hBaIDWRTK54oKyq0HyP/saOoBpJFDufI9RV8nNDlXCUKe
        +3lEYCRXvB2cAoo2VMkq24KiCxSrP4pu49fzaIA=
X-Google-Smtp-Source: 
 ACHHUZ7iVuIG7upCLL//q7OessuDj8ZOwLRAbumCN/soSxUXhgtiWGhngKRBtIOK1bWrnPN6BxHxrA==
X-Received: by 2002:a17:90a:898c:b0:256:cf39:afce with SMTP id
 v12-20020a17090a898c00b00256cf39afcemr22523164pjn.43.1687905860172;
        Tue, 27 Jun 2023 15:44:20 -0700 (PDT)
Received: from dread.disaster.area (pa49-186-94-37.pa.vic.optusnet.com.au.
 [49.186.94.37])
        by smtp.gmail.com with ESMTPSA id
 v8-20020a17090ad58800b00256a4d59bfasm8326551pju.23.2023.06.27.15.44.17
        for <linux-xfs@vger.kernel.org>
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Tue, 27 Jun 2023 15:44:18 -0700 (PDT)
Received: from [192.168.253.23] (helo=devoid.disaster.area)
        by dread.disaster.area with esmtp (Exim 4.96)
        (envelope-from <dave@fromorbit.com>)
        id 1qEHPz-00GzwY-1J
        for linux-xfs@vger.kernel.org;
        Wed, 28 Jun 2023 08:44:15 +1000
Received: from dave by devoid.disaster.area with local (Exim 4.96)
        (envelope-from <dave@devoid.disaster.area>)
        id 1qEHPz-009ZmG-0B
        for linux-xfs@vger.kernel.org;
        Wed, 28 Jun 2023 08:44:15 +1000
From: Dave Chinner <david@fromorbit.com>
To: linux-xfs@vger.kernel.org
Subject: [PATCH 3/8] xfs: pass alloc flags through to xfs_extent_busy_flush()
Date: Wed, 28 Jun 2023 08:44:07 +1000
Message-Id: <20230627224412.2242198-4-david@fromorbit.com>
X-Mailer: git-send-email 2.40.1
In-Reply-To: <20230627224412.2242198-1-david@fromorbit.com>
References: <20230627224412.2242198-1-david@fromorbit.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <linux-xfs.vger.kernel.org>
X-Mailing-List: linux-xfs@vger.kernel.org

From: Dave Chinner <dchinner@redhat.com>

To avoid blocking in xfs_extent_busy_flush() when freeing extents
and the only busy extents are held by the current transaction, we
need to pass the XFS_ALLOC_FLAG_FREEING flag context all the way
into xfs_extent_busy_flush().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
---
 fs/xfs/libxfs/xfs_alloc.c | 96 +++++++++++++++++++++------------------
 fs/xfs/libxfs/xfs_alloc.h |  2 +-
 fs/xfs/xfs_extent_busy.c  |  3 +-
 fs/xfs/xfs_extent_busy.h  |  2 +-
 4 files changed, 56 insertions(+), 47 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index cc3f7b905ea1..ba929f6b5c9b 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -1536,7 +1536,8 @@ xfs_alloc_ag_vextent_lastblock(
  */
 STATIC int
 xfs_alloc_ag_vextent_near(
-	struct xfs_alloc_arg	*args)
+	struct xfs_alloc_arg	*args,
+	uint32_t		alloc_flags)
 {
 	struct xfs_alloc_cur	acur = {};
 	int			error;		/* error code */
@@ -1612,7 +1613,7 @@ xfs_alloc_ag_vextent_near(
 		if (acur.busy) {
 			trace_xfs_alloc_near_busy(args);
 			xfs_extent_busy_flush(args->mp, args->pag,
-					      acur.busy_gen);
+					      acur.busy_gen, alloc_flags);
 			goto restart;
 		}
 		trace_xfs_alloc_size_neither(args);
@@ -1635,21 +1636,22 @@ xfs_alloc_ag_vextent_near(
  * and of the form k * prod + mod unless there's nothing that large.
  * Return the starting a.g. block, or NULLAGBLOCK if we can't do it.
  */
-STATIC int				/* error */
+static int
 xfs_alloc_ag_vextent_size(
-	xfs_alloc_arg_t	*args)		/* allocation argument structure */
+	struct xfs_alloc_arg	*args,
+	uint32_t		alloc_flags)
 {
-	struct xfs_agf	*agf = args->agbp->b_addr;
-	struct xfs_btree_cur *bno_cur;	/* cursor for bno btree */
-	struct xfs_btree_cur *cnt_cur;	/* cursor for cnt btree */
-	int		error;		/* error result */
-	xfs_agblock_t	fbno;		/* start of found freespace */
-	xfs_extlen_t	flen;		/* length of found freespace */
-	int		i;		/* temp status variable */
-	xfs_agblock_t	rbno;		/* returned block number */
-	xfs_extlen_t	rlen;		/* length of returned extent */
-	bool		busy;
-	unsigned	busy_gen;
+	struct xfs_agf		*agf = args->agbp->b_addr;
+	struct xfs_btree_cur	*bno_cur;
+	struct xfs_btree_cur	*cnt_cur;
+	xfs_agblock_t		fbno;		/* start of found freespace */
+	xfs_extlen_t		flen;		/* length of found freespace */
+	xfs_agblock_t		rbno;		/* returned block number */
+	xfs_extlen_t		rlen;		/* length of returned extent */
+	bool			busy;
+	unsigned		busy_gen;
+	int			error;
+	int			i;
 
 restart:
 	/*
@@ -1717,8 +1719,8 @@ xfs_alloc_ag_vextent_size(
 				xfs_btree_del_cursor(cnt_cur,
 						     XFS_BTREE_NOERROR);
 				trace_xfs_alloc_size_busy(args);
-				xfs_extent_busy_flush(args->mp,
-							args->pag, busy_gen);
+				xfs_extent_busy_flush(args->mp, args->pag,
+						busy_gen, alloc_flags);
 				goto restart;
 			}
 		}
@@ -1802,7 +1804,8 @@ xfs_alloc_ag_vextent_size(
 		if (busy) {
 			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
 			trace_xfs_alloc_size_busy(args);
-			xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
+			xfs_extent_busy_flush(args->mp, args->pag, busy_gen,
+					alloc_flags);
 			goto restart;
 		}
 		goto out_nominleft;
@@ -2572,7 +2575,7 @@ xfs_exact_minlen_extent_available(
 int			/* error */
 xfs_alloc_fix_freelist(
 	struct xfs_alloc_arg	*args,	/* allocation argument structure */
-	int			flags)	/* XFS_ALLOC_FLAG_... */
+	uint32_t		alloc_flags)
 {
 	struct xfs_mount	*mp = args->mp;
 	struct xfs_perag	*pag = args->pag;
@@ -2588,7 +2591,7 @@ xfs_alloc_fix_freelist(
 	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
 
 	if (!xfs_perag_initialised_agf(pag)) {
-		error = xfs_alloc_read_agf(pag, tp, flags, &agbp);
+		error = xfs_alloc_read_agf(pag, tp, alloc_flags, &agbp);
 		if (error) {
 			/* Couldn't lock the AGF so skip this AG. */
 			if (error == -EAGAIN)
@@ -2604,13 +2607,13 @@ xfs_alloc_fix_freelist(
 	 */
 	if (xfs_perag_prefers_metadata(pag) &&
 	    (args->datatype & XFS_ALLOC_USERDATA) &&
-	    (flags & XFS_ALLOC_FLAG_TRYLOCK)) {
-		ASSERT(!(flags & XFS_ALLOC_FLAG_FREEING));
+	    (alloc_flags & XFS_ALLOC_FLAG_TRYLOCK)) {
+		ASSERT(!(alloc_flags & XFS_ALLOC_FLAG_FREEING));
 		goto out_agbp_relse;
 	}
 
 	need = xfs_alloc_min_freelist(mp, pag);
-	if (!xfs_alloc_space_available(args, need, flags |
+	if (!xfs_alloc_space_available(args, need, alloc_flags |
 			XFS_ALLOC_FLAG_CHECK))
 		goto out_agbp_relse;
 
@@ -2619,7 +2622,7 @@ xfs_alloc_fix_freelist(
 	 * Can fail if we're not blocking on locks, and it's held.
 	 */
 	if (!agbp) {
-		error = xfs_alloc_read_agf(pag, tp, flags, &agbp);
+		error = xfs_alloc_read_agf(pag, tp, alloc_flags, &agbp);
 		if (error) {
 			/* Couldn't lock the AGF so skip this AG. */
 			if (error == -EAGAIN)
@@ -2634,7 +2637,7 @@ xfs_alloc_fix_freelist(
 
 	/* If there isn't enough total space or single-extent, reject it. */
 	need = xfs_alloc_min_freelist(mp, pag);
-	if (!xfs_alloc_space_available(args, need, flags))
+	if (!xfs_alloc_space_available(args, need, alloc_flags))
 		goto out_agbp_relse;
 
 #ifdef DEBUG
@@ -2672,11 +2675,12 @@ xfs_alloc_fix_freelist(
 	 */
 	memset(&targs, 0, sizeof(targs));
 	/* struct copy below */
-	if (flags & XFS_ALLOC_FLAG_NORMAP)
+	if (alloc_flags & XFS_ALLOC_FLAG_NORMAP)
 		targs.oinfo = XFS_RMAP_OINFO_SKIP_UPDATE;
 	else
 		targs.oinfo = XFS_RMAP_OINFO_AG;
-	while (!(flags & XFS_ALLOC_FLAG_NOSHRINK) && pag->pagf_flcount > need) {
+	while (!(alloc_flags & XFS_ALLOC_FLAG_NOSHRINK) &&
+			pag->pagf_flcount > need) {
 		error = xfs_alloc_get_freelist(pag, tp, agbp, &bno, 0);
 		if (error)
 			goto out_agbp_relse;
@@ -2704,7 +2708,7 @@ xfs_alloc_fix_freelist(
 		targs.resv = XFS_AG_RESV_AGFL;
 
 		/* Allocate as many blocks as possible at once. */
-		error = xfs_alloc_ag_vextent_size(&targs);
+		error = xfs_alloc_ag_vextent_size(&targs, alloc_flags);
 		if (error)
 			goto out_agflbp_relse;
 
@@ -2714,7 +2718,7 @@ xfs_alloc_fix_freelist(
 		 * on a completely full ag.
 		 */
 		if (targs.agbno == NULLAGBLOCK) {
-			if (flags & XFS_ALLOC_FLAG_FREEING)
+			if (alloc_flags & XFS_ALLOC_FLAG_FREEING)
 				break;
 			goto out_agflbp_relse;
 		}
@@ -3230,7 +3234,7 @@ xfs_alloc_vextent_check_args(
 static int
 xfs_alloc_vextent_prepare_ag(
 	struct xfs_alloc_arg	*args,
-	uint32_t		flags)
+	uint32_t		alloc_flags)
 {
 	bool			need_pag = !args->pag;
 	int			error;
@@ -3239,7 +3243,7 @@ xfs_alloc_vextent_prepare_ag(
 		args->pag = xfs_perag_get(args->mp, args->agno);
 
 	args->agbp = NULL;
-	error = xfs_alloc_fix_freelist(args, flags);
+	error = xfs_alloc_fix_freelist(args, alloc_flags);
 	if (error) {
 		trace_xfs_alloc_vextent_nofix(args);
 		if (need_pag)
@@ -3361,6 +3365,7 @@ xfs_alloc_vextent_this_ag(
 {
 	struct xfs_mount	*mp = args->mp;
 	xfs_agnumber_t		minimum_agno;
+	uint32_t		alloc_flags = 0;
 	int			error;
 
 	ASSERT(args->pag != NULL);
@@ -3379,9 +3384,9 @@ xfs_alloc_vextent_this_ag(
 		return error;
 	}
 
-	error = xfs_alloc_vextent_prepare_ag(args, 0);
+	error = xfs_alloc_vextent_prepare_ag(args, alloc_flags);
 	if (!error && args->agbp)
-		error = xfs_alloc_ag_vextent_size(args);
+		error = xfs_alloc_ag_vextent_size(args, alloc_flags);
 
 	return xfs_alloc_vextent_finish(args, minimum_agno, error, false);
 }
@@ -3410,20 +3415,20 @@ xfs_alloc_vextent_iterate_ags(
 	xfs_agnumber_t		minimum_agno,
 	xfs_agnumber_t		start_agno,
 	xfs_agblock_t		target_agbno,
-	uint32_t		flags)
+	uint32_t		alloc_flags)
 {
 	struct xfs_mount	*mp = args->mp;
 	xfs_agnumber_t		restart_agno = minimum_agno;
 	xfs_agnumber_t		agno;
 	int			error = 0;
 
-	if (flags & XFS_ALLOC_FLAG_TRYLOCK)
+	if (alloc_flags & XFS_ALLOC_FLAG_TRYLOCK)
 		restart_agno = 0;
 restart:
 	for_each_perag_wrap_range(mp, start_agno, restart_agno,
 			mp->m_sb.sb_agcount, agno, args->pag) {
 		args->agno = agno;
-		error = xfs_alloc_vextent_prepare_ag(args, flags);
+		error = xfs_alloc_vextent_prepare_ag(args, alloc_flags);
 		if (error)
 			break;
 		if (!args->agbp) {
@@ -3437,10 +3442,10 @@ xfs_alloc_vextent_iterate_ags(
 		 */
 		if (args->agno == start_agno && target_agbno) {
 			args->agbno = target_agbno;
-			error = xfs_alloc_ag_vextent_near(args);
+			error = xfs_alloc_ag_vextent_near(args, alloc_flags);
 		} else {
 			args->agbno = 0;
-			error = xfs_alloc_ag_vextent_size(args);
+			error = xfs_alloc_ag_vextent_size(args, alloc_flags);
 		}
 		break;
 	}
@@ -3457,8 +3462,8 @@ xfs_alloc_vextent_iterate_ags(
 	 * constraining flags by the caller, drop them and retry the allocation
 	 * without any constraints being set.
 	 */
-	if (flags) {
-		flags = 0;
+	if (alloc_flags & XFS_ALLOC_FLAG_TRYLOCK) {
+		alloc_flags &= ~XFS_ALLOC_FLAG_TRYLOCK;
 		restart_agno = minimum_agno;
 		goto restart;
 	}
@@ -3486,6 +3491,7 @@ xfs_alloc_vextent_start_ag(
 	xfs_agnumber_t		start_agno;
 	xfs_agnumber_t		rotorstep = xfs_rotorstep;
 	bool			bump_rotor = false;
+	uint32_t		alloc_flags = XFS_ALLOC_FLAG_TRYLOCK;
 	int			error;
 
 	ASSERT(args->pag == NULL);
@@ -3512,7 +3518,7 @@ xfs_alloc_vextent_start_ag(
 
 	start_agno = max(minimum_agno, XFS_FSB_TO_AGNO(mp, target));
 	error = xfs_alloc_vextent_iterate_ags(args, minimum_agno, start_agno,
-			XFS_FSB_TO_AGBNO(mp, target), XFS_ALLOC_FLAG_TRYLOCK);
+			XFS_FSB_TO_AGBNO(mp, target), alloc_flags);
 
 	if (bump_rotor) {
 		if (args->agno == start_agno)
@@ -3539,6 +3545,7 @@ xfs_alloc_vextent_first_ag(
 	struct xfs_mount	*mp = args->mp;
 	xfs_agnumber_t		minimum_agno;
 	xfs_agnumber_t		start_agno;
+	uint32_t		alloc_flags = XFS_ALLOC_FLAG_TRYLOCK;
 	int			error;
 
 	ASSERT(args->pag == NULL);
@@ -3557,7 +3564,7 @@ xfs_alloc_vextent_first_ag(
 
 	start_agno = max(minimum_agno, XFS_FSB_TO_AGNO(mp, target));
 	error = xfs_alloc_vextent_iterate_ags(args, minimum_agno, start_agno,
-			XFS_FSB_TO_AGBNO(mp, target), 0);
+			XFS_FSB_TO_AGBNO(mp, target), alloc_flags);
 	return xfs_alloc_vextent_finish(args, minimum_agno, error, true);
 }
 
@@ -3610,6 +3617,7 @@ xfs_alloc_vextent_near_bno(
 	struct xfs_mount	*mp = args->mp;
 	xfs_agnumber_t		minimum_agno;
 	bool			needs_perag = args->pag == NULL;
+	uint32_t		alloc_flags = 0;
 	int			error;
 
 	if (!needs_perag)
@@ -3630,9 +3638,9 @@ xfs_alloc_vextent_near_bno(
 	if (needs_perag)
 		args->pag = xfs_perag_grab(mp, args->agno);
 
-	error = xfs_alloc_vextent_prepare_ag(args, 0);
+	error = xfs_alloc_vextent_prepare_ag(args, alloc_flags);
 	if (!error && args->agbp)
-		error = xfs_alloc_ag_vextent_near(args);
+		error = xfs_alloc_ag_vextent_near(args, alloc_flags);
 
 	return xfs_alloc_vextent_finish(args, minimum_agno, error, needs_perag);
 }
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 121faf1e11ad..50119ebaede9 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -195,7 +195,7 @@ int xfs_alloc_read_agfl(struct xfs_perag *pag, struct xfs_trans *tp,
 		struct xfs_buf **bpp);
 int xfs_free_agfl_block(struct xfs_trans *, xfs_agnumber_t, xfs_agblock_t,
 			struct xfs_buf *, struct xfs_owner_info *);
-int xfs_alloc_fix_freelist(struct xfs_alloc_arg *args, int flags);
+int xfs_alloc_fix_freelist(struct xfs_alloc_arg *args, uint32_t alloc_flags);
 int xfs_free_extent_fix_freelist(struct xfs_trans *tp, struct xfs_perag *pag,
 		struct xfs_buf **agbp);
 
diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
index f3d328e4a440..5f44936eae1c 100644
--- a/fs/xfs/xfs_extent_busy.c
+++ b/fs/xfs/xfs_extent_busy.c
@@ -571,7 +571,8 @@ void
 xfs_extent_busy_flush(
 	struct xfs_mount	*mp,
 	struct xfs_perag	*pag,
-	unsigned		busy_gen)
+	unsigned		busy_gen,
+	uint32_t		alloc_flags)
 {
 	DEFINE_WAIT		(wait);
 	int			error;
diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
index 4a118131059f..7a82c689bfa4 100644
--- a/fs/xfs/xfs_extent_busy.h
+++ b/fs/xfs/xfs_extent_busy.h
@@ -53,7 +53,7 @@ xfs_extent_busy_trim(struct xfs_alloc_arg *args, xfs_agblock_t *bno,
 
 void
 xfs_extent_busy_flush(struct xfs_mount *mp, struct xfs_perag *pag,
-	unsigned busy_gen);
+	unsigned busy_gen, uint32_t alloc_flags);
 
 void
 xfs_extent_busy_wait_all(struct xfs_mount *mp);

From patchwork Tue Jun 27 22:44:08 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Dave Chinner <david@fromorbit.com>
X-Patchwork-Id: 13295069
Return-Path: <linux-xfs-owner@vger.kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id ACED0C001B3
	for <linux-xfs@archiver.kernel.org>; Tue, 27 Jun 2023 22:44:27 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S229985AbjF0Wo0 (ORCPT <rfc822;linux-xfs@archiver.kernel.org>);
        Tue, 27 Jun 2023 18:44:26 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60326 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S229482AbjF0WoY (ORCPT
        <rfc822;linux-xfs@vger.kernel.org>); Tue, 27 Jun 2023 18:44:24 -0400
Received: from mail-pf1-x42c.google.com (mail-pf1-x42c.google.com
 [IPv6:2607:f8b0:4864:20::42c])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EC633211E
        for <linux-xfs@vger.kernel.org>; Tue, 27 Jun 2023 15:44:19 -0700 (PDT)
Received: by mail-pf1-x42c.google.com with SMTP id
 d2e1a72fcca58-6687096c6ddso2791866b3a.0
        for <linux-xfs@vger.kernel.org>; Tue, 27 Jun 2023 15:44:19 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=fromorbit-com.20221208.gappssmtp.com; s=20221208; t=1687905859;
 x=1690497859;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:to:from:from:to:cc:subject:date:message-id
         :reply-to;
        bh=nkNsZl7vYsW79WSqtxpYRKvhd3dwbJ/b3snusgn1D54=;
        b=Xx8R0bsI8bvoYq/tKAujz1eJiT58NNO1T6yylfJj2R7a9nWdc1dooEN77u468T+DmS
         1tSYDR3hOsHSuR0i+RIAvUY3gTwMP3J/Ha53TqeSpv6r05zTScVQNKzuR7x8+NKXTrNK
         rt+2ZwzZh0WdGLqUCamj+v4Lhp7uimB+jNUXqTh8Et3aOUHgteoI7VkGuLadmcggkPJJ
         FpcVcKw4TiaIwGRjkGy5SvgDmDgt5qOAS6Cr985w6MChJbiFBhJCsepC4Omn7yY0jOaI
         ylC9DCFQoDNxoJcpp02SeLUIr0UwY2mGd+9y2pmbCy4hls8PDSNW6ubve0DIH5ZnLblU
         yKLw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20221208; t=1687905859; x=1690497859;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:to:from:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=nkNsZl7vYsW79WSqtxpYRKvhd3dwbJ/b3snusgn1D54=;
        b=IhiOKKmn6QhgcJrGfurO9+oOJ9nvoUwAQxXJRf1QYxzuRxxSyEu360YgzUaI0cm7nU
         0atEiGIK37Zy3f7waffHE8NfbIWimJA21x5xagoD61T0vvyVM5tyrZqgvTZXMayqSr6J
         t1WRwmASulMsCQv5xPU7KgcDZnPN0GK7xzHR3A257lypkNSmYa+nmi94IocnAjH+bnHV
         AAij1mAFiYKoBxtRJ2vy7AG9FPgzQheFiq2NwaJoUr8rFBM6m1A1ZGxdlvcS8AqMqzzi
         IcgJJ+yN/v1i6sb+LB65ReEjaUEcYZnsVxABAzcvfVGmkVI24+wBeYAiU2ozJYON7fYF
         N8kQ==
X-Gm-Message-State: AC+VfDwO4asSYBCFByfbzTlwBIClFsWmWWF4+gC/fZ4TY6quuxm+CeS7
        kIsIQ7IJsMYy3E9j0f/z+Y/i0k+33wx8NfGsY2I=
X-Google-Smtp-Source: 
 ACHHUZ5JIsisapbs8/L+mOS3fJ+LSNbgWMZmAnFkv6n76rHWvajXVsnqaHikNEOWjH6oSyxvsRxQow==
X-Received: by 2002:a05:6a20:7349:b0:10e:e218:d3f0 with SMTP id
 v9-20020a056a20734900b0010ee218d3f0mr25799779pzc.54.1687905859304;
        Tue, 27 Jun 2023 15:44:19 -0700 (PDT)
Received: from dread.disaster.area (pa49-186-94-37.pa.vic.optusnet.com.au.
 [49.186.94.37])
        by smtp.gmail.com with ESMTPSA id
 x21-20020aa793b5000000b00673e652985esm4448967pff.44.2023.06.27.15.44.17
        for <linux-xfs@vger.kernel.org>
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Tue, 27 Jun 2023 15:44:17 -0700 (PDT)
Received: from [192.168.253.23] (helo=devoid.disaster.area)
        by dread.disaster.area with esmtp (Exim 4.96)
        (envelope-from <dave@fromorbit.com>)
        id 1qEHPz-00Gzwb-1R
        for linux-xfs@vger.kernel.org;
        Wed, 28 Jun 2023 08:44:15 +1000
Received: from dave by devoid.disaster.area with local (Exim 4.96)
        (envelope-from <dave@devoid.disaster.area>)
        id 1qEHPz-009ZmL-0L
        for linux-xfs@vger.kernel.org;
        Wed, 28 Jun 2023 08:44:15 +1000
From: Dave Chinner <david@fromorbit.com>
To: linux-xfs@vger.kernel.org
Subject: [PATCH 4/8] xfs: allow extent free intents to be retried
Date: Wed, 28 Jun 2023 08:44:08 +1000
Message-Id: <20230627224412.2242198-5-david@fromorbit.com>
X-Mailer: git-send-email 2.40.1
In-Reply-To: <20230627224412.2242198-1-david@fromorbit.com>
References: <20230627224412.2242198-1-david@fromorbit.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <linux-xfs.vger.kernel.org>
X-Mailing-List: linux-xfs@vger.kernel.org

From: Dave Chinner <dchinner@redhat.com>

Extent freeing neeeds to be able to avoid a busy extent deadlock
when the transaction itself holds the only busy extents in the
allocation group. This may occur if we have an EFI that contains
multiple extents to be freed, and the freeing the second intent
requires the space the first extent free released to expand the
AGFL. If we block on the busy extent at this point, we deadlock.

We hold a dirty transaction that contains a entire atomic extent
free operations within it, so if we can abort the extent free
operation and commit the progress that we've made, the busy extent
can be resolved by a log force. Hence we can restart the aborted
extent free with a new transaction and continue to make
progress without risking deadlocks.

To enable this, we need the EFI processing code to be able to handle
an -EAGAIN error to tell it to commit the current transaction and
retry again. This mechanism is already built into the defer ops
processing (used bythe refcount btree modification intents), so
there's relatively little handling we need to add to the EFI code to
enable this.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
---
 fs/xfs/xfs_extfree_item.c | 73 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 70 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index 79e65bb6b0a2..098420cbd4a0 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -336,6 +336,34 @@ xfs_trans_get_efd(
 	return efdp;
 }
 
+/*
+ * Fill the EFD with all extents from the EFI when we need to roll the
+ * transaction and continue with a new EFI.
+ *
+ * This simply copies all the extents in the EFI to the EFD rather than make
+ * assumptions about which extents in the EFI have already been processed. We
+ * currently keep the xefi list in the same order as the EFI extent list, but
+ * that may not always be the case. Copying everything avoids leaving a landmine
+ * were we fail to cancel all the extents in an EFI if the xefi list is
+ * processed in a different order to the extents in the EFI.
+ */
+static void
+xfs_efd_from_efi(
+	struct xfs_efd_log_item	*efdp)
+{
+	struct xfs_efi_log_item *efip = efdp->efd_efip;
+	uint                    i;
+
+	ASSERT(efip->efi_format.efi_nextents > 0);
+	ASSERT(efdp->efd_next_extent < efip->efi_format.efi_nextents);
+
+	for (i = 0; i < efip->efi_format.efi_nextents; i++) {
+	       efdp->efd_format.efd_extents[i] =
+		       efip->efi_format.efi_extents[i];
+	}
+	efdp->efd_next_extent = efip->efi_format.efi_nextents;
+}
+
 /*
  * Free an extent and log it to the EFD. Note that the transaction is marked
  * dirty regardless of whether the extent free succeeds or fails to support the
@@ -378,6 +406,17 @@ xfs_trans_free_extent(
 	tp->t_flags |= XFS_TRANS_DIRTY | XFS_TRANS_HAS_INTENT_DONE;
 	set_bit(XFS_LI_DIRTY, &efdp->efd_item.li_flags);
 
+	/*
+	 * If we need a new transaction to make progress, the caller will log a
+	 * new EFI with the current contents. It will also log an EFD to cancel
+	 * the existing EFI, and so we need to copy all the unprocessed extents
+	 * in this EFI to the EFD so this works correctly.
+	 */
+	if (error == -EAGAIN) {
+		xfs_efd_from_efi(efdp);
+		return error;
+	}
+
 	next_extent = efdp->efd_next_extent;
 	ASSERT(next_extent < efdp->efd_format.efd_nextents);
 	extp = &(efdp->efd_format.efd_extents[next_extent]);
@@ -495,6 +534,13 @@ xfs_extent_free_finish_item(
 
 	error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi);
 
+	/*
+	 * Don't free the XEFI if we need a new transaction to complete
+	 * processing of it.
+	 */
+	if (error == -EAGAIN)
+		return error;
+
 	xfs_extent_free_put_group(xefi);
 	kmem_cache_free(xfs_extfree_item_cache, xefi);
 	return error;
@@ -620,6 +666,7 @@ xfs_efi_item_recover(
 	struct xfs_trans		*tp;
 	int				i;
 	int				error = 0;
+	bool				requeue_only = false;
 
 	/*
 	 * First check the validity of the extents described by the
@@ -653,9 +700,29 @@ xfs_efi_item_recover(
 		fake.xefi_startblock = extp->ext_start;
 		fake.xefi_blockcount = extp->ext_len;
 
-		xfs_extent_free_get_group(mp, &fake);
-		error = xfs_trans_free_extent(tp, efdp, &fake);
-		xfs_extent_free_put_group(&fake);
+		if (!requeue_only) {
+			xfs_extent_free_get_group(mp, &fake);
+			error = xfs_trans_free_extent(tp, efdp, &fake);
+			xfs_extent_free_put_group(&fake);
+		}
+
+		/*
+		 * If we can't free the extent without potentially deadlocking,
+		 * requeue the rest of the extents to a new so that they get
+		 * run again later with a new transaction context.
+		 */
+		if (error == -EAGAIN || requeue_only) {
+			error = xfs_free_extent_later(tp, fake.xefi_startblock,
+					fake.xefi_blockcount,
+					&XFS_RMAP_OINFO_ANY_OWNER,
+					fake.xefi_type);
+			if (!error) {
+				requeue_only = true;
+				error = 0;
+				continue;
+			}
+		};
+
 		if (error == -EFSCORRUPTED)
 			XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
 					extp, sizeof(*extp));

From patchwork Tue Jun 27 22:44:09 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Dave Chinner <david@fromorbit.com>
X-Patchwork-Id: 13295070
Return-Path: <linux-xfs-owner@vger.kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 6E279C001DD
	for <linux-xfs@archiver.kernel.org>; Tue, 27 Jun 2023 22:44:28 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S229482AbjF0Wo1 (ORCPT <rfc822;linux-xfs@archiver.kernel.org>);
        Tue, 27 Jun 2023 18:44:27 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60336 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S229958AbjF0WoZ (ORCPT
        <rfc822;linux-xfs@vger.kernel.org>); Tue, 27 Jun 2023 18:44:25 -0400
Received: from mail-pf1-x42c.google.com (mail-pf1-x42c.google.com
 [IPv6:2607:f8b0:4864:20::42c])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 54B6E26A8
        for <linux-xfs@vger.kernel.org>; Tue, 27 Jun 2023 15:44:20 -0700 (PDT)
Received: by mail-pf1-x42c.google.com with SMTP id
 d2e1a72fcca58-66869feb7d1so2976537b3a.3
        for <linux-xfs@vger.kernel.org>; Tue, 27 Jun 2023 15:44:20 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=fromorbit-com.20221208.gappssmtp.com; s=20221208; t=1687905860;
 x=1690497860;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:to:from:from:to:cc:subject:date:message-id
         :reply-to;
        bh=ZNZF8d+9xx3Te2bu0f2qE+wgSJ+bidn/R8m8MNQ1Va4=;
        b=0P1PKBs43cbbu+ws5orKebuzlAEBQqx5HPNfjheJ0ScKXfBTifQ2zq5Hl8IOtNf81V
         t+bsm6GbAisF+6/wggED4BotZ70s1Wa7W/2wL4NQXlmgc26H81cYOSSN6C2dAyC0DS7a
         02RHd8Fh1N6qKL10DgUnoHAPLzz7vlf8lVh0kZGS2YCod1pmhq3LBVQGd7HJ/JNbyUmo
         yv4+L1C2UihlXgYstEZGkjfF4TWbMqnmKeRS/51L70awyda/Iuc7Ven3K6acYJ7IEJxx
         WJ4VxXFdDjnSmwpS9MdBoQ43kv08cbNuq4Bs/8PKxqDJPUzmFWEmhOjKS+slhSGmzBh4
         b+9g==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20221208; t=1687905860; x=1690497860;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:to:from:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=ZNZF8d+9xx3Te2bu0f2qE+wgSJ+bidn/R8m8MNQ1Va4=;
        b=PPYhQc4lmgbVo2Xl2qaavPJwekvdqzO6ol9+Nn2XVynTf938vA3g+OmBI5XCCJ1g5U
         bBnLkGejtFeoLqgiClpNrFg3D88OfVWWAnTQrLaXU8sjgtPyj9vltKmcDmV5Smn2XUwj
         MXoh6yMcd4/nyKOr4a5Y9V6FKCTHgdwxCA4vRgF7nQ98ke6AArFXmBLq1Yxy62leO4DL
         mGbJEWpU9O04AFCTTs/weTkN2rCvbqudub6glyqrM0LQ9qmulLzXZtHesVSZuz3x+02s
         Dz6tgesO1E6mLEBu2tEj7rcHHKEFIBVwCevWg0EzNSKvSTs1k7B85SubJfJW382q8GOz
         HegA==
X-Gm-Message-State: AC+VfDzLzew6TTYmplcU4WyG6faIqxrara3umCP7tx35NVmAzW6ZWdit
        s6ihje/vJ6YsJUPsKdwNiWm1aBEIZXKbNhWf4XU=
X-Google-Smtp-Source: 
 ACHHUZ7bjhKg4MChEVBZeCN571yh+pQufO190ckxCGSv8jTpHreAHyx5ODh4TXTbReLF3llS6OAxOQ==
X-Received: by 2002:a05:6a21:900c:b0:123:2973:672c with SMTP id
 tq12-20020a056a21900c00b001232973672cmr13340032pzb.57.1687905859691;
        Tue, 27 Jun 2023 15:44:19 -0700 (PDT)
Received: from dread.disaster.area (pa49-186-94-37.pa.vic.optusnet.com.au.
 [49.186.94.37])
        by smtp.gmail.com with ESMTPSA id
 g23-20020a62e317000000b00679b7d2bd57sm3226647pfh.192.2023.06.27.15.44.17
        for <linux-xfs@vger.kernel.org>
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Tue, 27 Jun 2023 15:44:18 -0700 (PDT)
Received: from [192.168.253.23] (helo=devoid.disaster.area)
        by dread.disaster.area with esmtp (Exim 4.96)
        (envelope-from <dave@fromorbit.com>)
        id 1qEHPz-00Gzwe-1f
        for linux-xfs@vger.kernel.org;
        Wed, 28 Jun 2023 08:44:15 +1000
Received: from dave by devoid.disaster.area with local (Exim 4.96)
        (envelope-from <dave@devoid.disaster.area>)
        id 1qEHPz-009ZmQ-0V
        for linux-xfs@vger.kernel.org;
        Wed, 28 Jun 2023 08:44:15 +1000
From: Dave Chinner <david@fromorbit.com>
To: linux-xfs@vger.kernel.org
Subject: [PATCH 5/8] xfs: don't block in busy flushing when freeing extents
Date: Wed, 28 Jun 2023 08:44:09 +1000
Message-Id: <20230627224412.2242198-6-david@fromorbit.com>
X-Mailer: git-send-email 2.40.1
In-Reply-To: <20230627224412.2242198-1-david@fromorbit.com>
References: <20230627224412.2242198-1-david@fromorbit.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <linux-xfs.vger.kernel.org>
X-Mailing-List: linux-xfs@vger.kernel.org

From: Dave Chinner <dchinner@redhat.com>

If the current transaction holds a busy extent and we are trying to
allocate a new extent to fix up the free list, we can deadlock if
the AG is entirely empty except for the busy extent held by the
transaction.

This can occur at runtime processing an XEFI with multiple extents
in this path:

__schedule+0x22f at ffffffff81f75e8f
schedule+0x46 at ffffffff81f76366
xfs_extent_busy_flush+0x69 at ffffffff81477d99
xfs_alloc_ag_vextent_size+0x16a at ffffffff8141711a
xfs_alloc_ag_vextent+0x19b at ffffffff81417edb
xfs_alloc_fix_freelist+0x22f at ffffffff8141896f
xfs_free_extent_fix_freelist+0x6a at ffffffff8141939a
__xfs_free_extent+0x99 at ffffffff81419499
xfs_trans_free_extent+0x3e at ffffffff814a6fee
xfs_extent_free_finish_item+0x24 at ffffffff814a70d4
xfs_defer_finish_noroll+0x1f7 at ffffffff81441407
xfs_defer_finish+0x11 at ffffffff814417e1
xfs_itruncate_extents_flags+0x13d at ffffffff8148b7dd
xfs_inactive_truncate+0xb9 at ffffffff8148bb89
xfs_inactive+0x227 at ffffffff8148c4f7
xfs_fs_destroy_inode+0xb8 at ffffffff81496898
destroy_inode+0x3b at ffffffff8127d2ab
do_unlinkat+0x1d1 at ffffffff81270df1
do_syscall_64+0x40 at ffffffff81f6b5f0
entry_SYSCALL_64_after_hwframe+0x44 at ffffffff8200007c

This can also happen in log recovery when processing an EFI
with multiple extents through this path:

context_switch() kernel/sched/core.c:3881
__schedule() kernel/sched/core.c:5111
schedule() kernel/sched/core.c:5186
xfs_extent_busy_flush() fs/xfs/xfs_extent_busy.c:598
xfs_alloc_ag_vextent_size() fs/xfs/libxfs/xfs_alloc.c:1641
xfs_alloc_ag_vextent() fs/xfs/libxfs/xfs_alloc.c:828
xfs_alloc_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:2362
xfs_free_extent_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:3029
__xfs_free_extent() fs/xfs/libxfs/xfs_alloc.c:3067
xfs_trans_free_extent() fs/xfs/xfs_extfree_item.c:370
xfs_efi_recover() fs/xfs/xfs_extfree_item.c:626
xlog_recover_process_efi() fs/xfs/xfs_log_recover.c:4605
xlog_recover_process_intents() fs/xfs/xfs_log_recover.c:4893
xlog_recover_finish() fs/xfs/xfs_log_recover.c:5824
xfs_log_mount_finish() fs/xfs/xfs_log.c:764
xfs_mountfs() fs/xfs/xfs_mount.c:978
xfs_fs_fill_super() fs/xfs/xfs_super.c:1908
mount_bdev() fs/super.c:1417
xfs_fs_mount() fs/xfs/xfs_super.c:1985
legacy_get_tree() fs/fs_context.c:647
vfs_get_tree() fs/super.c:1547
do_new_mount() fs/namespace.c:2843
do_mount() fs/namespace.c:3163
ksys_mount() fs/namespace.c:3372
__do_sys_mount() fs/namespace.c:3386
__se_sys_mount() fs/namespace.c:3383
__x64_sys_mount() fs/namespace.c:3383
do_syscall_64() arch/x86/entry/common.c:296
entry_SYSCALL_64() arch/x86/entry/entry_64.S:180

To avoid this deadlock, we should not block in
xfs_extent_busy_flush() if we hold a busy extent in the current
transaction.

Now that the EFI processing code can handle requeuing a partially
completed EFI, we can detect this situation in
xfs_extent_busy_flush() and return -EAGAIN rather than going to
sleep forever. The -EAGAIN get propagated back out to the
xfs_trans_free_extent() context, where the EFD is populated and the
transaction is rolled, thereby moving the busy extents into the CIL.

At this point, we can retry the extent free operation again with a
clean transaction. If we hit the same "all free extents are busy"
situation when trying to fix up the free list, we can safely call
xfs_extent_busy_flush() and wait for the busy extents to resolve
and wake us. At this point, the allocation search can make progress
again and we can fix up the free list.

This deadlock was first reported by Chandan in mid-2021, but I
couldn't make myself understood during review, and didn't have time
to fix it myself.

It was reported again in March 2023, and again I have found myself
unable to explain the complexities of the solution needed during
review.

As such, I don't have hours more time to waste trying to get the
fix written the way it needs to be written, so I'm just doing it
myself. This patchset is largely based on Wengang Wang's last patch,
but with all the unnecessary stuff removed, split up into multiple
patches and cleaned up somewhat.

Reported-by: Chandan Babu R <chandanrlinux@gmail.com>
Reported-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc.c | 68 ++++++++++++++++++++++++++++-----------
 fs/xfs/libxfs/xfs_alloc.h | 11 ++++---
 fs/xfs/xfs_extent_busy.c  | 33 ++++++++++++++++---
 fs/xfs/xfs_extent_busy.h  |  6 ++--
 4 files changed, 88 insertions(+), 30 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index ba929f6b5c9b..1e72b91daff6 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -1556,6 +1556,8 @@ xfs_alloc_ag_vextent_near(
 	if (args->agbno > args->max_agbno)
 		args->agbno = args->max_agbno;
 
+	/* Retry once quickly if we find busy extents before blocking. */
+	alloc_flags |= XFS_ALLOC_FLAG_TRYFLUSH;
 restart:
 	len = 0;
 
@@ -1611,9 +1613,20 @@ xfs_alloc_ag_vextent_near(
 	 */
 	if (!acur.len) {
 		if (acur.busy) {
+			/*
+			 * Our only valid extents must have been busy. Flush and
+			 * retry the allocation again. If we get an -EAGAIN
+			 * error, we're being told that a deadlock was avoided
+			 * and the current transaction needs committing before
+			 * the allocation can be retried.
+			 */
 			trace_xfs_alloc_near_busy(args);
-			xfs_extent_busy_flush(args->mp, args->pag,
-					      acur.busy_gen, alloc_flags);
+			error = xfs_extent_busy_flush(args->tp, args->pag,
+					acur.busy_gen, alloc_flags);
+			if (error)
+				goto out;
+
+			alloc_flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
 			goto restart;
 		}
 		trace_xfs_alloc_size_neither(args);
@@ -1653,6 +1666,8 @@ xfs_alloc_ag_vextent_size(
 	int			error;
 	int			i;
 
+	/* Retry once quickly if we find busy extents before blocking. */
+	alloc_flags |= XFS_ALLOC_FLAG_TRYFLUSH;
 restart:
 	/*
 	 * Allocate and initialize a cursor for the by-size btree.
@@ -1710,19 +1725,25 @@ xfs_alloc_ag_vextent_size(
 			error = xfs_btree_increment(cnt_cur, 0, &i);
 			if (error)
 				goto error0;
-			if (i == 0) {
-				/*
-				 * Our only valid extents must have been busy.
-				 * Make it unbusy by forcing the log out and
-				 * retrying.
-				 */
-				xfs_btree_del_cursor(cnt_cur,
-						     XFS_BTREE_NOERROR);
-				trace_xfs_alloc_size_busy(args);
-				xfs_extent_busy_flush(args->mp, args->pag,
-						busy_gen, alloc_flags);
-				goto restart;
-			}
+			if (i)
+				continue;
+
+			/*
+			 * Our only valid extents must have been busy. Flush and
+			 * retry the allocation again. If we get an -EAGAIN
+			 * error, we're being told that a deadlock was avoided
+			 * and the current transaction needs committing before
+			 * the allocation can be retried.
+			 */
+			trace_xfs_alloc_size_busy(args);
+			error = xfs_extent_busy_flush(args->tp, args->pag,
+					busy_gen, alloc_flags);
+			if (error)
+				goto error0;
+
+			alloc_flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
+			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
+			goto restart;
 		}
 	}
 
@@ -1802,10 +1823,21 @@ xfs_alloc_ag_vextent_size(
 	args->len = rlen;
 	if (rlen < args->minlen) {
 		if (busy) {
-			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
+			/*
+			 * Our only valid extents must have been busy. Flush and
+			 * retry the allocation again. If we get an -EAGAIN
+			 * error, we're being told that a deadlock was avoided
+			 * and the current transaction needs committing before
+			 * the allocation can be retried.
+			 */
 			trace_xfs_alloc_size_busy(args);
-			xfs_extent_busy_flush(args->mp, args->pag, busy_gen,
-					alloc_flags);
+			error = xfs_extent_busy_flush(args->tp, args->pag,
+					busy_gen, alloc_flags);
+			if (error)
+				goto error0;
+
+			alloc_flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
+			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
 			goto restart;
 		}
 		goto out_nominleft;
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 50119ebaede9..29e56fdf25e4 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -19,11 +19,12 @@ unsigned int xfs_agfl_size(struct xfs_mount *mp);
 /*
  * Flags for xfs_alloc_fix_freelist.
  */
-#define	XFS_ALLOC_FLAG_TRYLOCK	0x00000001  /* use trylock for buffer locking */
-#define	XFS_ALLOC_FLAG_FREEING	0x00000002  /* indicate caller is freeing extents*/
-#define	XFS_ALLOC_FLAG_NORMAP	0x00000004  /* don't modify the rmapbt */
-#define	XFS_ALLOC_FLAG_NOSHRINK	0x00000008  /* don't shrink the freelist */
-#define	XFS_ALLOC_FLAG_CHECK	0x00000010  /* test only, don't modify args */
+#define	XFS_ALLOC_FLAG_TRYLOCK	(1U << 0)  /* use trylock for buffer locking */
+#define	XFS_ALLOC_FLAG_FREEING	(1U << 1)  /* indicate caller is freeing extents*/
+#define	XFS_ALLOC_FLAG_NORMAP	(1U << 2)  /* don't modify the rmapbt */
+#define	XFS_ALLOC_FLAG_NOSHRINK	(1U << 3)  /* don't shrink the freelist */
+#define	XFS_ALLOC_FLAG_CHECK	(1U << 4)  /* test only, don't modify args */
+#define	XFS_ALLOC_FLAG_TRYFLUSH	(1U << 5)  /* don't wait in busy extent flush */
 
 /*
  * Argument structure for xfs_alloc routines.
diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
index 5f44936eae1c..7c2fdc71e42d 100644
--- a/fs/xfs/xfs_extent_busy.c
+++ b/fs/xfs/xfs_extent_busy.c
@@ -566,10 +566,21 @@ xfs_extent_busy_clear(
 
 /*
  * Flush out all busy extents for this AG.
+ *
+ * If the current transaction is holding busy extents, the caller may not want
+ * to wait for committed busy extents to resolve. If we are being told just to
+ * try a flush or progress has been made since we last skipped a busy extent,
+ * return immediately to allow the caller to try again.
+ *
+ * If we are freeing extents, we might actually be holding the only free extents
+ * in the transaction busy list and the log force won't resolve that situation.
+ * In this case, we must return -EAGAIN to avoid a deadlock by informing the
+ * caller it needs to commit the busy extents it holds before retrying the
+ * extent free operation.
  */
-void
+int
 xfs_extent_busy_flush(
-	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
 	struct xfs_perag	*pag,
 	unsigned		busy_gen,
 	uint32_t		alloc_flags)
@@ -577,10 +588,23 @@ xfs_extent_busy_flush(
 	DEFINE_WAIT		(wait);
 	int			error;
 
-	error = xfs_log_force(mp, XFS_LOG_SYNC);
+	error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
 	if (error)
-		return;
+		return error;
 
+	/* Avoid deadlocks on uncommitted busy extents. */
+	if (!list_empty(&tp->t_busy)) {
+		if (alloc_flags & XFS_ALLOC_FLAG_TRYFLUSH)
+			return 0;
+
+		if (busy_gen != READ_ONCE(pag->pagb_gen))
+			return 0;
+
+		if (alloc_flags & XFS_ALLOC_FLAG_FREEING)
+			return -EAGAIN;
+	}
+
+	/* Wait for committed busy extents to resolve. */
 	do {
 		prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE);
 		if  (busy_gen != READ_ONCE(pag->pagb_gen))
@@ -589,6 +613,7 @@ xfs_extent_busy_flush(
 	} while (1);
 
 	finish_wait(&pag->pagb_wait, &wait);
+	return 0;
 }
 
 void
diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
index 7a82c689bfa4..c37bf87e6781 100644
--- a/fs/xfs/xfs_extent_busy.h
+++ b/fs/xfs/xfs_extent_busy.h
@@ -51,9 +51,9 @@ bool
 xfs_extent_busy_trim(struct xfs_alloc_arg *args, xfs_agblock_t *bno,
 		xfs_extlen_t *len, unsigned *busy_gen);
 
-void
-xfs_extent_busy_flush(struct xfs_mount *mp, struct xfs_perag *pag,
-	unsigned busy_gen, uint32_t alloc_flags);
+int
+xfs_extent_busy_flush(struct xfs_trans *tp, struct xfs_perag *pag,
+		unsigned busy_gen, uint32_t alloc_flags);
 
 void
 xfs_extent_busy_wait_all(struct xfs_mount *mp);

From patchwork Tue Jun 27 22:44:10 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Dave Chinner <david@fromorbit.com>
X-Patchwork-Id: 13295075
Return-Path: <linux-xfs-owner@vger.kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id E1C8AC001B0
	for <linux-xfs@archiver.kernel.org>; Tue, 27 Jun 2023 22:44:32 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S229948AbjF0Wob (ORCPT <rfc822;linux-xfs@archiver.kernel.org>);
        Tue, 27 Jun 2023 18:44:31 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60384 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S229982AbjF0Wo2 (ORCPT
        <rfc822;linux-xfs@vger.kernel.org>); Tue, 27 Jun 2023 18:44:28 -0400
Received: from mail-pf1-x42f.google.com (mail-pf1-x42f.google.com
 [IPv6:2607:f8b0:4864:20::42f])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C44302947
        for <linux-xfs@vger.kernel.org>; Tue, 27 Jun 2023 15:44:22 -0700 (PDT)
Received: by mail-pf1-x42f.google.com with SMTP id
 d2e1a72fcca58-668730696a4so2754812b3a.1
        for <linux-xfs@vger.kernel.org>; Tue, 27 Jun 2023 15:44:22 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=fromorbit-com.20221208.gappssmtp.com; s=20221208; t=1687905862;
 x=1690497862;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:to:from:from:to:cc:subject:date:message-id
         :reply-to;
        bh=4O2k46lv/gI7xyqpvXtBgdwqM0g/QagZozuLf4U9K4c=;
        b=cKgdrZ43YU4na7peEhU2pWxrT3ycVrcStWKlDD3d2SXyyHQ7Gjh4tFS42r8SJJiH/6
         /zUvTh93MiFUtvP/TE9L811XX2NhSCzEH1QuSQcW6uJy/ovkdhFpHhnNnZpcxxQapB7g
         nX4xeBx+GTklHAdtT6qGawUEtlumdu3EFexBoXxeveA1Q29hzIPGCs51wDWz6hLQZKTv
         LsUg13Q4jpjrOOBbPTaIWTOWu8DfBrb629f2GL/PMcBHZ2kcoN0YU0Rm4OqoNULWeLzb
         dtpSPFqCDE9m89SGLt6h2ptDUkOXkYCwHSNI/mPe4eu8ECPrd5NdXithj+aAJUCEwZVh
         mcbQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20221208; t=1687905862; x=1690497862;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:to:from:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=4O2k46lv/gI7xyqpvXtBgdwqM0g/QagZozuLf4U9K4c=;
        b=C94+GuXpolvh+I1JHhgM5QYneJbs9P9DJCyRNpZ35wqytRTIndtK2ams9aW2OLsa4w
         r5uSSxSIysWqhubjtmULJeB1kDDj70s0SvGZdRJiRcbRDvpSeFAfaBiZDEPg9IOew1UK
         6NnuhY8gQ7ekT/sAc7tNKCqfHsK2S56nuxqP+Y0TQqAdYzJSqvS7vuHjqp8RRQdCxdAU
         8sCprtU8qiLcnHLeY4NnVD1I89kXftuUuPi5+l6w4TpFT3uONuVwDgZYjodOk1AZrWhu
         JEhCUbMmrJeyHoBmQ37JLOsxnD46jHcZY7FySumPI3m15nDKlv5Cknlp/V0B5CB5jR72
         I20A==
X-Gm-Message-State: AC+VfDwTH+Twc7AQberqwn65hlUSgxypFgabSy/hE4ZZOUAXIAIyLeAy
        cNfQ8sexSeHUMQ2aB5ZBLM30DCfWr7joIcWFjrk=
X-Google-Smtp-Source: 
 ACHHUZ67g6fUTtjziPJRgLpZIRQA/oKWq6w5y0mzV755YfeWH4uh55MmcDS+M5kxUbWXnjw+4Rwd2w==
X-Received: by 2002:a05:6a00:3919:b0:643:aa8d:8cd7 with SMTP id
 fh25-20020a056a00391900b00643aa8d8cd7mr24682916pfb.32.1687905861960;
        Tue, 27 Jun 2023 15:44:21 -0700 (PDT)
Received: from dread.disaster.area (pa49-186-94-37.pa.vic.optusnet.com.au.
 [49.186.94.37])
        by smtp.gmail.com with ESMTPSA id
 y11-20020a62b50b000000b0063f0068cf6csm5878864pfe.198.2023.06.27.15.44.18
        for <linux-xfs@vger.kernel.org>
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Tue, 27 Jun 2023 15:44:20 -0700 (PDT)
Received: from [192.168.253.23] (helo=devoid.disaster.area)
        by dread.disaster.area with esmtp (Exim 4.96)
        (envelope-from <dave@fromorbit.com>)
        id 1qEHPz-00Gzwh-1m
        for linux-xfs@vger.kernel.org;
        Wed, 28 Jun 2023 08:44:15 +1000
Received: from dave by devoid.disaster.area with local (Exim 4.96)
        (envelope-from <dave@devoid.disaster.area>)
        id 1qEHPz-009ZmV-0e
        for linux-xfs@vger.kernel.org;
        Wed, 28 Jun 2023 08:44:15 +1000
From: Dave Chinner <david@fromorbit.com>
To: linux-xfs@vger.kernel.org
Subject: [PATCH 6/8] xfs: journal geometry is not properly bounds checked
Date: Wed, 28 Jun 2023 08:44:10 +1000
Message-Id: <20230627224412.2242198-7-david@fromorbit.com>
X-Mailer: git-send-email 2.40.1
In-Reply-To: <20230627224412.2242198-1-david@fromorbit.com>
References: <20230627224412.2242198-1-david@fromorbit.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <linux-xfs.vger.kernel.org>
X-Mailing-List: linux-xfs@vger.kernel.org

From: Dave Chinner <dchinner@redhat.com>

If the journal geometry results in a sector or log stripe unit
validation problem, it indicates that we cannot set the log up to
safely write to the the journal. In these cases, we must abort the
mount because the corruption needs external intervention to resolve.
Similarly, a journal that is too large cannot be written to safely,
either, so we shouldn't allow those geometries to mount, either.

If the log is too small, we risk having transaction reservations
overruning the available log space and the system hanging waiting
for space it can never provide. This is purely a runtime hang issue,
not a corruption issue as per the first cases listed above. We abort
mounts of the log is too small for V5 filesystems, but we must allow
v4 filesystems to mount because, historically, there was no log size
validity checking and so some systems may still be out there with
undersized logs.

The problem is that on V4 filesystems, when we discover a log
geometry problem, we skip all the remaining checks and then allow
the log to continue mounting. This mean that if one of the log size
checks fails, we skip the log stripe unit check. i.e. we allow the
mount because a "non-fatal" geometry is violated, and then fail to
check the hard fail geometries that should fail the mount.

Move all these fatal checks to the superblock verifier, and add a
new check for the two log sector size geometry variables having the
same values. This will prevent any attempt to mount a log that has
invalid or inconsistent geometries long before we attempt to mount
the log.

However, for the minimum log size checks, we can only do that once
we've setup up the log and calculated all the iclog sizes and
roundoffs. Hence this needs to remain in the log mount code after
the log has been initialised. It is also the only case where we
should allow a v4 filesystem to continue running, so leave that
handling in place, too.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_sb.c | 56 +++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_log.c       | 47 +++++++++++------------------------
 2 files changed, 70 insertions(+), 33 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index ba0f17bc1dc0..5e174685a77c 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -412,7 +412,6 @@ xfs_validate_sb_common(
 	    sbp->sb_inodelog < XFS_DINODE_MIN_LOG			||
 	    sbp->sb_inodelog > XFS_DINODE_MAX_LOG			||
 	    sbp->sb_inodesize != (1 << sbp->sb_inodelog)		||
-	    sbp->sb_logsunit > XLOG_MAX_RECORD_BSIZE			||
 	    sbp->sb_inopblock != howmany(sbp->sb_blocksize,sbp->sb_inodesize) ||
 	    XFS_FSB_TO_B(mp, sbp->sb_agblocks) < XFS_MIN_AG_BYTES	||
 	    XFS_FSB_TO_B(mp, sbp->sb_agblocks) > XFS_MAX_AG_BYTES	||
@@ -430,6 +429,61 @@ xfs_validate_sb_common(
 		return -EFSCORRUPTED;
 	}
 
+	/*
+	 * Logs that are too large are not supported at all. Reject them
+	 * outright. Logs that are too small are tolerated on v4 filesystems,
+	 * but we can only check that when mounting the log. Hence we skip
+	 * those checks here.
+	 */
+	if (sbp->sb_logblocks > XFS_MAX_LOG_BLOCKS) {
+		xfs_notice(mp,
+		"Log size 0x%x blocks too large, maximum size is 0x%llx blocks",
+			 sbp->sb_logblocks, XFS_MAX_LOG_BLOCKS);
+		return -EFSCORRUPTED;
+	}
+
+	if (XFS_FSB_TO_B(mp, sbp->sb_logblocks) > XFS_MAX_LOG_BYTES) {
+		xfs_warn(mp,
+		"log size 0x%llx bytes too large, maximum size is 0x%llx bytes",
+			 XFS_FSB_TO_B(mp, sbp->sb_logblocks),
+			 XFS_MAX_LOG_BYTES);
+		return -EFSCORRUPTED;
+	}
+
+	/*
+	 * Do not allow filesystems with corrupted log sector or stripe units to
+	 * be mounted. We cannot safely size the iclogs or write to the log if
+	 * the log stripe unit is not valid.
+	 */
+	if (sbp->sb_versionnum & XFS_SB_VERSION_SECTORBIT) {
+		if (sbp->sb_logsectsize != (1U << sbp->sb_logsectlog)) {
+			xfs_notice(mp,
+			"log sector size in bytes/log2 (0x%x/0x%x) must match",
+				sbp->sb_logsectsize, 1U << sbp->sb_logsectlog);
+			return -EFSCORRUPTED;
+		}
+	} else if (sbp->sb_logsectsize || sbp->sb_logsectlog) {
+		xfs_notice(mp,
+		"log sector size in bytes/log2 (0x%x/0x%x) are not zero",
+			sbp->sb_logsectsize, sbp->sb_logsectlog);
+		return -EFSCORRUPTED;
+	}
+
+	if (sbp->sb_logsunit > 1) {
+		if (sbp->sb_logsunit % sbp->sb_blocksize) {
+			xfs_notice(mp,
+		"log stripe unit 0x%x bytes must be a multiple of block size",
+				sbp->sb_logsunit);
+			return -EFSCORRUPTED;
+		}
+		if (sbp->sb_logsunit > XLOG_MAX_RECORD_BSIZE) {
+			xfs_notice(mp,
+		"log stripe unit 0x%x bytes over maximum size (0x%x bytes)",
+				sbp->sb_logsunit, XLOG_MAX_RECORD_BSIZE);
+			return -EFSCORRUPTED;
+		}
+	}
+
 	/* Validate the realtime geometry; stolen from xfs_repair */
 	if (sbp->sb_rextsize * sbp->sb_blocksize > XFS_MAX_RTEXTSIZE ||
 	    sbp->sb_rextsize * sbp->sb_blocksize < XFS_MIN_RTEXTSIZE) {
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index fc61cc024023..79004d193e54 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -639,7 +639,6 @@ xfs_log_mount(
 	int		num_bblks)
 {
 	struct xlog	*log;
-	bool		fatal = xfs_has_crc(mp);
 	int		error = 0;
 	int		min_logfsbs;
 
@@ -663,53 +662,37 @@ xfs_log_mount(
 	mp->m_log = log;
 
 	/*
-	 * Validate the given log space and drop a critical message via syslog
-	 * if the log size is too small that would lead to some unexpected
-	 * situations in transaction log space reservation stage.
+	 * Now that we have set up the log and it's internal geometry
+	 * parameters, we can validate the given log space and drop a critical
+	 * message via syslog if the log size is too small. A log that is too
+	 * small can lead to unexpected situations in transaction log space
+	 * reservation stage. The superblock verifier has already validated all
+	 * the other log geometry constraints, so we don't have to check those
+	 * here.
 	 *
-	 * Note: we can't just reject the mount if the validation fails.  This
-	 * would mean that people would have to downgrade their kernel just to
-	 * remedy the situation as there is no way to grow the log (short of
-	 * black magic surgery with xfs_db).
+	 * Note: For v4 filesystems, we can't just reject the mount if the
+	 * validation fails.  This would mean that people would have to
+	 * downgrade their kernel just to remedy the situation as there is no
+	 * way to grow the log (short of black magic surgery with xfs_db).
 	 *
-	 * We can, however, reject mounts for CRC format filesystems, as the
+	 * We can, however, reject mounts for V5 format filesystems, as the
 	 * mkfs binary being used to make the filesystem should never create a
 	 * filesystem with a log that is too small.
 	 */
 	min_logfsbs = xfs_log_calc_minimum_size(mp);
-
 	if (mp->m_sb.sb_logblocks < min_logfsbs) {
 		xfs_warn(mp,
 		"Log size %d blocks too small, minimum size is %d blocks",
 			 mp->m_sb.sb_logblocks, min_logfsbs);
-		error = -EINVAL;
-	} else if (mp->m_sb.sb_logblocks > XFS_MAX_LOG_BLOCKS) {
-		xfs_warn(mp,
-		"Log size %d blocks too large, maximum size is %lld blocks",
-			 mp->m_sb.sb_logblocks, XFS_MAX_LOG_BLOCKS);
-		error = -EINVAL;
-	} else if (XFS_FSB_TO_B(mp, mp->m_sb.sb_logblocks) > XFS_MAX_LOG_BYTES) {
-		xfs_warn(mp,
-		"log size %lld bytes too large, maximum size is %lld bytes",
-			 XFS_FSB_TO_B(mp, mp->m_sb.sb_logblocks),
-			 XFS_MAX_LOG_BYTES);
-		error = -EINVAL;
-	} else if (mp->m_sb.sb_logsunit > 1 &&
-		   mp->m_sb.sb_logsunit % mp->m_sb.sb_blocksize) {
-		xfs_warn(mp,
-		"log stripe unit %u bytes must be a multiple of block size",
-			 mp->m_sb.sb_logsunit);
-		error = -EINVAL;
-		fatal = true;
-	}
-	if (error) {
+
 		/*
 		 * Log check errors are always fatal on v5; or whenever bad
 		 * metadata leads to a crash.
 		 */
-		if (fatal) {
+		if (xfs_has_crc(mp)) {
 			xfs_crit(mp, "AAIEEE! Log failed size checks. Abort!");
 			ASSERT(0);
+			error = -EINVAL;
 			goto out_free_log;
 		}
 		xfs_crit(mp, "Log size out of supported range.");

From patchwork Tue Jun 27 22:44:11 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Dave Chinner <david@fromorbit.com>
X-Patchwork-Id: 13295071
Return-Path: <linux-xfs-owner@vger.kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 28FFDEB64DC
	for <linux-xfs@archiver.kernel.org>; Tue, 27 Jun 2023 22:44:29 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S229840AbjF0Wo2 (ORCPT <rfc822;linux-xfs@archiver.kernel.org>);
        Tue, 27 Jun 2023 18:44:28 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60358 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S229944AbjF0Wo0 (ORCPT
        <rfc822;linux-xfs@vger.kernel.org>); Tue, 27 Jun 2023 18:44:26 -0400
Received: from mail-pf1-x432.google.com (mail-pf1-x432.google.com
 [IPv6:2607:f8b0:4864:20::432])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 217812944
        for <linux-xfs@vger.kernel.org>; Tue, 27 Jun 2023 15:44:22 -0700 (PDT)
Received: by mail-pf1-x432.google.com with SMTP id
 d2e1a72fcca58-666e3b15370so3359840b3a.0
        for <linux-xfs@vger.kernel.org>; Tue, 27 Jun 2023 15:44:22 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=fromorbit-com.20221208.gappssmtp.com; s=20221208; t=1687905861;
 x=1690497861;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:to:from:from:to:cc:subject:date:message-id
         :reply-to;
        bh=XgXhhqNbFi+/MFXpR0K7xhzjAOkJwBg8lJjaklWSvcA=;
        b=itAlwsAQZXNYjX+H/2/t6ZfBld3O31tQQMTJbzz/MpxIyTpNGkKoU7fE+8lfqirKRh
         Yyi5U8aNwzIvfjYgmg3u992g5uUzuPmvly/0F6GTjik/hu+D8GYk/jXj23/ObH50I0gc
         si3L0z9MN8cjJ/FhboVLkDnyn/cnC7aApLkv8jBv/nHxQbdqzrUEe/WdsYlgJjD9WkKd
         ZeVaJ9MFcZ/gB/MsE2CZl/wldBaQIYEFATEwlt2YNQSNPmnVwhhsFxBF+XL+11PafdqY
         /NbJvLd5h+kjW2ggxclXxOhzbo1Xdr2qGGcsm4Vqtjk/PUWVZEm/C6AndHG5tF5H1YM/
         DSXg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20221208; t=1687905861; x=1690497861;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:to:from:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=XgXhhqNbFi+/MFXpR0K7xhzjAOkJwBg8lJjaklWSvcA=;
        b=AZvZvaxX0UeSWypw1qLR0cziwjpebkV3BOkMHBVbQbaNbh7/cQwYtJIsmNkuY5tspL
         FZ8KjlfU2qxi+Z0n9arYL9thEETkaap7MdXiRNTiC0D2cXCO4+U7WWkBsedjvpFeTH/h
         WFps0c9TKkkTREMCxTRh+khO3PjMK6TugeXoU47aQ55iNPLlg/L2lhShBYT9y8+eixxS
         BMxw5Z9bRnacNM+Y1Fm2NXq0pEK+l9Dg2iixdrYJXvIOm4Yj9fMeM5Ur39uLMfK8I3oA
         GJFopMgAejwfGTyrD/UjTqpGuAm2WeINj9AiToBXQvYJLExvWBWowpJS4W+pkw7pyzHe
         CMaA==
X-Gm-Message-State: AC+VfDwDIEY3zwzTcrX83JRRxi1XTbLmOmb2INkVJ5n2s0qW4vlttHCq
        bifWxVtVHZ/wwzkZjrpmqSZ/4C39bdUNBq22yIs=
X-Google-Smtp-Source: 
 ACHHUZ49aXbvIzsF9AeTj++Gor4TVgJelxbvNl1FVSZxXSSsfsnLLt8xpYOEGMdev0yrWeHWUQ/nsA==
X-Received: by 2002:a05:6a00:99c:b0:673:8dfb:af32 with SMTP id
 u28-20020a056a00099c00b006738dfbaf32mr8611000pfg.26.1687905861294;
        Tue, 27 Jun 2023 15:44:21 -0700 (PDT)
Received: from dread.disaster.area (pa49-186-94-37.pa.vic.optusnet.com.au.
 [49.186.94.37])
        by smtp.gmail.com with ESMTPSA id
 p15-20020a63e64f000000b0051b0e564963sm6350434pgj.49.2023.06.27.15.44.18
        for <linux-xfs@vger.kernel.org>
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Tue, 27 Jun 2023 15:44:20 -0700 (PDT)
Received: from [192.168.253.23] (helo=devoid.disaster.area)
        by dread.disaster.area with esmtp (Exim 4.96)
        (envelope-from <dave@fromorbit.com>)
        id 1qEHPz-00Gzwk-1y
        for linux-xfs@vger.kernel.org;
        Wed, 28 Jun 2023 08:44:15 +1000
Received: from dave by devoid.disaster.area with local (Exim 4.96)
        (envelope-from <dave@devoid.disaster.area>)
        id 1qEHPz-009Zma-0o
        for linux-xfs@vger.kernel.org;
        Wed, 28 Jun 2023 08:44:15 +1000
From: Dave Chinner <david@fromorbit.com>
To: linux-xfs@vger.kernel.org
Subject: [PATCH 7/8] xfs: AGF length has never been bounds checked
Date: Wed, 28 Jun 2023 08:44:11 +1000
Message-Id: <20230627224412.2242198-8-david@fromorbit.com>
X-Mailer: git-send-email 2.40.1
In-Reply-To: <20230627224412.2242198-1-david@fromorbit.com>
References: <20230627224412.2242198-1-david@fromorbit.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <linux-xfs.vger.kernel.org>
X-Mailing-List: linux-xfs@vger.kernel.org

From: Dave Chinner <dchinner@redhat.com>

The AGF verifier does not check that the AGF length field is within
known good bounds. This has never been checked by runtime kernel
code (i.e. the lack of verification goes back to 1993) yet we assume
in many places that it is correct and verify other metdata against
it.

Add length verification to the AGF verifier. The length of the AGF
must be equal to the size of the AG specified in the superblock,
unless it is the last AG in the filesystem. In that case, it must be
less than or equal to sb->sb_agblocks and greater than
XFS_MIN_AG_BLOCKS, which is the smallest AG a growfs operation will
allow to exist.

This requires a bit of rework of the verifier function. We want to
verify metadata before we use it to verify other metadata. Hence
we need to verify the AGF sequence numbers before using them to
verify the length of the AGF. Then we can verify the AGF length
before we verify AGFL fields. Then we can verifier other fields that
are bounds limited by the AGF length.

And, finally, by calculating agf_length only once into a local
variable, we can collapse repeated "if (xfs_has_foo() &&"
conditionaly checks into single checks. This makes the code much
easier to follow as all the checks for a given feature are obviously
in the same place.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc.c | 86 +++++++++++++++++++++++----------------
 1 file changed, 51 insertions(+), 35 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 1e72b91daff6..7c86a69354fb 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2974,6 +2974,7 @@ xfs_agf_verify(
 {
 	struct xfs_mount	*mp = bp->b_mount;
 	struct xfs_agf		*agf = bp->b_addr;
+	uint32_t		agf_length = be32_to_cpu(agf->agf_length);
 
 	if (xfs_has_crc(mp)) {
 		if (!uuid_equal(&agf->agf_uuid, &mp->m_sb.sb_meta_uuid))
@@ -2985,18 +2986,43 @@ xfs_agf_verify(
 	if (!xfs_verify_magic(bp, agf->agf_magicnum))
 		return __this_address;
 
-	if (!(XFS_AGF_GOOD_VERSION(be32_to_cpu(agf->agf_versionnum)) &&
-	      be32_to_cpu(agf->agf_freeblks) <= be32_to_cpu(agf->agf_length) &&
-	      be32_to_cpu(agf->agf_flfirst) < xfs_agfl_size(mp) &&
-	      be32_to_cpu(agf->agf_fllast) < xfs_agfl_size(mp) &&
-	      be32_to_cpu(agf->agf_flcount) <= xfs_agfl_size(mp)))
+	if (!XFS_AGF_GOOD_VERSION(be32_to_cpu(agf->agf_versionnum)))
 		return __this_address;
 
-	if (be32_to_cpu(agf->agf_length) > mp->m_sb.sb_dblocks)
+	/*
+	 * Both agf_seqno and agf_length need to validated before anything else
+	 * block number related in the AGF or AGFL can be checked.
+	 *
+	 * During growfs operations, the perag is not fully initialised,
+	 * so we can't use it for any useful checking. growfs ensures we can't
+	 * use it by using uncached buffers that don't have the perag attached
+	 * so we can detect and avoid this problem.
+	 */
+	if (bp->b_pag && be32_to_cpu(agf->agf_seqno) != bp->b_pag->pag_agno)
+		return __this_address;
+
+	/*
+	 * Only the last AGF in the filesytsem is allowed to be shorter
+	 * than the AG size recorded in the superblock.
+	 */
+	if (agf_length != mp->m_sb.sb_agblocks) {
+		if (be32_to_cpu(agf->agf_seqno) != mp->m_sb.sb_agcount - 1)
+			return __this_address;
+		if (agf_length < XFS_MIN_AG_BLOCKS)
+			return __this_address;
+		if (agf_length > mp->m_sb.sb_agblocks)
+			return __this_address;
+	}
+
+	if (be32_to_cpu(agf->agf_flfirst) >= xfs_agfl_size(mp))
+		return __this_address;
+	if (be32_to_cpu(agf->agf_fllast) >= xfs_agfl_size(mp))
+		return __this_address;
+	if (be32_to_cpu(agf->agf_flcount) > xfs_agfl_size(mp))
 		return __this_address;
 
 	if (be32_to_cpu(agf->agf_freeblks) < be32_to_cpu(agf->agf_longest) ||
-	    be32_to_cpu(agf->agf_freeblks) > be32_to_cpu(agf->agf_length))
+	    be32_to_cpu(agf->agf_freeblks) > agf_length)
 		return __this_address;
 
 	if (be32_to_cpu(agf->agf_levels[XFS_BTNUM_BNO]) < 1 ||
@@ -3007,38 +3033,28 @@ xfs_agf_verify(
 						mp->m_alloc_maxlevels)
 		return __this_address;
 
-	if (xfs_has_rmapbt(mp) &&
-	    (be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]) < 1 ||
-	     be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]) >
-						mp->m_rmap_maxlevels))
-		return __this_address;
-
-	if (xfs_has_rmapbt(mp) &&
-	    be32_to_cpu(agf->agf_rmap_blocks) > be32_to_cpu(agf->agf_length))
-		return __this_address;
-
-	/*
-	 * during growfs operations, the perag is not fully initialised,
-	 * so we can't use it for any useful checking. growfs ensures we can't
-	 * use it by using uncached buffers that don't have the perag attached
-	 * so we can detect and avoid this problem.
-	 */
-	if (bp->b_pag && be32_to_cpu(agf->agf_seqno) != bp->b_pag->pag_agno)
-		return __this_address;
-
 	if (xfs_has_lazysbcount(mp) &&
-	    be32_to_cpu(agf->agf_btreeblks) > be32_to_cpu(agf->agf_length))
+	    be32_to_cpu(agf->agf_btreeblks) > agf_length)
 		return __this_address;
 
-	if (xfs_has_reflink(mp) &&
-	    be32_to_cpu(agf->agf_refcount_blocks) >
-	    be32_to_cpu(agf->agf_length))
-		return __this_address;
+	if (xfs_has_rmapbt(mp)) {
+		if (be32_to_cpu(agf->agf_rmap_blocks) > agf_length)
+			return __this_address;
 
-	if (xfs_has_reflink(mp) &&
-	    (be32_to_cpu(agf->agf_refcount_level) < 1 ||
-	     be32_to_cpu(agf->agf_refcount_level) > mp->m_refc_maxlevels))
-		return __this_address;
+		if (be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]) < 1 ||
+		    be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]) >
+							mp->m_rmap_maxlevels)
+			return __this_address;
+	}
+
+	if (xfs_has_reflink(mp)) {
+		if (be32_to_cpu(agf->agf_refcount_blocks) > agf_length)
+			return __this_address;
+
+		if (be32_to_cpu(agf->agf_refcount_level) < 1 ||
+		    be32_to_cpu(agf->agf_refcount_level) > mp->m_refc_maxlevels)
+			return __this_address;
+	}
 
 	return NULL;
 }

From patchwork Tue Jun 27 22:44:12 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Dave Chinner <david@fromorbit.com>
X-Patchwork-Id: 13295072
Return-Path: <linux-xfs-owner@vger.kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 1A661C001B0
	for <linux-xfs@archiver.kernel.org>; Tue, 27 Jun 2023 22:44:30 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S229944AbjF0Wo2 (ORCPT <rfc822;linux-xfs@archiver.kernel.org>);
        Tue, 27 Jun 2023 18:44:28 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60306 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S229982AbjF0WoZ (ORCPT
        <rfc822;linux-xfs@vger.kernel.org>); Tue, 27 Jun 2023 18:44:25 -0400
Received: from mail-pf1-x435.google.com (mail-pf1-x435.google.com
 [IPv6:2607:f8b0:4864:20::435])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3B99326A4
        for <linux-xfs@vger.kernel.org>; Tue, 27 Jun 2023 15:44:21 -0700 (PDT)
Received: by mail-pf1-x435.google.com with SMTP id
 d2e1a72fcca58-666eec46206so4589609b3a.3
        for <linux-xfs@vger.kernel.org>; Tue, 27 Jun 2023 15:44:21 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=fromorbit-com.20221208.gappssmtp.com; s=20221208; t=1687905860;
 x=1690497860;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:to:from:from:to:cc:subject:date:message-id
         :reply-to;
        bh=Y8HfHfWhYbh6r3O4pO/b8MI4sO9R7L0l1wq6+nMOuKU=;
        b=u4ZJ0WzqPl0HW5OMvFaacz90LZ8ardlFocLk93osQIX6YML5lcMSkNC6BK97kbnUT8
         EU7p38p0TfqCSc3/zs59Arc9MlVpZiBZgUUFlkC3dyX0fAc8efe4kwS220BzvD5KmOej
         FC/HsDLUaJIgk0XiHVg4Ez7O/PZjMO38Csk/FxB4/ggYe+3Zf28qmmJit0mvOjMCSGIG
         553iH3uI3heIyAU7Dd+Bwju9TKAhjjOFCZe4C38xVlcyKr7XXchGKgpyDqKd2lWyed1E
         vX9qvBR14O95/0Asi74Ljhpd0dgcN/dBynlgIx/7cd03biFNl2mW/nM97ai5hoSpj0y5
         yE7w==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20221208; t=1687905860; x=1690497860;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:to:from:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=Y8HfHfWhYbh6r3O4pO/b8MI4sO9R7L0l1wq6+nMOuKU=;
        b=U0IqsFJamRwwW6NRvuzcwXnapm/R8TX90sR1iBe6aayydCNvi8KuHU6HShOkUwIYXu
         +pjk/WHdiFIFME/z3rmgimu33QdaCsITsFll8mZPrTTd3zhHyrz4AXR3YKUgXyK2x8Km
         ft+iA5KN9VUzM/uHBimBHS1ND8SkqtvVWfXklBmbg5Z0kUx977hMeFlADvlAshJa6ORc
         D36YPZ/BotuxQPXQsqs/YJy6QQW4hB9+3PEGfEX3AwhP9/wMuQesCI7x/EYQKm+KZqhb
         I7EPJcF3/MRdA4isdH3SGLZwm0vLLIB7v3Yc6aWHgy+rbectJ7jqjoNtxuG6bd/p6+01
         O7aw==
X-Gm-Message-State: AC+VfDzz4tt+M37vbsbSZArMUZ4Hzkj4OabM1TAJVWEcdsmQQgmQpcf2
        IRvKMioKPPbjf+JJPN/SYcwAl281erzsUQszPAE=
X-Google-Smtp-Source: 
 ACHHUZ5xEf5S3eUoKxKZ619s8YgkgDXaGKXpo+RNNC0Jd+6uP1F6w2es7l0XCFnUlzP8C8BfjT/odg==
X-Received: by 2002:a05:6a20:138a:b0:129:3c9:ffab with SMTP id
 hn10-20020a056a20138a00b0012903c9ffabmr4994299pzc.45.1687905860719;
        Tue, 27 Jun 2023 15:44:20 -0700 (PDT)
Received: from dread.disaster.area (pa49-186-94-37.pa.vic.optusnet.com.au.
 [49.186.94.37])
        by smtp.gmail.com with ESMTPSA id
 h3-20020a635303000000b0051b9e82d6d6sm6195234pgb.40.2023.06.27.15.44.18
        for <linux-xfs@vger.kernel.org>
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Tue, 27 Jun 2023 15:44:20 -0700 (PDT)
Received: from [192.168.253.23] (helo=devoid.disaster.area)
        by dread.disaster.area with esmtp (Exim 4.96)
        (envelope-from <dave@fromorbit.com>)
        id 1qEHPz-00Gzwn-27
        for linux-xfs@vger.kernel.org;
        Wed, 28 Jun 2023 08:44:15 +1000
Received: from dave by devoid.disaster.area with local (Exim 4.96)
        (envelope-from <dave@devoid.disaster.area>)
        id 1qEHPz-009Zmf-0y
        for linux-xfs@vger.kernel.org;
        Wed, 28 Jun 2023 08:44:15 +1000
From: Dave Chinner <david@fromorbit.com>
To: linux-xfs@vger.kernel.org
Subject: [PATCH 8/8] xfs: fix bounds check in xfs_defer_agfl_block()
Date: Wed, 28 Jun 2023 08:44:12 +1000
Message-Id: <20230627224412.2242198-9-david@fromorbit.com>
X-Mailer: git-send-email 2.40.1
In-Reply-To: <20230627224412.2242198-1-david@fromorbit.com>
References: <20230627224412.2242198-1-david@fromorbit.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <linux-xfs.vger.kernel.org>
X-Mailing-List: linux-xfs@vger.kernel.org

From: Dave Chinner <dchinner@redhat.com>

Need to happen before we allocate and then leak the xefi. Found by
coverity via an xfsprogs libxfs scan.

Fixes: 7dfee17b13e5 ("xfs: validate block number being freed before adding to xefi")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 7c86a69354fb..9919fdfe1d7e 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2470,25 +2470,26 @@ static int
 xfs_defer_agfl_block(
 	struct xfs_trans		*tp,
 	xfs_agnumber_t			agno,
-	xfs_fsblock_t			agbno,
+	xfs_agblock_t			agbno,
 	struct xfs_owner_info		*oinfo)
 {
 	struct xfs_mount		*mp = tp->t_mountp;
 	struct xfs_extent_free_item	*xefi;
+	xfs_fsblock_t			fsbno = XFS_AGB_TO_FSB(mp, agno, agbno);
 
 	ASSERT(xfs_extfree_item_cache != NULL);
 	ASSERT(oinfo != NULL);
 
+	if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbno(mp, fsbno)))
+		return -EFSCORRUPTED;
+
 	xefi = kmem_cache_zalloc(xfs_extfree_item_cache,
 			       GFP_KERNEL | __GFP_NOFAIL);
-	xefi->xefi_startblock = XFS_AGB_TO_FSB(mp, agno, agbno);
+	xefi->xefi_startblock = fsbno;
 	xefi->xefi_blockcount = 1;
 	xefi->xefi_owner = oinfo->oi_owner;
 	xefi->xefi_type = XFS_AG_RESV_AGFL;
 
-	if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbno(mp, xefi->xefi_startblock)))
-		return -EFSCORRUPTED;
-
 	trace_xfs_agfl_free_defer(mp, agno, 0, agbno, 1);
 
 	xfs_extent_free_get_group(mp, xefi);

From patchwork Thu Jun 29 19:42:30 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Darrick J. Wong" <djwong@kernel.org>
X-Patchwork-Id: 13297252
Return-Path: <linux-xfs-owner@vger.kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 89D83EB64D9
	for <linux-xfs@archiver.kernel.org>; Thu, 29 Jun 2023 19:42:35 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S229742AbjF2Tme (ORCPT <rfc822;linux-xfs@archiver.kernel.org>);
        Thu, 29 Jun 2023 15:42:34 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57584 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S232066AbjF2Tmd (ORCPT
        <rfc822;linux-xfs@vger.kernel.org>); Thu, 29 Jun 2023 15:42:33 -0400
Received: from dfw.source.kernel.org (dfw.source.kernel.org
 [IPv6:2604:1380:4641:c500::1])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4307C2D4E
        for <linux-xfs@vger.kernel.org>; Thu, 29 Jun 2023 12:42:32 -0700 (PDT)
Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140])
        (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
         key-exchange X25519 server-signature RSA-PSS (2048 bits))
        (No client certificate requested)
        by dfw.source.kernel.org (Postfix) with ESMTPS id 711E361615
        for <linux-xfs@vger.kernel.org>; Thu, 29 Jun 2023 19:42:31 +0000 (UTC)
Received: by smtp.kernel.org (Postfix) with ESMTPSA id CE0BCC433C8;
        Thu, 29 Jun 2023 19:42:30 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org;
        s=k20201202; t=1688067750;
        bh=IBN+6BTxGjXE1reY/HLR6yMBhAjZqYYLo4fs2n1tal8=;
        h=Date:From:To:Cc:Subject:References:In-Reply-To:From;
        b=Zw5w3MqQC0HBnP2iRBpUssn9Nm5OH9HcfOrNBmwRZ4dV8QU3yUL7l/zNpKzKxq3wT
         20T15A/Idt8QX8hKzxyBZrS0558fJUQABXJTuowYVEoZGnDLRYpHGtgGH7P33L+Ns4
         jLeXqTHm4mIDczpg8+TgBySeJaI52gU4XoSBXpfIQAE2t7hdiCQz5jGI9Lmm9ZMhXb
         9vmY3snlbbNNs+L/SbEUIbwVqvNseV+GfHjuwf0N3vXkIxJTvq2J93qx4RWPKMwTqc
         cm2c3SOhWomaXNp1QFRHmgCmnPv17Lj1Hae0NuUn+BkzULGe3dQs1DrNuReigKdR1L
         g7yitsrJ9ogjg==
Date: Thu, 29 Jun 2023 12:42:30 -0700
From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: [RFC PATCH 9/8] xfs: AGI length should be bounds checked
Message-ID: <20230629194230.GH11441@frogsfrogsfrogs>
References: <20230627224412.2242198-1-david@fromorbit.com>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <20230627224412.2242198-1-david@fromorbit.com>
Precedence: bulk
List-ID: <linux-xfs.vger.kernel.org>
X-Mailing-List: linux-xfs@vger.kernel.org

From: Darrick J. Wong <djwong@kernel.org>

Similar to the recent patch strengthening the AGF agf_length
verification, the AGI verifier does not check that the AGI length field
is within known good bounds.  This isn't currently checked by runtime
kernel code, yet we assume in many places that it is correct and verify
other metadata against it.

Add length verification to the AGI verifier.  Just like the AGF length
checking, the length of the AGI must be equal to the size of the AG
specified in the superblock, unless it is the last AG in the filesystem.
In that case, it must be less than or equal to sb->sb_agblocks and
greater than XFS_MIN_AG_BLOCKS, which is the smallest AG a growfs
operation will allow to exist.

There's only one place in the filesystem that actually uses agi_length,
but let's not leave it vulnerable to the same weird nonsense that
generates syzbot bugs, eh?

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
NOTE: Untested, but this patch builds...
---
 fs/xfs/libxfs/xfs_ialloc.c |   49 ++++++++++++++++++++++++++++++++------------
 1 file changed, 36 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 1e5fafbc0cdb..fec6713e1fa9 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -2486,11 +2486,12 @@ xfs_ialloc_log_agi(
 
 static xfs_failaddr_t
 xfs_agi_verify(
-	struct xfs_buf	*bp)
+	struct xfs_buf		*bp)
 {
-	struct xfs_mount *mp = bp->b_mount;
-	struct xfs_agi	*agi = bp->b_addr;
-	int		i;
+	struct xfs_mount	*mp = bp->b_mount;
+	struct xfs_agi		*agi = bp->b_addr;
+	uint32_t		agi_length = be32_to_cpu(agi->agi_length);
+	int			i;
 
 	if (xfs_has_crc(mp)) {
 		if (!uuid_equal(&agi->agi_uuid, &mp->m_sb.sb_meta_uuid))
@@ -2507,6 +2508,37 @@ xfs_agi_verify(
 	if (!XFS_AGI_GOOD_VERSION(be32_to_cpu(agi->agi_versionnum)))
 		return __this_address;
 
+	/*
+	 * Both agi_seqno and agi_length need to validated before anything else
+	 * block number related in the AGI can be checked.
+	 *
+	 * During growfs operations, the perag is not fully initialised,
+	 * so we can't use it for any useful checking. growfs ensures we can't
+	 * use it by using uncached buffers that don't have the perag attached
+	 * so we can detect and avoid this problem.
+	 */
+	if (bp->b_pag && be32_to_cpu(agi->agi_seqno) != bp->b_pag->pag_agno)
+		return __this_address;
+
+	/*
+	 * Only the last AGI in the filesytsem is allowed to be shorter
+	 * than the AG size recorded in the superblock.
+	 */
+	if (agi_length != mp->m_sb.sb_agblocks) {
+		/*
+		 * During growfs, the new last AGI can get here before we
+		 * have updated the superblock. Give it a pass on the seqno
+		 * check.
+		 */
+		if (bp->b_pag &&
+		    be32_to_cpu(agi->agi_seqno) != mp->m_sb.sb_agcount - 1)
+			return __this_address;
+		if (agi_length < XFS_MIN_AG_BLOCKS)
+			return __this_address;
+		if (agi_length > mp->m_sb.sb_agblocks)
+			return __this_address;
+	}
+
 	if (be32_to_cpu(agi->agi_level) < 1 ||
 	    be32_to_cpu(agi->agi_level) > M_IGEO(mp)->inobt_maxlevels)
 		return __this_address;
@@ -2516,15 +2548,6 @@ xfs_agi_verify(
 	     be32_to_cpu(agi->agi_free_level) > M_IGEO(mp)->inobt_maxlevels))
 		return __this_address;
 
-	/*
-	 * during growfs operations, the perag is not fully initialised,
-	 * so we can't use it for any useful checking. growfs ensures we can't
-	 * use it by using uncached buffers that don't have the perag attached
-	 * so we can detect and avoid this problem.
-	 */
-	if (bp->b_pag && be32_to_cpu(agi->agi_seqno) != bp->b_pag->pag_agno)
-		return __this_address;
-
 	for (i = 0; i < XFS_AGI_UNLINKED_BUCKETS; i++) {
 		if (agi->agi_unlinked[i] == cpu_to_be32(NULLAGINO))
 			continue;