From patchwork Tue Jun 27 22:44:05 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 13295068 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E3925C001B0 for ; Tue, 27 Jun 2023 22:44:26 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229895AbjF0WoZ (ORCPT ); Tue, 27 Jun 2023 18:44:25 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60312 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229501AbjF0WoX (ORCPT ); Tue, 27 Jun 2023 18:44:23 -0400 Received: from mail-pl1-x62a.google.com (mail-pl1-x62a.google.com [IPv6:2607:f8b0:4864:20::62a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 64B8E272D for ; Tue, 27 Jun 2023 15:44:19 -0700 (PDT) Received: by mail-pl1-x62a.google.com with SMTP id d9443c01a7336-1b51488ad67so28100925ad.3 for ; Tue, 27 Jun 2023 15:44:19 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fromorbit-com.20221208.gappssmtp.com; s=20221208; t=1687905859; x=1690497859; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:to:from:from:to:cc:subject:date:message-id :reply-to; bh=8Pt9hTh40OmTegOekLcRzx5llx2d/nej7rlH4ks8rf4=; b=mom7otQF0a2HN08fs9kdMIgmJR1YOyc7Tea2SHuWXX277jsbkhuVfpETqFNatTyNEo AQSI2f8ASRNhYHzvceu4kph06vQ9//cIGRUPu3xNWbSnbyPHuU7XAtT2PYethdfj01EV N2Za+WmccMR9ygztBScc17aLFOZ2Q9TGESZ5+lwKcwXqxywleVBV3R8cFpImntM0EH8/ SwHB0mpiChMVeHii+VLrU4xW8vdROZFJ8Y5CAzNXbai9ren1iIORRIiUreLRTT7mUAJq 9Njj2k3vYMgVmEdQuDvjoJTHUi6Krhis4xm+DbGLlV2uKhFf6Ab9AaVwlSdWi3+NAgOR xMFA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1687905859; x=1690497859; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=8Pt9hTh40OmTegOekLcRzx5llx2d/nej7rlH4ks8rf4=; b=WavzeHeZwkwzCCNhDLpiBSbAiJXtAxU4uS83sAHukj7pEdvLmvhMpgxbmH/IUJveY7 oIy3WS5HQCnniTZup4nF/LhkDzSMp9516Y1kUhKtS7iSIGtgBNi/cpt2MT0r6sNrwzmw PeVh4idqBgoaQfZecvLfH37CMsQ4iUF+NmxDdc+cwBgeXaL8tY4D5n3qZc5dhjHyjRf4 pGiYDWTjkIjcVLpyLRSjlWSECAHLtjSJnQDBfRToxLhjY4duJDby+7xtAieSNS0yi64N b5ZSWvpDH9s0kjIksqK2OujAcCabRDRVbultip+V/kIWsWV5zGXmlbT5U3YXTngs2R9R UIXQ== X-Gm-Message-State: AC+VfDzLcd878wxDUen8WNnYEAiNF24UdfoRp9QZqx+qPSCBYk9Vg7W8 fSWKM+eyze7KwZ3RlFEHb+JZyOwvwnVlp2o5KYg= X-Google-Smtp-Source: ACHHUZ6haVb/ehzeHZFWdzfjctyqRHDgzI7KB0p32JOJcvehJ+yGPjuuQe+Ov6hZmK7jPUD10lmF7A== X-Received: by 2002:a17:903:2343:b0:1b5:54cc:fcb7 with SMTP id c3-20020a170903234300b001b554ccfcb7mr9955372plh.34.1687905858863; Tue, 27 Jun 2023 15:44:18 -0700 (PDT) Received: from dread.disaster.area (pa49-186-94-37.pa.vic.optusnet.com.au. [49.186.94.37]) by smtp.gmail.com with ESMTPSA id jn16-20020a170903051000b001b53472353csm6403955plb.211.2023.06.27.15.44.17 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 27 Jun 2023 15:44:17 -0700 (PDT) Received: from [192.168.253.23] (helo=devoid.disaster.area) by dread.disaster.area with esmtp (Exim 4.96) (envelope-from ) id 1qEHPz-00GzwS-0z for linux-xfs@vger.kernel.org; Wed, 28 Jun 2023 08:44:15 +1000 Received: from dave by devoid.disaster.area with local (Exim 4.96) (envelope-from ) id 1qEHPy-009Zm7-37 for linux-xfs@vger.kernel.org; Wed, 28 Jun 2023 08:44:14 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 1/8] xfs: don't reverse order of items in bulk AIL insertion Date: Wed, 28 Jun 2023 08:44:05 +1000 Message-Id: <20230627224412.2242198-2-david@fromorbit.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230627224412.2242198-1-david@fromorbit.com> References: <20230627224412.2242198-1-david@fromorbit.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner XFS has strict metadata ordering requirements. One of the things it does is maintain the commit order of items from transaction commit through the CIL and into the AIL. That is, if a transaction logs item A before item B in a modification, then they will be inserted into the CIL in the order {A, B}. These items are then written into the iclog during checkpointing in the order {A, B}. When the checkpoint commits, they are supposed to be inserted into the AIL in the order {A, B}, and when they are pushed from the AIL, they are pushed in the order {A, B}. If we crash, log recovery then replays the two items from the checkpoint in the order {A, B}, resulting in the objects the items apply to being queued for writeback at the end of the checkpoint in the order {A, B}. This means recovery behaves the same way as the runtime code. In places, we have subtle dependencies on this ordering being maintained. One of this place is performing intent recovery from the log. It assumes that recovering an intent will result in a non-intent object being the first thing that is modified in the recovery transaction, and so when the transaction commits and the journal flushes, the first object inserted into the AIL beyond the intent recovery range will be a non-intent item. It uses the transistion from intent items to non-intent items to stop the recovery pass. A recent log recovery issue indicated that an intent was appearing as the first item in the AIL beyond the recovery range, hence breaking the end of recovery detection that exists. Tracing indicated insertion of the items into the AIL was apparently occurring in the right order (the intent was last in the commit item list), but the intent was appearing first in the AIL. IOWs, the order of items in the AIL was {D,C,B,A}, not {A,B,C,D}, and bulk insertion was reversing the order of the items in the batch of items being inserted. Lucky for us, all the items fed to bulk insertion have the same LSN, so the reversal of order does not affect the log head/tail tracking that is based on the contents of the AIL. It only impacts on code that has implicit, subtle dependencies on object order, and AFAICT only the intent recovery loop is impacted by it. Make sure bulk AIL insertion does not reorder items incorrectly. Fixes: 0e57f6a36f9b ("xfs: bulk AIL insertion during transaction commit") Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig Reviewed-by: Chandan Babu R Reviewed-by: Darrick J. Wong --- fs/xfs/xfs_trans_ail.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c index 7d4109af193e..1098452e7f95 100644 --- a/fs/xfs/xfs_trans_ail.c +++ b/fs/xfs/xfs_trans_ail.c @@ -823,7 +823,7 @@ xfs_trans_ail_update_bulk( trace_xfs_ail_insert(lip, 0, lsn); } lip->li_lsn = lsn; - list_add(&lip->li_ail, &tmp); + list_add_tail(&lip->li_ail, &tmp); } if (!list_empty(&tmp)) From patchwork Tue Jun 27 22:44:06 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 13295074 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9496CEB64DC for ; Tue, 27 Jun 2023 22:44:31 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230004AbjF0Wo3 (ORCPT ); Tue, 27 Jun 2023 18:44:29 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60370 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229501AbjF0Wo2 (ORCPT ); Tue, 27 Jun 2023 18:44:28 -0400 Received: from mail-oi1-x231.google.com (mail-oi1-x231.google.com [IPv6:2607:f8b0:4864:20::231]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7B51F2945 for ; Tue, 27 Jun 2023 15:44:22 -0700 (PDT) Received: by mail-oi1-x231.google.com with SMTP id 5614622812f47-3a0423ea749so4126043b6e.0 for ; Tue, 27 Jun 2023 15:44:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fromorbit-com.20221208.gappssmtp.com; s=20221208; t=1687905861; x=1690497861; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:to:from:from:to:cc:subject:date:message-id :reply-to; bh=HQFgV6ssEPI/mbuynXLID8AB1DFeefdovCY9d+I6BuQ=; b=QClWJY1CdrPrdWcb5pjZWkFt47ZfzXO6KAVf3efpVcz0hTmTmEB4kFALdPAEMjRal8 lPkimnNmD/eKZK3jGmVuOoafB+vJCvEUeKpGC0887DfIPHAi95tJb5wYB/TdIoyZe0lY f2/bXcuCPKsbfSawx89zF8xUKNg1dBq8lnZ/UmfMbe6isWyHPkXLkhT4LA4i2NGm9QYI /ec7EfJpPqZ0N4nV8jaXhXSS8nT6aYq1Rw6lUyLu3Xgg4c2xRxMb2FAKjTelr7RdDmfO J0nqkrTub/7/6bfStYdDPZ5pxthkC6XA1pZdqh1JNYgGXP/oFiiVvaoPDPd4sicpW2Kd 9Ygw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1687905861; x=1690497861; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=HQFgV6ssEPI/mbuynXLID8AB1DFeefdovCY9d+I6BuQ=; b=Tk8X5XAj853vTgLtCZjYzDm8yHH3u2o/MjqzJVRECPeKQ/DCNPw+nuMsnZOL6j4lDZ 75R9MbVsFfwjtQKmsc0qtts2Fqj5EMSL6QHLSL42wQmrpOFC97+Qxzu3c0GTYKNpTDmB CI71jFaw596OkJAbQ+MtVxnyIvoUcjLg+GRnVeIvgPbmmDwILzFVu7YWC6SfJLsFyJLG rJy7sESyIYcTrLbYkHYfv4dViSQFEQB3Gk2qm250XbLO0RURRjxNQEl3jyYGWFLqh/pg WGHj+ErOKGbZ1wZGFur0l4oWo3ueCKb2lWNgYg/AlyIVirm0BdyS8ug81jix1pevDiJ8 YcZA== X-Gm-Message-State: AC+VfDxyb1BejIYWFxSeqRd8lp+sAc0J7u5Iu038um+xiWS6yHutu2on ApnHE/QrNBKj/Ou2S1M/Dbx9r5oVxfB9ziCUk7w= X-Google-Smtp-Source: ACHHUZ6d4ehbGkUOlK4xHgqtEh9YG6Z1t9rHGr6CbaVatR/Woe8u/fii0I+NxjmB1tZ1G2cGOpQzeg== X-Received: by 2002:a05:6808:1983:b0:3a1:e222:97db with SMTP id bj3-20020a056808198300b003a1e22297dbmr6994023oib.30.1687905861650; Tue, 27 Jun 2023 15:44:21 -0700 (PDT) Received: from dread.disaster.area (pa49-186-94-37.pa.vic.optusnet.com.au. [49.186.94.37]) by smtp.gmail.com with ESMTPSA id 29-20020a17090a199d00b0026334010a4fsm520032pji.35.2023.06.27.15.44.18 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 27 Jun 2023 15:44:20 -0700 (PDT) Received: from [192.168.253.23] (helo=devoid.disaster.area) by dread.disaster.area with esmtp (Exim 4.96) (envelope-from ) id 1qEHPz-00GzwV-1C for linux-xfs@vger.kernel.org; Wed, 28 Jun 2023 08:44:15 +1000 Received: from dave by devoid.disaster.area with local (Exim 4.96) (envelope-from ) id 1qEHPz-009ZmB-01 for linux-xfs@vger.kernel.org; Wed, 28 Jun 2023 08:44:15 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 2/8] xfs: use deferred frees for btree block freeing Date: Wed, 28 Jun 2023 08:44:06 +1000 Message-Id: <20230627224412.2242198-3-david@fromorbit.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230627224412.2242198-1-david@fromorbit.com> References: <20230627224412.2242198-1-david@fromorbit.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner Btrees that aren't freespace management trees use the normal extent allocation and freeing routines for their blocks. Hence when a btree block is freed, a direct call to xfs_free_extent() is made and the extent is immediately freed. This puts the entire free space management btrees under this path, so we are stacking btrees on btrees in the call stack. The inobt, finobt and refcount btrees all do this. However, the bmap btree does not do this - it calls xfs_free_extent_later() to defer the extent free operation via an XEFI and hence it gets processed in deferred operation processing during the commit of the primary transaction (i.e. via intent chaining). We need to change xfs_free_extent() to behave in a non-blocking manner so that we can avoid deadlocks with busy extents near ENOSPC in transactions that free multiple extents. Inserting or removing a record from a btree can cause a multi-level tree merge operation and that will free multiple blocks from the btree in a single transaction. i.e. we can call xfs_free_extent() multiple times, and hence the btree manipulation transaction is vulnerable to this busy extent deadlock vector. To fix this, convert all the remaining callers of xfs_free_extent() to use xfs_free_extent_later() to queue XEFIs and hence defer processing of the extent frees to a context that can be safely restarted if a deadlock condition is detected. Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong Reviewed-by: Chandan Babu R --- fs/xfs/libxfs/xfs_ag.c | 2 +- fs/xfs/libxfs/xfs_alloc.c | 4 ++++ fs/xfs/libxfs/xfs_alloc.h | 8 +++++--- fs/xfs/libxfs/xfs_bmap.c | 8 +++++--- fs/xfs/libxfs/xfs_bmap_btree.c | 3 ++- fs/xfs/libxfs/xfs_ialloc.c | 8 ++++---- fs/xfs/libxfs/xfs_ialloc_btree.c | 3 +-- fs/xfs/libxfs/xfs_refcount.c | 9 ++++++--- fs/xfs/libxfs/xfs_refcount_btree.c | 8 +------- fs/xfs/xfs_extfree_item.c | 3 ++- fs/xfs/xfs_reflink.c | 3 ++- 11 files changed, 33 insertions(+), 26 deletions(-) diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c index ee84835ebc66..e9cc481b4ddf 100644 --- a/fs/xfs/libxfs/xfs_ag.c +++ b/fs/xfs/libxfs/xfs_ag.c @@ -985,7 +985,7 @@ xfs_ag_shrink_space( goto resv_err; err2 = __xfs_free_extent_later(*tpp, args.fsbno, delta, NULL, - true); + XFS_AG_RESV_NONE, true); if (err2) goto resv_err; diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index c20fe99405d8..cc3f7b905ea1 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -2449,6 +2449,7 @@ xfs_defer_agfl_block( xefi->xefi_startblock = XFS_AGB_TO_FSB(mp, agno, agbno); xefi->xefi_blockcount = 1; xefi->xefi_owner = oinfo->oi_owner; + xefi->xefi_type = XFS_AG_RESV_AGFL; if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbno(mp, xefi->xefi_startblock))) return -EFSCORRUPTED; @@ -2470,6 +2471,7 @@ __xfs_free_extent_later( xfs_fsblock_t bno, xfs_filblks_t len, const struct xfs_owner_info *oinfo, + enum xfs_ag_resv_type type, bool skip_discard) { struct xfs_extent_free_item *xefi; @@ -2490,6 +2492,7 @@ __xfs_free_extent_later( ASSERT(agbno + len <= mp->m_sb.sb_agblocks); #endif ASSERT(xfs_extfree_item_cache != NULL); + ASSERT(type != XFS_AG_RESV_AGFL); if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbext(mp, bno, len))) return -EFSCORRUPTED; @@ -2498,6 +2501,7 @@ __xfs_free_extent_later( GFP_KERNEL | __GFP_NOFAIL); xefi->xefi_startblock = bno; xefi->xefi_blockcount = (xfs_extlen_t)len; + xefi->xefi_type = type; if (skip_discard) xefi->xefi_flags |= XFS_EFI_SKIP_DISCARD; if (oinfo) { diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h index 85ac470be0da..121faf1e11ad 100644 --- a/fs/xfs/libxfs/xfs_alloc.h +++ b/fs/xfs/libxfs/xfs_alloc.h @@ -232,7 +232,7 @@ xfs_buf_to_agfl_bno( int __xfs_free_extent_later(struct xfs_trans *tp, xfs_fsblock_t bno, xfs_filblks_t len, const struct xfs_owner_info *oinfo, - bool skip_discard); + enum xfs_ag_resv_type type, bool skip_discard); /* * List of extents to be free "later". @@ -245,6 +245,7 @@ struct xfs_extent_free_item { xfs_extlen_t xefi_blockcount;/* number of blocks in extent */ struct xfs_perag *xefi_pag; unsigned int xefi_flags; + enum xfs_ag_resv_type xefi_type; }; void xfs_extent_free_get_group(struct xfs_mount *mp, @@ -259,9 +260,10 @@ xfs_free_extent_later( struct xfs_trans *tp, xfs_fsblock_t bno, xfs_filblks_t len, - const struct xfs_owner_info *oinfo) + const struct xfs_owner_info *oinfo, + enum xfs_ag_resv_type type) { - return __xfs_free_extent_later(tp, bno, len, oinfo, false); + return __xfs_free_extent_later(tp, bno, len, oinfo, type, false); } diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index fef35696adb7..30c931b38853 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -574,7 +574,8 @@ xfs_bmap_btree_to_extents( return error; xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, whichfork); - error = xfs_free_extent_later(cur->bc_tp, cbno, 1, &oinfo); + error = xfs_free_extent_later(cur->bc_tp, cbno, 1, &oinfo, + XFS_AG_RESV_NONE); if (error) return error; @@ -5236,8 +5237,9 @@ xfs_bmap_del_extent_real( } else { error = __xfs_free_extent_later(tp, del->br_startblock, del->br_blockcount, NULL, - (bflags & XFS_BMAPI_NODISCARD) || - del->br_state == XFS_EXT_UNWRITTEN); + XFS_AG_RESV_NONE, + ((bflags & XFS_BMAPI_NODISCARD) || + del->br_state == XFS_EXT_UNWRITTEN)); if (error) goto done; } diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c index 36564ae3084f..bf3f1b36fdd2 100644 --- a/fs/xfs/libxfs/xfs_bmap_btree.c +++ b/fs/xfs/libxfs/xfs_bmap_btree.c @@ -271,7 +271,8 @@ xfs_bmbt_free_block( int error; xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, cur->bc_ino.whichfork); - error = xfs_free_extent_later(cur->bc_tp, fsbno, 1, &oinfo); + error = xfs_free_extent_later(cur->bc_tp, fsbno, 1, &oinfo, + XFS_AG_RESV_NONE); if (error) return error; diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c index 34600f94c2f4..1e5fafbc0cdb 100644 --- a/fs/xfs/libxfs/xfs_ialloc.c +++ b/fs/xfs/libxfs/xfs_ialloc.c @@ -1853,8 +1853,8 @@ xfs_difree_inode_chunk( /* not sparse, calculate extent info directly */ return xfs_free_extent_later(tp, XFS_AGB_TO_FSB(mp, agno, sagbno), - M_IGEO(mp)->ialloc_blks, - &XFS_RMAP_OINFO_INODES); + M_IGEO(mp)->ialloc_blks, &XFS_RMAP_OINFO_INODES, + XFS_AG_RESV_NONE); } /* holemask is only 16-bits (fits in an unsigned long) */ @@ -1899,8 +1899,8 @@ xfs_difree_inode_chunk( ASSERT(agbno % mp->m_sb.sb_spino_align == 0); ASSERT(contigblk % mp->m_sb.sb_spino_align == 0); error = xfs_free_extent_later(tp, - XFS_AGB_TO_FSB(mp, agno, agbno), - contigblk, &XFS_RMAP_OINFO_INODES); + XFS_AGB_TO_FSB(mp, agno, agbno), contigblk, + &XFS_RMAP_OINFO_INODES, XFS_AG_RESV_NONE); if (error) return error; diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c index 5a945ae21b5d..9258f01c0015 100644 --- a/fs/xfs/libxfs/xfs_ialloc_btree.c +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c @@ -160,8 +160,7 @@ __xfs_inobt_free_block( xfs_inobt_mod_blockcount(cur, -1); fsbno = XFS_DADDR_TO_FSB(cur->bc_mp, xfs_buf_daddr(bp)); - return xfs_free_extent(cur->bc_tp, cur->bc_ag.pag, - XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1, + return xfs_free_extent_later(cur->bc_tp, fsbno, 1, &XFS_RMAP_OINFO_INOBT, resv); } diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c index b6e21433925c..70ab113c9cea 100644 --- a/fs/xfs/libxfs/xfs_refcount.c +++ b/fs/xfs/libxfs/xfs_refcount.c @@ -1152,7 +1152,8 @@ xfs_refcount_adjust_extents( cur->bc_ag.pag->pag_agno, tmp.rc_startblock); error = xfs_free_extent_later(cur->bc_tp, fsbno, - tmp.rc_blockcount, NULL); + tmp.rc_blockcount, NULL, + XFS_AG_RESV_NONE); if (error) goto out_error; } @@ -1213,7 +1214,8 @@ xfs_refcount_adjust_extents( cur->bc_ag.pag->pag_agno, ext.rc_startblock); error = xfs_free_extent_later(cur->bc_tp, fsbno, - ext.rc_blockcount, NULL); + ext.rc_blockcount, NULL, + XFS_AG_RESV_NONE); if (error) goto out_error; } @@ -1981,7 +1983,8 @@ xfs_refcount_recover_cow_leftovers( /* Free the block. */ error = xfs_free_extent_later(tp, fsb, - rr->rr_rrec.rc_blockcount, NULL); + rr->rr_rrec.rc_blockcount, NULL, + XFS_AG_RESV_NONE); if (error) goto out_trans; diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c index d4afc5f4e6a5..5c3987d8dc24 100644 --- a/fs/xfs/libxfs/xfs_refcount_btree.c +++ b/fs/xfs/libxfs/xfs_refcount_btree.c @@ -106,19 +106,13 @@ xfs_refcountbt_free_block( struct xfs_buf *agbp = cur->bc_ag.agbp; struct xfs_agf *agf = agbp->b_addr; xfs_fsblock_t fsbno = XFS_DADDR_TO_FSB(mp, xfs_buf_daddr(bp)); - int error; trace_xfs_refcountbt_free_block(cur->bc_mp, cur->bc_ag.pag->pag_agno, XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1); be32_add_cpu(&agf->agf_refcount_blocks, -1); xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_REFCOUNT_BLOCKS); - error = xfs_free_extent(cur->bc_tp, cur->bc_ag.pag, - XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1, + return xfs_free_extent_later(cur->bc_tp, fsbno, 1, &XFS_RMAP_OINFO_REFC, XFS_AG_RESV_METADATA); - if (error) - return error; - - return error; } STATIC int diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c index f9e36b810663..79e65bb6b0a2 100644 --- a/fs/xfs/xfs_extfree_item.c +++ b/fs/xfs/xfs_extfree_item.c @@ -365,7 +365,7 @@ xfs_trans_free_extent( agbno, xefi->xefi_blockcount); error = __xfs_free_extent(tp, xefi->xefi_pag, agbno, - xefi->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE, + xefi->xefi_blockcount, &oinfo, xefi->xefi_type, xefi->xefi_flags & XFS_EFI_SKIP_DISCARD); /* @@ -644,6 +644,7 @@ xfs_efi_item_recover( for (i = 0; i < efip->efi_format.efi_nextents; i++) { struct xfs_extent_free_item fake = { .xefi_owner = XFS_RMAP_OWN_UNKNOWN, + .xefi_type = XFS_AG_RESV_NONE, }; struct xfs_extent *extp; diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c index abcc559f3c64..eb9102453aff 100644 --- a/fs/xfs/xfs_reflink.c +++ b/fs/xfs/xfs_reflink.c @@ -617,7 +617,8 @@ xfs_reflink_cancel_cow_blocks( del.br_blockcount); error = xfs_free_extent_later(*tpp, del.br_startblock, - del.br_blockcount, NULL); + del.br_blockcount, NULL, + XFS_AG_RESV_NONE); if (error) break; From patchwork Tue Jun 27 22:44:07 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 13295073 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id BBFF5EB64D9 for ; Tue, 27 Jun 2023 22:44:30 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229987AbjF0Wo3 (ORCPT ); Tue, 27 Jun 2023 18:44:29 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60366 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229995AbjF0Wo0 (ORCPT ); Tue, 27 Jun 2023 18:44:26 -0400 Received: from mail-pj1-x102f.google.com (mail-pj1-x102f.google.com [IPv6:2607:f8b0:4864:20::102f]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C4EFB26A3 for ; Tue, 27 Jun 2023 15:44:20 -0700 (PDT) Received: by mail-pj1-x102f.google.com with SMTP id 98e67ed59e1d1-2632a72f289so567226a91.2 for ; Tue, 27 Jun 2023 15:44:20 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fromorbit-com.20221208.gappssmtp.com; s=20221208; t=1687905860; x=1690497860; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:to:from:from:to:cc:subject:date:message-id :reply-to; bh=aWhBapU75WqXp3rBQnncpYPP9KljAfbysUq2fskWm48=; b=pIvizUS4zLEGKzIiqdgB/83Lact5bQ5Fyge++x94J8tX2sD7ggX+vjaEJmG9ixMA0s vABHdUaTo7yY91IRYcwgQSrwa6tGKgucaU16fNxrRNGnk/0vLGrbmSJZagbLlXjHz5Lo qFcOdU7+23KTT43z9v6Q/UT4NKvtwYhc9RpIGrFT2wLzK9JAgDCwNWeh3uq1GX1WCEfb tbHRf/aiyQJjokoszosK7o5U20Otcwf8OffzahbK1NEpF6KOZwJnAmMsmp+69J/PfDG8 0quTAY9mmJdol04BNsoW1YOYICWyUhxc/MgURtgyVswN1G7RhbFqxtqxHWaMnd1zEh3I tLNg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1687905860; x=1690497860; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=aWhBapU75WqXp3rBQnncpYPP9KljAfbysUq2fskWm48=; b=XgQ9ABRSMafadhaTZYm0J1LJS9C13DnuTCjY8eNKPb9EhH17kkITxwiK0M1VXbQj0e C4lmY51ZXa0Lh4zBwQUq4P05n+jGtAJpmg6xmGz8nFSIbkdHr9f55PE2v4NKJVMCRVVC flWtCKvfrSQMFeOd9WN/b+u3h1+AWoPPvZgI3EgB34TF/beLxRl4IQq2j77GHG7WRXkT PgwjEelBW5w21yp2eyOMyWrHIosE3q12JmkXg1HogOc18QFCcGFvkTZiVaqHWvQ7zxGq XUTyCIaA35Sh0N9XE9BN7VK9NXuBYStSeTy5bq5azIxA+uuCp7GOh40I/ZQwzMb81v/Y 5HWQ== X-Gm-Message-State: AC+VfDwfCp0hBaIDWRTK54oKyq0HyP/saOoBpJFDufI9RV8nNDlXCUKe +3lEYCRXvB2cAoo2VMkq24KiCxSrP4pu49fzaIA= X-Google-Smtp-Source: ACHHUZ7iVuIG7upCLL//q7OessuDj8ZOwLRAbumCN/soSxUXhgtiWGhngKRBtIOK1bWrnPN6BxHxrA== X-Received: by 2002:a17:90a:898c:b0:256:cf39:afce with SMTP id v12-20020a17090a898c00b00256cf39afcemr22523164pjn.43.1687905860172; Tue, 27 Jun 2023 15:44:20 -0700 (PDT) Received: from dread.disaster.area (pa49-186-94-37.pa.vic.optusnet.com.au. [49.186.94.37]) by smtp.gmail.com with ESMTPSA id v8-20020a17090ad58800b00256a4d59bfasm8326551pju.23.2023.06.27.15.44.17 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 27 Jun 2023 15:44:18 -0700 (PDT) Received: from [192.168.253.23] (helo=devoid.disaster.area) by dread.disaster.area with esmtp (Exim 4.96) (envelope-from ) id 1qEHPz-00GzwY-1J for linux-xfs@vger.kernel.org; Wed, 28 Jun 2023 08:44:15 +1000 Received: from dave by devoid.disaster.area with local (Exim 4.96) (envelope-from ) id 1qEHPz-009ZmG-0B for linux-xfs@vger.kernel.org; Wed, 28 Jun 2023 08:44:15 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 3/8] xfs: pass alloc flags through to xfs_extent_busy_flush() Date: Wed, 28 Jun 2023 08:44:07 +1000 Message-Id: <20230627224412.2242198-4-david@fromorbit.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230627224412.2242198-1-david@fromorbit.com> References: <20230627224412.2242198-1-david@fromorbit.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner To avoid blocking in xfs_extent_busy_flush() when freeing extents and the only busy extents are held by the current transaction, we need to pass the XFS_ALLOC_FLAG_FREEING flag context all the way into xfs_extent_busy_flush(). Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong Reviewed-by: Chandan Babu R --- fs/xfs/libxfs/xfs_alloc.c | 96 +++++++++++++++++++++------------------ fs/xfs/libxfs/xfs_alloc.h | 2 +- fs/xfs/xfs_extent_busy.c | 3 +- fs/xfs/xfs_extent_busy.h | 2 +- 4 files changed, 56 insertions(+), 47 deletions(-) diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index cc3f7b905ea1..ba929f6b5c9b 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -1536,7 +1536,8 @@ xfs_alloc_ag_vextent_lastblock( */ STATIC int xfs_alloc_ag_vextent_near( - struct xfs_alloc_arg *args) + struct xfs_alloc_arg *args, + uint32_t alloc_flags) { struct xfs_alloc_cur acur = {}; int error; /* error code */ @@ -1612,7 +1613,7 @@ xfs_alloc_ag_vextent_near( if (acur.busy) { trace_xfs_alloc_near_busy(args); xfs_extent_busy_flush(args->mp, args->pag, - acur.busy_gen); + acur.busy_gen, alloc_flags); goto restart; } trace_xfs_alloc_size_neither(args); @@ -1635,21 +1636,22 @@ xfs_alloc_ag_vextent_near( * and of the form k * prod + mod unless there's nothing that large. * Return the starting a.g. block, or NULLAGBLOCK if we can't do it. */ -STATIC int /* error */ +static int xfs_alloc_ag_vextent_size( - xfs_alloc_arg_t *args) /* allocation argument structure */ + struct xfs_alloc_arg *args, + uint32_t alloc_flags) { - struct xfs_agf *agf = args->agbp->b_addr; - struct xfs_btree_cur *bno_cur; /* cursor for bno btree */ - struct xfs_btree_cur *cnt_cur; /* cursor for cnt btree */ - int error; /* error result */ - xfs_agblock_t fbno; /* start of found freespace */ - xfs_extlen_t flen; /* length of found freespace */ - int i; /* temp status variable */ - xfs_agblock_t rbno; /* returned block number */ - xfs_extlen_t rlen; /* length of returned extent */ - bool busy; - unsigned busy_gen; + struct xfs_agf *agf = args->agbp->b_addr; + struct xfs_btree_cur *bno_cur; + struct xfs_btree_cur *cnt_cur; + xfs_agblock_t fbno; /* start of found freespace */ + xfs_extlen_t flen; /* length of found freespace */ + xfs_agblock_t rbno; /* returned block number */ + xfs_extlen_t rlen; /* length of returned extent */ + bool busy; + unsigned busy_gen; + int error; + int i; restart: /* @@ -1717,8 +1719,8 @@ xfs_alloc_ag_vextent_size( xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); trace_xfs_alloc_size_busy(args); - xfs_extent_busy_flush(args->mp, - args->pag, busy_gen); + xfs_extent_busy_flush(args->mp, args->pag, + busy_gen, alloc_flags); goto restart; } } @@ -1802,7 +1804,8 @@ xfs_alloc_ag_vextent_size( if (busy) { xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); trace_xfs_alloc_size_busy(args); - xfs_extent_busy_flush(args->mp, args->pag, busy_gen); + xfs_extent_busy_flush(args->mp, args->pag, busy_gen, + alloc_flags); goto restart; } goto out_nominleft; @@ -2572,7 +2575,7 @@ xfs_exact_minlen_extent_available( int /* error */ xfs_alloc_fix_freelist( struct xfs_alloc_arg *args, /* allocation argument structure */ - int flags) /* XFS_ALLOC_FLAG_... */ + uint32_t alloc_flags) { struct xfs_mount *mp = args->mp; struct xfs_perag *pag = args->pag; @@ -2588,7 +2591,7 @@ xfs_alloc_fix_freelist( ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); if (!xfs_perag_initialised_agf(pag)) { - error = xfs_alloc_read_agf(pag, tp, flags, &agbp); + error = xfs_alloc_read_agf(pag, tp, alloc_flags, &agbp); if (error) { /* Couldn't lock the AGF so skip this AG. */ if (error == -EAGAIN) @@ -2604,13 +2607,13 @@ xfs_alloc_fix_freelist( */ if (xfs_perag_prefers_metadata(pag) && (args->datatype & XFS_ALLOC_USERDATA) && - (flags & XFS_ALLOC_FLAG_TRYLOCK)) { - ASSERT(!(flags & XFS_ALLOC_FLAG_FREEING)); + (alloc_flags & XFS_ALLOC_FLAG_TRYLOCK)) { + ASSERT(!(alloc_flags & XFS_ALLOC_FLAG_FREEING)); goto out_agbp_relse; } need = xfs_alloc_min_freelist(mp, pag); - if (!xfs_alloc_space_available(args, need, flags | + if (!xfs_alloc_space_available(args, need, alloc_flags | XFS_ALLOC_FLAG_CHECK)) goto out_agbp_relse; @@ -2619,7 +2622,7 @@ xfs_alloc_fix_freelist( * Can fail if we're not blocking on locks, and it's held. */ if (!agbp) { - error = xfs_alloc_read_agf(pag, tp, flags, &agbp); + error = xfs_alloc_read_agf(pag, tp, alloc_flags, &agbp); if (error) { /* Couldn't lock the AGF so skip this AG. */ if (error == -EAGAIN) @@ -2634,7 +2637,7 @@ xfs_alloc_fix_freelist( /* If there isn't enough total space or single-extent, reject it. */ need = xfs_alloc_min_freelist(mp, pag); - if (!xfs_alloc_space_available(args, need, flags)) + if (!xfs_alloc_space_available(args, need, alloc_flags)) goto out_agbp_relse; #ifdef DEBUG @@ -2672,11 +2675,12 @@ xfs_alloc_fix_freelist( */ memset(&targs, 0, sizeof(targs)); /* struct copy below */ - if (flags & XFS_ALLOC_FLAG_NORMAP) + if (alloc_flags & XFS_ALLOC_FLAG_NORMAP) targs.oinfo = XFS_RMAP_OINFO_SKIP_UPDATE; else targs.oinfo = XFS_RMAP_OINFO_AG; - while (!(flags & XFS_ALLOC_FLAG_NOSHRINK) && pag->pagf_flcount > need) { + while (!(alloc_flags & XFS_ALLOC_FLAG_NOSHRINK) && + pag->pagf_flcount > need) { error = xfs_alloc_get_freelist(pag, tp, agbp, &bno, 0); if (error) goto out_agbp_relse; @@ -2704,7 +2708,7 @@ xfs_alloc_fix_freelist( targs.resv = XFS_AG_RESV_AGFL; /* Allocate as many blocks as possible at once. */ - error = xfs_alloc_ag_vextent_size(&targs); + error = xfs_alloc_ag_vextent_size(&targs, alloc_flags); if (error) goto out_agflbp_relse; @@ -2714,7 +2718,7 @@ xfs_alloc_fix_freelist( * on a completely full ag. */ if (targs.agbno == NULLAGBLOCK) { - if (flags & XFS_ALLOC_FLAG_FREEING) + if (alloc_flags & XFS_ALLOC_FLAG_FREEING) break; goto out_agflbp_relse; } @@ -3230,7 +3234,7 @@ xfs_alloc_vextent_check_args( static int xfs_alloc_vextent_prepare_ag( struct xfs_alloc_arg *args, - uint32_t flags) + uint32_t alloc_flags) { bool need_pag = !args->pag; int error; @@ -3239,7 +3243,7 @@ xfs_alloc_vextent_prepare_ag( args->pag = xfs_perag_get(args->mp, args->agno); args->agbp = NULL; - error = xfs_alloc_fix_freelist(args, flags); + error = xfs_alloc_fix_freelist(args, alloc_flags); if (error) { trace_xfs_alloc_vextent_nofix(args); if (need_pag) @@ -3361,6 +3365,7 @@ xfs_alloc_vextent_this_ag( { struct xfs_mount *mp = args->mp; xfs_agnumber_t minimum_agno; + uint32_t alloc_flags = 0; int error; ASSERT(args->pag != NULL); @@ -3379,9 +3384,9 @@ xfs_alloc_vextent_this_ag( return error; } - error = xfs_alloc_vextent_prepare_ag(args, 0); + error = xfs_alloc_vextent_prepare_ag(args, alloc_flags); if (!error && args->agbp) - error = xfs_alloc_ag_vextent_size(args); + error = xfs_alloc_ag_vextent_size(args, alloc_flags); return xfs_alloc_vextent_finish(args, minimum_agno, error, false); } @@ -3410,20 +3415,20 @@ xfs_alloc_vextent_iterate_ags( xfs_agnumber_t minimum_agno, xfs_agnumber_t start_agno, xfs_agblock_t target_agbno, - uint32_t flags) + uint32_t alloc_flags) { struct xfs_mount *mp = args->mp; xfs_agnumber_t restart_agno = minimum_agno; xfs_agnumber_t agno; int error = 0; - if (flags & XFS_ALLOC_FLAG_TRYLOCK) + if (alloc_flags & XFS_ALLOC_FLAG_TRYLOCK) restart_agno = 0; restart: for_each_perag_wrap_range(mp, start_agno, restart_agno, mp->m_sb.sb_agcount, agno, args->pag) { args->agno = agno; - error = xfs_alloc_vextent_prepare_ag(args, flags); + error = xfs_alloc_vextent_prepare_ag(args, alloc_flags); if (error) break; if (!args->agbp) { @@ -3437,10 +3442,10 @@ xfs_alloc_vextent_iterate_ags( */ if (args->agno == start_agno && target_agbno) { args->agbno = target_agbno; - error = xfs_alloc_ag_vextent_near(args); + error = xfs_alloc_ag_vextent_near(args, alloc_flags); } else { args->agbno = 0; - error = xfs_alloc_ag_vextent_size(args); + error = xfs_alloc_ag_vextent_size(args, alloc_flags); } break; } @@ -3457,8 +3462,8 @@ xfs_alloc_vextent_iterate_ags( * constraining flags by the caller, drop them and retry the allocation * without any constraints being set. */ - if (flags) { - flags = 0; + if (alloc_flags & XFS_ALLOC_FLAG_TRYLOCK) { + alloc_flags &= ~XFS_ALLOC_FLAG_TRYLOCK; restart_agno = minimum_agno; goto restart; } @@ -3486,6 +3491,7 @@ xfs_alloc_vextent_start_ag( xfs_agnumber_t start_agno; xfs_agnumber_t rotorstep = xfs_rotorstep; bool bump_rotor = false; + uint32_t alloc_flags = XFS_ALLOC_FLAG_TRYLOCK; int error; ASSERT(args->pag == NULL); @@ -3512,7 +3518,7 @@ xfs_alloc_vextent_start_ag( start_agno = max(minimum_agno, XFS_FSB_TO_AGNO(mp, target)); error = xfs_alloc_vextent_iterate_ags(args, minimum_agno, start_agno, - XFS_FSB_TO_AGBNO(mp, target), XFS_ALLOC_FLAG_TRYLOCK); + XFS_FSB_TO_AGBNO(mp, target), alloc_flags); if (bump_rotor) { if (args->agno == start_agno) @@ -3539,6 +3545,7 @@ xfs_alloc_vextent_first_ag( struct xfs_mount *mp = args->mp; xfs_agnumber_t minimum_agno; xfs_agnumber_t start_agno; + uint32_t alloc_flags = XFS_ALLOC_FLAG_TRYLOCK; int error; ASSERT(args->pag == NULL); @@ -3557,7 +3564,7 @@ xfs_alloc_vextent_first_ag( start_agno = max(minimum_agno, XFS_FSB_TO_AGNO(mp, target)); error = xfs_alloc_vextent_iterate_ags(args, minimum_agno, start_agno, - XFS_FSB_TO_AGBNO(mp, target), 0); + XFS_FSB_TO_AGBNO(mp, target), alloc_flags); return xfs_alloc_vextent_finish(args, minimum_agno, error, true); } @@ -3610,6 +3617,7 @@ xfs_alloc_vextent_near_bno( struct xfs_mount *mp = args->mp; xfs_agnumber_t minimum_agno; bool needs_perag = args->pag == NULL; + uint32_t alloc_flags = 0; int error; if (!needs_perag) @@ -3630,9 +3638,9 @@ xfs_alloc_vextent_near_bno( if (needs_perag) args->pag = xfs_perag_grab(mp, args->agno); - error = xfs_alloc_vextent_prepare_ag(args, 0); + error = xfs_alloc_vextent_prepare_ag(args, alloc_flags); if (!error && args->agbp) - error = xfs_alloc_ag_vextent_near(args); + error = xfs_alloc_ag_vextent_near(args, alloc_flags); return xfs_alloc_vextent_finish(args, minimum_agno, error, needs_perag); } diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h index 121faf1e11ad..50119ebaede9 100644 --- a/fs/xfs/libxfs/xfs_alloc.h +++ b/fs/xfs/libxfs/xfs_alloc.h @@ -195,7 +195,7 @@ int xfs_alloc_read_agfl(struct xfs_perag *pag, struct xfs_trans *tp, struct xfs_buf **bpp); int xfs_free_agfl_block(struct xfs_trans *, xfs_agnumber_t, xfs_agblock_t, struct xfs_buf *, struct xfs_owner_info *); -int xfs_alloc_fix_freelist(struct xfs_alloc_arg *args, int flags); +int xfs_alloc_fix_freelist(struct xfs_alloc_arg *args, uint32_t alloc_flags); int xfs_free_extent_fix_freelist(struct xfs_trans *tp, struct xfs_perag *pag, struct xfs_buf **agbp); diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c index f3d328e4a440..5f44936eae1c 100644 --- a/fs/xfs/xfs_extent_busy.c +++ b/fs/xfs/xfs_extent_busy.c @@ -571,7 +571,8 @@ void xfs_extent_busy_flush( struct xfs_mount *mp, struct xfs_perag *pag, - unsigned busy_gen) + unsigned busy_gen, + uint32_t alloc_flags) { DEFINE_WAIT (wait); int error; diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h index 4a118131059f..7a82c689bfa4 100644 --- a/fs/xfs/xfs_extent_busy.h +++ b/fs/xfs/xfs_extent_busy.h @@ -53,7 +53,7 @@ xfs_extent_busy_trim(struct xfs_alloc_arg *args, xfs_agblock_t *bno, void xfs_extent_busy_flush(struct xfs_mount *mp, struct xfs_perag *pag, - unsigned busy_gen); + unsigned busy_gen, uint32_t alloc_flags); void xfs_extent_busy_wait_all(struct xfs_mount *mp); From patchwork Tue Jun 27 22:44:08 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 13295069 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id ACED0C001B3 for ; Tue, 27 Jun 2023 22:44:27 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229985AbjF0Wo0 (ORCPT ); Tue, 27 Jun 2023 18:44:26 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60326 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229482AbjF0WoY (ORCPT ); Tue, 27 Jun 2023 18:44:24 -0400 Received: from mail-pf1-x42c.google.com (mail-pf1-x42c.google.com [IPv6:2607:f8b0:4864:20::42c]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EC633211E for ; Tue, 27 Jun 2023 15:44:19 -0700 (PDT) Received: by mail-pf1-x42c.google.com with SMTP id d2e1a72fcca58-6687096c6ddso2791866b3a.0 for ; Tue, 27 Jun 2023 15:44:19 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fromorbit-com.20221208.gappssmtp.com; s=20221208; t=1687905859; x=1690497859; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:to:from:from:to:cc:subject:date:message-id :reply-to; bh=nkNsZl7vYsW79WSqtxpYRKvhd3dwbJ/b3snusgn1D54=; b=Xx8R0bsI8bvoYq/tKAujz1eJiT58NNO1T6yylfJj2R7a9nWdc1dooEN77u468T+DmS 1tSYDR3hOsHSuR0i+RIAvUY3gTwMP3J/Ha53TqeSpv6r05zTScVQNKzuR7x8+NKXTrNK rt+2ZwzZh0WdGLqUCamj+v4Lhp7uimB+jNUXqTh8Et3aOUHgteoI7VkGuLadmcggkPJJ FpcVcKw4TiaIwGRjkGy5SvgDmDgt5qOAS6Cr985w6MChJbiFBhJCsepC4Omn7yY0jOaI ylC9DCFQoDNxoJcpp02SeLUIr0UwY2mGd+9y2pmbCy4hls8PDSNW6ubve0DIH5ZnLblU yKLw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1687905859; x=1690497859; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=nkNsZl7vYsW79WSqtxpYRKvhd3dwbJ/b3snusgn1D54=; b=IhiOKKmn6QhgcJrGfurO9+oOJ9nvoUwAQxXJRf1QYxzuRxxSyEu360YgzUaI0cm7nU 0atEiGIK37Zy3f7waffHE8NfbIWimJA21x5xagoD61T0vvyVM5tyrZqgvTZXMayqSr6J t1WRwmASulMsCQv5xPU7KgcDZnPN0GK7xzHR3A257lypkNSmYa+nmi94IocnAjH+bnHV AAij1mAFiYKoBxtRJ2vy7AG9FPgzQheFiq2NwaJoUr8rFBM6m1A1ZGxdlvcS8AqMqzzi IcgJJ+yN/v1i6sb+LB65ReEjaUEcYZnsVxABAzcvfVGmkVI24+wBeYAiU2ozJYON7fYF N8kQ== X-Gm-Message-State: AC+VfDwO4asSYBCFByfbzTlwBIClFsWmWWF4+gC/fZ4TY6quuxm+CeS7 kIsIQ7IJsMYy3E9j0f/z+Y/i0k+33wx8NfGsY2I= X-Google-Smtp-Source: ACHHUZ5JIsisapbs8/L+mOS3fJ+LSNbgWMZmAnFkv6n76rHWvajXVsnqaHikNEOWjH6oSyxvsRxQow== X-Received: by 2002:a05:6a20:7349:b0:10e:e218:d3f0 with SMTP id v9-20020a056a20734900b0010ee218d3f0mr25799779pzc.54.1687905859304; Tue, 27 Jun 2023 15:44:19 -0700 (PDT) Received: from dread.disaster.area (pa49-186-94-37.pa.vic.optusnet.com.au. [49.186.94.37]) by smtp.gmail.com with ESMTPSA id x21-20020aa793b5000000b00673e652985esm4448967pff.44.2023.06.27.15.44.17 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 27 Jun 2023 15:44:17 -0700 (PDT) Received: from [192.168.253.23] (helo=devoid.disaster.area) by dread.disaster.area with esmtp (Exim 4.96) (envelope-from ) id 1qEHPz-00Gzwb-1R for linux-xfs@vger.kernel.org; Wed, 28 Jun 2023 08:44:15 +1000 Received: from dave by devoid.disaster.area with local (Exim 4.96) (envelope-from ) id 1qEHPz-009ZmL-0L for linux-xfs@vger.kernel.org; Wed, 28 Jun 2023 08:44:15 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 4/8] xfs: allow extent free intents to be retried Date: Wed, 28 Jun 2023 08:44:08 +1000 Message-Id: <20230627224412.2242198-5-david@fromorbit.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230627224412.2242198-1-david@fromorbit.com> References: <20230627224412.2242198-1-david@fromorbit.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner Extent freeing neeeds to be able to avoid a busy extent deadlock when the transaction itself holds the only busy extents in the allocation group. This may occur if we have an EFI that contains multiple extents to be freed, and the freeing the second intent requires the space the first extent free released to expand the AGFL. If we block on the busy extent at this point, we deadlock. We hold a dirty transaction that contains a entire atomic extent free operations within it, so if we can abort the extent free operation and commit the progress that we've made, the busy extent can be resolved by a log force. Hence we can restart the aborted extent free with a new transaction and continue to make progress without risking deadlocks. To enable this, we need the EFI processing code to be able to handle an -EAGAIN error to tell it to commit the current transaction and retry again. This mechanism is already built into the defer ops processing (used bythe refcount btree modification intents), so there's relatively little handling we need to add to the EFI code to enable this. Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong Reviewed-by: Chandan Babu R --- fs/xfs/xfs_extfree_item.c | 73 +++++++++++++++++++++++++++++++++++++-- 1 file changed, 70 insertions(+), 3 deletions(-) diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c index 79e65bb6b0a2..098420cbd4a0 100644 --- a/fs/xfs/xfs_extfree_item.c +++ b/fs/xfs/xfs_extfree_item.c @@ -336,6 +336,34 @@ xfs_trans_get_efd( return efdp; } +/* + * Fill the EFD with all extents from the EFI when we need to roll the + * transaction and continue with a new EFI. + * + * This simply copies all the extents in the EFI to the EFD rather than make + * assumptions about which extents in the EFI have already been processed. We + * currently keep the xefi list in the same order as the EFI extent list, but + * that may not always be the case. Copying everything avoids leaving a landmine + * were we fail to cancel all the extents in an EFI if the xefi list is + * processed in a different order to the extents in the EFI. + */ +static void +xfs_efd_from_efi( + struct xfs_efd_log_item *efdp) +{ + struct xfs_efi_log_item *efip = efdp->efd_efip; + uint i; + + ASSERT(efip->efi_format.efi_nextents > 0); + ASSERT(efdp->efd_next_extent < efip->efi_format.efi_nextents); + + for (i = 0; i < efip->efi_format.efi_nextents; i++) { + efdp->efd_format.efd_extents[i] = + efip->efi_format.efi_extents[i]; + } + efdp->efd_next_extent = efip->efi_format.efi_nextents; +} + /* * Free an extent and log it to the EFD. Note that the transaction is marked * dirty regardless of whether the extent free succeeds or fails to support the @@ -378,6 +406,17 @@ xfs_trans_free_extent( tp->t_flags |= XFS_TRANS_DIRTY | XFS_TRANS_HAS_INTENT_DONE; set_bit(XFS_LI_DIRTY, &efdp->efd_item.li_flags); + /* + * If we need a new transaction to make progress, the caller will log a + * new EFI with the current contents. It will also log an EFD to cancel + * the existing EFI, and so we need to copy all the unprocessed extents + * in this EFI to the EFD so this works correctly. + */ + if (error == -EAGAIN) { + xfs_efd_from_efi(efdp); + return error; + } + next_extent = efdp->efd_next_extent; ASSERT(next_extent < efdp->efd_format.efd_nextents); extp = &(efdp->efd_format.efd_extents[next_extent]); @@ -495,6 +534,13 @@ xfs_extent_free_finish_item( error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi); + /* + * Don't free the XEFI if we need a new transaction to complete + * processing of it. + */ + if (error == -EAGAIN) + return error; + xfs_extent_free_put_group(xefi); kmem_cache_free(xfs_extfree_item_cache, xefi); return error; @@ -620,6 +666,7 @@ xfs_efi_item_recover( struct xfs_trans *tp; int i; int error = 0; + bool requeue_only = false; /* * First check the validity of the extents described by the @@ -653,9 +700,29 @@ xfs_efi_item_recover( fake.xefi_startblock = extp->ext_start; fake.xefi_blockcount = extp->ext_len; - xfs_extent_free_get_group(mp, &fake); - error = xfs_trans_free_extent(tp, efdp, &fake); - xfs_extent_free_put_group(&fake); + if (!requeue_only) { + xfs_extent_free_get_group(mp, &fake); + error = xfs_trans_free_extent(tp, efdp, &fake); + xfs_extent_free_put_group(&fake); + } + + /* + * If we can't free the extent without potentially deadlocking, + * requeue the rest of the extents to a new so that they get + * run again later with a new transaction context. + */ + if (error == -EAGAIN || requeue_only) { + error = xfs_free_extent_later(tp, fake.xefi_startblock, + fake.xefi_blockcount, + &XFS_RMAP_OINFO_ANY_OWNER, + fake.xefi_type); + if (!error) { + requeue_only = true; + error = 0; + continue; + } + }; + if (error == -EFSCORRUPTED) XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, extp, sizeof(*extp)); From patchwork Tue Jun 27 22:44:09 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 13295070 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6E279C001DD for ; Tue, 27 Jun 2023 22:44:28 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229482AbjF0Wo1 (ORCPT ); Tue, 27 Jun 2023 18:44:27 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60336 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229958AbjF0WoZ (ORCPT ); Tue, 27 Jun 2023 18:44:25 -0400 Received: from mail-pf1-x42c.google.com (mail-pf1-x42c.google.com [IPv6:2607:f8b0:4864:20::42c]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 54B6E26A8 for ; Tue, 27 Jun 2023 15:44:20 -0700 (PDT) Received: by mail-pf1-x42c.google.com with SMTP id d2e1a72fcca58-66869feb7d1so2976537b3a.3 for ; Tue, 27 Jun 2023 15:44:20 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fromorbit-com.20221208.gappssmtp.com; s=20221208; t=1687905860; x=1690497860; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:to:from:from:to:cc:subject:date:message-id :reply-to; bh=ZNZF8d+9xx3Te2bu0f2qE+wgSJ+bidn/R8m8MNQ1Va4=; b=0P1PKBs43cbbu+ws5orKebuzlAEBQqx5HPNfjheJ0ScKXfBTifQ2zq5Hl8IOtNf81V t+bsm6GbAisF+6/wggED4BotZ70s1Wa7W/2wL4NQXlmgc26H81cYOSSN6C2dAyC0DS7a 02RHd8Fh1N6qKL10DgUnoHAPLzz7vlf8lVh0kZGS2YCod1pmhq3LBVQGd7HJ/JNbyUmo yv4+L1C2UihlXgYstEZGkjfF4TWbMqnmKeRS/51L70awyda/Iuc7Ven3K6acYJ7IEJxx WJ4VxXFdDjnSmwpS9MdBoQ43kv08cbNuq4Bs/8PKxqDJPUzmFWEmhOjKS+slhSGmzBh4 b+9g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1687905860; x=1690497860; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=ZNZF8d+9xx3Te2bu0f2qE+wgSJ+bidn/R8m8MNQ1Va4=; b=PPYhQc4lmgbVo2Xl2qaavPJwekvdqzO6ol9+Nn2XVynTf938vA3g+OmBI5XCCJ1g5U bBnLkGejtFeoLqgiClpNrFg3D88OfVWWAnTQrLaXU8sjgtPyj9vltKmcDmV5Smn2XUwj MXoh6yMcd4/nyKOr4a5Y9V6FKCTHgdwxCA4vRgF7nQ98ke6AArFXmBLq1Yxy62leO4DL mGbJEWpU9O04AFCTTs/weTkN2rCvbqudub6glyqrM0LQ9qmulLzXZtHesVSZuz3x+02s Dz6tgesO1E6mLEBu2tEj7rcHHKEFIBVwCevWg0EzNSKvSTs1k7B85SubJfJW382q8GOz HegA== X-Gm-Message-State: AC+VfDzLzew6TTYmplcU4WyG6faIqxrara3umCP7tx35NVmAzW6ZWdit s6ihje/vJ6YsJUPsKdwNiWm1aBEIZXKbNhWf4XU= X-Google-Smtp-Source: ACHHUZ7bjhKg4MChEVBZeCN571yh+pQufO190ckxCGSv8jTpHreAHyx5ODh4TXTbReLF3llS6OAxOQ== X-Received: by 2002:a05:6a21:900c:b0:123:2973:672c with SMTP id tq12-20020a056a21900c00b001232973672cmr13340032pzb.57.1687905859691; Tue, 27 Jun 2023 15:44:19 -0700 (PDT) Received: from dread.disaster.area (pa49-186-94-37.pa.vic.optusnet.com.au. [49.186.94.37]) by smtp.gmail.com with ESMTPSA id g23-20020a62e317000000b00679b7d2bd57sm3226647pfh.192.2023.06.27.15.44.17 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 27 Jun 2023 15:44:18 -0700 (PDT) Received: from [192.168.253.23] (helo=devoid.disaster.area) by dread.disaster.area with esmtp (Exim 4.96) (envelope-from ) id 1qEHPz-00Gzwe-1f for linux-xfs@vger.kernel.org; Wed, 28 Jun 2023 08:44:15 +1000 Received: from dave by devoid.disaster.area with local (Exim 4.96) (envelope-from ) id 1qEHPz-009ZmQ-0V for linux-xfs@vger.kernel.org; Wed, 28 Jun 2023 08:44:15 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 5/8] xfs: don't block in busy flushing when freeing extents Date: Wed, 28 Jun 2023 08:44:09 +1000 Message-Id: <20230627224412.2242198-6-david@fromorbit.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230627224412.2242198-1-david@fromorbit.com> References: <20230627224412.2242198-1-david@fromorbit.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner If the current transaction holds a busy extent and we are trying to allocate a new extent to fix up the free list, we can deadlock if the AG is entirely empty except for the busy extent held by the transaction. This can occur at runtime processing an XEFI with multiple extents in this path: __schedule+0x22f at ffffffff81f75e8f schedule+0x46 at ffffffff81f76366 xfs_extent_busy_flush+0x69 at ffffffff81477d99 xfs_alloc_ag_vextent_size+0x16a at ffffffff8141711a xfs_alloc_ag_vextent+0x19b at ffffffff81417edb xfs_alloc_fix_freelist+0x22f at ffffffff8141896f xfs_free_extent_fix_freelist+0x6a at ffffffff8141939a __xfs_free_extent+0x99 at ffffffff81419499 xfs_trans_free_extent+0x3e at ffffffff814a6fee xfs_extent_free_finish_item+0x24 at ffffffff814a70d4 xfs_defer_finish_noroll+0x1f7 at ffffffff81441407 xfs_defer_finish+0x11 at ffffffff814417e1 xfs_itruncate_extents_flags+0x13d at ffffffff8148b7dd xfs_inactive_truncate+0xb9 at ffffffff8148bb89 xfs_inactive+0x227 at ffffffff8148c4f7 xfs_fs_destroy_inode+0xb8 at ffffffff81496898 destroy_inode+0x3b at ffffffff8127d2ab do_unlinkat+0x1d1 at ffffffff81270df1 do_syscall_64+0x40 at ffffffff81f6b5f0 entry_SYSCALL_64_after_hwframe+0x44 at ffffffff8200007c This can also happen in log recovery when processing an EFI with multiple extents through this path: context_switch() kernel/sched/core.c:3881 __schedule() kernel/sched/core.c:5111 schedule() kernel/sched/core.c:5186 xfs_extent_busy_flush() fs/xfs/xfs_extent_busy.c:598 xfs_alloc_ag_vextent_size() fs/xfs/libxfs/xfs_alloc.c:1641 xfs_alloc_ag_vextent() fs/xfs/libxfs/xfs_alloc.c:828 xfs_alloc_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:2362 xfs_free_extent_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:3029 __xfs_free_extent() fs/xfs/libxfs/xfs_alloc.c:3067 xfs_trans_free_extent() fs/xfs/xfs_extfree_item.c:370 xfs_efi_recover() fs/xfs/xfs_extfree_item.c:626 xlog_recover_process_efi() fs/xfs/xfs_log_recover.c:4605 xlog_recover_process_intents() fs/xfs/xfs_log_recover.c:4893 xlog_recover_finish() fs/xfs/xfs_log_recover.c:5824 xfs_log_mount_finish() fs/xfs/xfs_log.c:764 xfs_mountfs() fs/xfs/xfs_mount.c:978 xfs_fs_fill_super() fs/xfs/xfs_super.c:1908 mount_bdev() fs/super.c:1417 xfs_fs_mount() fs/xfs/xfs_super.c:1985 legacy_get_tree() fs/fs_context.c:647 vfs_get_tree() fs/super.c:1547 do_new_mount() fs/namespace.c:2843 do_mount() fs/namespace.c:3163 ksys_mount() fs/namespace.c:3372 __do_sys_mount() fs/namespace.c:3386 __se_sys_mount() fs/namespace.c:3383 __x64_sys_mount() fs/namespace.c:3383 do_syscall_64() arch/x86/entry/common.c:296 entry_SYSCALL_64() arch/x86/entry/entry_64.S:180 To avoid this deadlock, we should not block in xfs_extent_busy_flush() if we hold a busy extent in the current transaction. Now that the EFI processing code can handle requeuing a partially completed EFI, we can detect this situation in xfs_extent_busy_flush() and return -EAGAIN rather than going to sleep forever. The -EAGAIN get propagated back out to the xfs_trans_free_extent() context, where the EFD is populated and the transaction is rolled, thereby moving the busy extents into the CIL. At this point, we can retry the extent free operation again with a clean transaction. If we hit the same "all free extents are busy" situation when trying to fix up the free list, we can safely call xfs_extent_busy_flush() and wait for the busy extents to resolve and wake us. At this point, the allocation search can make progress again and we can fix up the free list. This deadlock was first reported by Chandan in mid-2021, but I couldn't make myself understood during review, and didn't have time to fix it myself. It was reported again in March 2023, and again I have found myself unable to explain the complexities of the solution needed during review. As such, I don't have hours more time to waste trying to get the fix written the way it needs to be written, so I'm just doing it myself. This patchset is largely based on Wengang Wang's last patch, but with all the unnecessary stuff removed, split up into multiple patches and cleaned up somewhat. Reported-by: Chandan Babu R Reported-by: Wengang Wang Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong --- fs/xfs/libxfs/xfs_alloc.c | 68 ++++++++++++++++++++++++++++----------- fs/xfs/libxfs/xfs_alloc.h | 11 ++++--- fs/xfs/xfs_extent_busy.c | 33 ++++++++++++++++--- fs/xfs/xfs_extent_busy.h | 6 ++-- 4 files changed, 88 insertions(+), 30 deletions(-) diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index ba929f6b5c9b..1e72b91daff6 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -1556,6 +1556,8 @@ xfs_alloc_ag_vextent_near( if (args->agbno > args->max_agbno) args->agbno = args->max_agbno; + /* Retry once quickly if we find busy extents before blocking. */ + alloc_flags |= XFS_ALLOC_FLAG_TRYFLUSH; restart: len = 0; @@ -1611,9 +1613,20 @@ xfs_alloc_ag_vextent_near( */ if (!acur.len) { if (acur.busy) { + /* + * Our only valid extents must have been busy. Flush and + * retry the allocation again. If we get an -EAGAIN + * error, we're being told that a deadlock was avoided + * and the current transaction needs committing before + * the allocation can be retried. + */ trace_xfs_alloc_near_busy(args); - xfs_extent_busy_flush(args->mp, args->pag, - acur.busy_gen, alloc_flags); + error = xfs_extent_busy_flush(args->tp, args->pag, + acur.busy_gen, alloc_flags); + if (error) + goto out; + + alloc_flags &= ~XFS_ALLOC_FLAG_TRYFLUSH; goto restart; } trace_xfs_alloc_size_neither(args); @@ -1653,6 +1666,8 @@ xfs_alloc_ag_vextent_size( int error; int i; + /* Retry once quickly if we find busy extents before blocking. */ + alloc_flags |= XFS_ALLOC_FLAG_TRYFLUSH; restart: /* * Allocate and initialize a cursor for the by-size btree. @@ -1710,19 +1725,25 @@ xfs_alloc_ag_vextent_size( error = xfs_btree_increment(cnt_cur, 0, &i); if (error) goto error0; - if (i == 0) { - /* - * Our only valid extents must have been busy. - * Make it unbusy by forcing the log out and - * retrying. - */ - xfs_btree_del_cursor(cnt_cur, - XFS_BTREE_NOERROR); - trace_xfs_alloc_size_busy(args); - xfs_extent_busy_flush(args->mp, args->pag, - busy_gen, alloc_flags); - goto restart; - } + if (i) + continue; + + /* + * Our only valid extents must have been busy. Flush and + * retry the allocation again. If we get an -EAGAIN + * error, we're being told that a deadlock was avoided + * and the current transaction needs committing before + * the allocation can be retried. + */ + trace_xfs_alloc_size_busy(args); + error = xfs_extent_busy_flush(args->tp, args->pag, + busy_gen, alloc_flags); + if (error) + goto error0; + + alloc_flags &= ~XFS_ALLOC_FLAG_TRYFLUSH; + xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); + goto restart; } } @@ -1802,10 +1823,21 @@ xfs_alloc_ag_vextent_size( args->len = rlen; if (rlen < args->minlen) { if (busy) { - xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); + /* + * Our only valid extents must have been busy. Flush and + * retry the allocation again. If we get an -EAGAIN + * error, we're being told that a deadlock was avoided + * and the current transaction needs committing before + * the allocation can be retried. + */ trace_xfs_alloc_size_busy(args); - xfs_extent_busy_flush(args->mp, args->pag, busy_gen, - alloc_flags); + error = xfs_extent_busy_flush(args->tp, args->pag, + busy_gen, alloc_flags); + if (error) + goto error0; + + alloc_flags &= ~XFS_ALLOC_FLAG_TRYFLUSH; + xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); goto restart; } goto out_nominleft; diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h index 50119ebaede9..29e56fdf25e4 100644 --- a/fs/xfs/libxfs/xfs_alloc.h +++ b/fs/xfs/libxfs/xfs_alloc.h @@ -19,11 +19,12 @@ unsigned int xfs_agfl_size(struct xfs_mount *mp); /* * Flags for xfs_alloc_fix_freelist. */ -#define XFS_ALLOC_FLAG_TRYLOCK 0x00000001 /* use trylock for buffer locking */ -#define XFS_ALLOC_FLAG_FREEING 0x00000002 /* indicate caller is freeing extents*/ -#define XFS_ALLOC_FLAG_NORMAP 0x00000004 /* don't modify the rmapbt */ -#define XFS_ALLOC_FLAG_NOSHRINK 0x00000008 /* don't shrink the freelist */ -#define XFS_ALLOC_FLAG_CHECK 0x00000010 /* test only, don't modify args */ +#define XFS_ALLOC_FLAG_TRYLOCK (1U << 0) /* use trylock for buffer locking */ +#define XFS_ALLOC_FLAG_FREEING (1U << 1) /* indicate caller is freeing extents*/ +#define XFS_ALLOC_FLAG_NORMAP (1U << 2) /* don't modify the rmapbt */ +#define XFS_ALLOC_FLAG_NOSHRINK (1U << 3) /* don't shrink the freelist */ +#define XFS_ALLOC_FLAG_CHECK (1U << 4) /* test only, don't modify args */ +#define XFS_ALLOC_FLAG_TRYFLUSH (1U << 5) /* don't wait in busy extent flush */ /* * Argument structure for xfs_alloc routines. diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c index 5f44936eae1c..7c2fdc71e42d 100644 --- a/fs/xfs/xfs_extent_busy.c +++ b/fs/xfs/xfs_extent_busy.c @@ -566,10 +566,21 @@ xfs_extent_busy_clear( /* * Flush out all busy extents for this AG. + * + * If the current transaction is holding busy extents, the caller may not want + * to wait for committed busy extents to resolve. If we are being told just to + * try a flush or progress has been made since we last skipped a busy extent, + * return immediately to allow the caller to try again. + * + * If we are freeing extents, we might actually be holding the only free extents + * in the transaction busy list and the log force won't resolve that situation. + * In this case, we must return -EAGAIN to avoid a deadlock by informing the + * caller it needs to commit the busy extents it holds before retrying the + * extent free operation. */ -void +int xfs_extent_busy_flush( - struct xfs_mount *mp, + struct xfs_trans *tp, struct xfs_perag *pag, unsigned busy_gen, uint32_t alloc_flags) @@ -577,10 +588,23 @@ xfs_extent_busy_flush( DEFINE_WAIT (wait); int error; - error = xfs_log_force(mp, XFS_LOG_SYNC); + error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC); if (error) - return; + return error; + /* Avoid deadlocks on uncommitted busy extents. */ + if (!list_empty(&tp->t_busy)) { + if (alloc_flags & XFS_ALLOC_FLAG_TRYFLUSH) + return 0; + + if (busy_gen != READ_ONCE(pag->pagb_gen)) + return 0; + + if (alloc_flags & XFS_ALLOC_FLAG_FREEING) + return -EAGAIN; + } + + /* Wait for committed busy extents to resolve. */ do { prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE); if (busy_gen != READ_ONCE(pag->pagb_gen)) @@ -589,6 +613,7 @@ xfs_extent_busy_flush( } while (1); finish_wait(&pag->pagb_wait, &wait); + return 0; } void diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h index 7a82c689bfa4..c37bf87e6781 100644 --- a/fs/xfs/xfs_extent_busy.h +++ b/fs/xfs/xfs_extent_busy.h @@ -51,9 +51,9 @@ bool xfs_extent_busy_trim(struct xfs_alloc_arg *args, xfs_agblock_t *bno, xfs_extlen_t *len, unsigned *busy_gen); -void -xfs_extent_busy_flush(struct xfs_mount *mp, struct xfs_perag *pag, - unsigned busy_gen, uint32_t alloc_flags); +int +xfs_extent_busy_flush(struct xfs_trans *tp, struct xfs_perag *pag, + unsigned busy_gen, uint32_t alloc_flags); void xfs_extent_busy_wait_all(struct xfs_mount *mp); From patchwork Tue Jun 27 22:44:10 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 13295075 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E1C8AC001B0 for ; Tue, 27 Jun 2023 22:44:32 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229948AbjF0Wob (ORCPT ); Tue, 27 Jun 2023 18:44:31 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60384 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229982AbjF0Wo2 (ORCPT ); Tue, 27 Jun 2023 18:44:28 -0400 Received: from mail-pf1-x42f.google.com (mail-pf1-x42f.google.com [IPv6:2607:f8b0:4864:20::42f]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C44302947 for ; Tue, 27 Jun 2023 15:44:22 -0700 (PDT) Received: by mail-pf1-x42f.google.com with SMTP id d2e1a72fcca58-668730696a4so2754812b3a.1 for ; Tue, 27 Jun 2023 15:44:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fromorbit-com.20221208.gappssmtp.com; s=20221208; t=1687905862; x=1690497862; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:to:from:from:to:cc:subject:date:message-id :reply-to; bh=4O2k46lv/gI7xyqpvXtBgdwqM0g/QagZozuLf4U9K4c=; b=cKgdrZ43YU4na7peEhU2pWxrT3ycVrcStWKlDD3d2SXyyHQ7Gjh4tFS42r8SJJiH/6 /zUvTh93MiFUtvP/TE9L811XX2NhSCzEH1QuSQcW6uJy/ovkdhFpHhnNnZpcxxQapB7g nX4xeBx+GTklHAdtT6qGawUEtlumdu3EFexBoXxeveA1Q29hzIPGCs51wDWz6hLQZKTv LsUg13Q4jpjrOOBbPTaIWTOWu8DfBrb629f2GL/PMcBHZ2kcoN0YU0Rm4OqoNULWeLzb dtpSPFqCDE9m89SGLt6h2ptDUkOXkYCwHSNI/mPe4eu8ECPrd5NdXithj+aAJUCEwZVh mcbQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1687905862; x=1690497862; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=4O2k46lv/gI7xyqpvXtBgdwqM0g/QagZozuLf4U9K4c=; b=C94+GuXpolvh+I1JHhgM5QYneJbs9P9DJCyRNpZ35wqytRTIndtK2ams9aW2OLsa4w r5uSSxSIysWqhubjtmULJeB1kDDj70s0SvGZdRJiRcbRDvpSeFAfaBiZDEPg9IOew1UK 6NnuhY8gQ7ekT/sAc7tNKCqfHsK2S56nuxqP+Y0TQqAdYzJSqvS7vuHjqp8RRQdCxdAU 8sCprtU8qiLcnHLeY4NnVD1I89kXftuUuPi5+l6w4TpFT3uONuVwDgZYjodOk1AZrWhu JEhCUbMmrJeyHoBmQ37JLOsxnD46jHcZY7FySumPI3m15nDKlv5Cknlp/V0B5CB5jR72 I20A== X-Gm-Message-State: AC+VfDwTH+Twc7AQberqwn65hlUSgxypFgabSy/hE4ZZOUAXIAIyLeAy cNfQ8sexSeHUMQ2aB5ZBLM30DCfWr7joIcWFjrk= X-Google-Smtp-Source: ACHHUZ67g6fUTtjziPJRgLpZIRQA/oKWq6w5y0mzV755YfeWH4uh55MmcDS+M5kxUbWXnjw+4Rwd2w== X-Received: by 2002:a05:6a00:3919:b0:643:aa8d:8cd7 with SMTP id fh25-20020a056a00391900b00643aa8d8cd7mr24682916pfb.32.1687905861960; Tue, 27 Jun 2023 15:44:21 -0700 (PDT) Received: from dread.disaster.area (pa49-186-94-37.pa.vic.optusnet.com.au. [49.186.94.37]) by smtp.gmail.com with ESMTPSA id y11-20020a62b50b000000b0063f0068cf6csm5878864pfe.198.2023.06.27.15.44.18 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 27 Jun 2023 15:44:20 -0700 (PDT) Received: from [192.168.253.23] (helo=devoid.disaster.area) by dread.disaster.area with esmtp (Exim 4.96) (envelope-from ) id 1qEHPz-00Gzwh-1m for linux-xfs@vger.kernel.org; Wed, 28 Jun 2023 08:44:15 +1000 Received: from dave by devoid.disaster.area with local (Exim 4.96) (envelope-from ) id 1qEHPz-009ZmV-0e for linux-xfs@vger.kernel.org; Wed, 28 Jun 2023 08:44:15 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 6/8] xfs: journal geometry is not properly bounds checked Date: Wed, 28 Jun 2023 08:44:10 +1000 Message-Id: <20230627224412.2242198-7-david@fromorbit.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230627224412.2242198-1-david@fromorbit.com> References: <20230627224412.2242198-1-david@fromorbit.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner If the journal geometry results in a sector or log stripe unit validation problem, it indicates that we cannot set the log up to safely write to the the journal. In these cases, we must abort the mount because the corruption needs external intervention to resolve. Similarly, a journal that is too large cannot be written to safely, either, so we shouldn't allow those geometries to mount, either. If the log is too small, we risk having transaction reservations overruning the available log space and the system hanging waiting for space it can never provide. This is purely a runtime hang issue, not a corruption issue as per the first cases listed above. We abort mounts of the log is too small for V5 filesystems, but we must allow v4 filesystems to mount because, historically, there was no log size validity checking and so some systems may still be out there with undersized logs. The problem is that on V4 filesystems, when we discover a log geometry problem, we skip all the remaining checks and then allow the log to continue mounting. This mean that if one of the log size checks fails, we skip the log stripe unit check. i.e. we allow the mount because a "non-fatal" geometry is violated, and then fail to check the hard fail geometries that should fail the mount. Move all these fatal checks to the superblock verifier, and add a new check for the two log sector size geometry variables having the same values. This will prevent any attempt to mount a log that has invalid or inconsistent geometries long before we attempt to mount the log. However, for the minimum log size checks, we can only do that once we've setup up the log and calculated all the iclog sizes and roundoffs. Hence this needs to remain in the log mount code after the log has been initialised. It is also the only case where we should allow a v4 filesystem to continue running, so leave that handling in place, too. Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig Reviewed-by: Darrick J. Wong --- fs/xfs/libxfs/xfs_sb.c | 56 +++++++++++++++++++++++++++++++++++++++++- fs/xfs/xfs_log.c | 47 +++++++++++------------------------ 2 files changed, 70 insertions(+), 33 deletions(-) diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c index ba0f17bc1dc0..5e174685a77c 100644 --- a/fs/xfs/libxfs/xfs_sb.c +++ b/fs/xfs/libxfs/xfs_sb.c @@ -412,7 +412,6 @@ xfs_validate_sb_common( sbp->sb_inodelog < XFS_DINODE_MIN_LOG || sbp->sb_inodelog > XFS_DINODE_MAX_LOG || sbp->sb_inodesize != (1 << sbp->sb_inodelog) || - sbp->sb_logsunit > XLOG_MAX_RECORD_BSIZE || sbp->sb_inopblock != howmany(sbp->sb_blocksize,sbp->sb_inodesize) || XFS_FSB_TO_B(mp, sbp->sb_agblocks) < XFS_MIN_AG_BYTES || XFS_FSB_TO_B(mp, sbp->sb_agblocks) > XFS_MAX_AG_BYTES || @@ -430,6 +429,61 @@ xfs_validate_sb_common( return -EFSCORRUPTED; } + /* + * Logs that are too large are not supported at all. Reject them + * outright. Logs that are too small are tolerated on v4 filesystems, + * but we can only check that when mounting the log. Hence we skip + * those checks here. + */ + if (sbp->sb_logblocks > XFS_MAX_LOG_BLOCKS) { + xfs_notice(mp, + "Log size 0x%x blocks too large, maximum size is 0x%llx blocks", + sbp->sb_logblocks, XFS_MAX_LOG_BLOCKS); + return -EFSCORRUPTED; + } + + if (XFS_FSB_TO_B(mp, sbp->sb_logblocks) > XFS_MAX_LOG_BYTES) { + xfs_warn(mp, + "log size 0x%llx bytes too large, maximum size is 0x%llx bytes", + XFS_FSB_TO_B(mp, sbp->sb_logblocks), + XFS_MAX_LOG_BYTES); + return -EFSCORRUPTED; + } + + /* + * Do not allow filesystems with corrupted log sector or stripe units to + * be mounted. We cannot safely size the iclogs or write to the log if + * the log stripe unit is not valid. + */ + if (sbp->sb_versionnum & XFS_SB_VERSION_SECTORBIT) { + if (sbp->sb_logsectsize != (1U << sbp->sb_logsectlog)) { + xfs_notice(mp, + "log sector size in bytes/log2 (0x%x/0x%x) must match", + sbp->sb_logsectsize, 1U << sbp->sb_logsectlog); + return -EFSCORRUPTED; + } + } else if (sbp->sb_logsectsize || sbp->sb_logsectlog) { + xfs_notice(mp, + "log sector size in bytes/log2 (0x%x/0x%x) are not zero", + sbp->sb_logsectsize, sbp->sb_logsectlog); + return -EFSCORRUPTED; + } + + if (sbp->sb_logsunit > 1) { + if (sbp->sb_logsunit % sbp->sb_blocksize) { + xfs_notice(mp, + "log stripe unit 0x%x bytes must be a multiple of block size", + sbp->sb_logsunit); + return -EFSCORRUPTED; + } + if (sbp->sb_logsunit > XLOG_MAX_RECORD_BSIZE) { + xfs_notice(mp, + "log stripe unit 0x%x bytes over maximum size (0x%x bytes)", + sbp->sb_logsunit, XLOG_MAX_RECORD_BSIZE); + return -EFSCORRUPTED; + } + } + /* Validate the realtime geometry; stolen from xfs_repair */ if (sbp->sb_rextsize * sbp->sb_blocksize > XFS_MAX_RTEXTSIZE || sbp->sb_rextsize * sbp->sb_blocksize < XFS_MIN_RTEXTSIZE) { diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index fc61cc024023..79004d193e54 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -639,7 +639,6 @@ xfs_log_mount( int num_bblks) { struct xlog *log; - bool fatal = xfs_has_crc(mp); int error = 0; int min_logfsbs; @@ -663,53 +662,37 @@ xfs_log_mount( mp->m_log = log; /* - * Validate the given log space and drop a critical message via syslog - * if the log size is too small that would lead to some unexpected - * situations in transaction log space reservation stage. + * Now that we have set up the log and it's internal geometry + * parameters, we can validate the given log space and drop a critical + * message via syslog if the log size is too small. A log that is too + * small can lead to unexpected situations in transaction log space + * reservation stage. The superblock verifier has already validated all + * the other log geometry constraints, so we don't have to check those + * here. * - * Note: we can't just reject the mount if the validation fails. This - * would mean that people would have to downgrade their kernel just to - * remedy the situation as there is no way to grow the log (short of - * black magic surgery with xfs_db). + * Note: For v4 filesystems, we can't just reject the mount if the + * validation fails. This would mean that people would have to + * downgrade their kernel just to remedy the situation as there is no + * way to grow the log (short of black magic surgery with xfs_db). * - * We can, however, reject mounts for CRC format filesystems, as the + * We can, however, reject mounts for V5 format filesystems, as the * mkfs binary being used to make the filesystem should never create a * filesystem with a log that is too small. */ min_logfsbs = xfs_log_calc_minimum_size(mp); - if (mp->m_sb.sb_logblocks < min_logfsbs) { xfs_warn(mp, "Log size %d blocks too small, minimum size is %d blocks", mp->m_sb.sb_logblocks, min_logfsbs); - error = -EINVAL; - } else if (mp->m_sb.sb_logblocks > XFS_MAX_LOG_BLOCKS) { - xfs_warn(mp, - "Log size %d blocks too large, maximum size is %lld blocks", - mp->m_sb.sb_logblocks, XFS_MAX_LOG_BLOCKS); - error = -EINVAL; - } else if (XFS_FSB_TO_B(mp, mp->m_sb.sb_logblocks) > XFS_MAX_LOG_BYTES) { - xfs_warn(mp, - "log size %lld bytes too large, maximum size is %lld bytes", - XFS_FSB_TO_B(mp, mp->m_sb.sb_logblocks), - XFS_MAX_LOG_BYTES); - error = -EINVAL; - } else if (mp->m_sb.sb_logsunit > 1 && - mp->m_sb.sb_logsunit % mp->m_sb.sb_blocksize) { - xfs_warn(mp, - "log stripe unit %u bytes must be a multiple of block size", - mp->m_sb.sb_logsunit); - error = -EINVAL; - fatal = true; - } - if (error) { + /* * Log check errors are always fatal on v5; or whenever bad * metadata leads to a crash. */ - if (fatal) { + if (xfs_has_crc(mp)) { xfs_crit(mp, "AAIEEE! Log failed size checks. Abort!"); ASSERT(0); + error = -EINVAL; goto out_free_log; } xfs_crit(mp, "Log size out of supported range."); From patchwork Tue Jun 27 22:44:11 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 13295071 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 28FFDEB64DC for ; Tue, 27 Jun 2023 22:44:29 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229840AbjF0Wo2 (ORCPT ); Tue, 27 Jun 2023 18:44:28 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60358 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229944AbjF0Wo0 (ORCPT ); Tue, 27 Jun 2023 18:44:26 -0400 Received: from mail-pf1-x432.google.com (mail-pf1-x432.google.com [IPv6:2607:f8b0:4864:20::432]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 217812944 for ; Tue, 27 Jun 2023 15:44:22 -0700 (PDT) Received: by mail-pf1-x432.google.com with SMTP id d2e1a72fcca58-666e3b15370so3359840b3a.0 for ; Tue, 27 Jun 2023 15:44:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fromorbit-com.20221208.gappssmtp.com; s=20221208; t=1687905861; x=1690497861; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:to:from:from:to:cc:subject:date:message-id :reply-to; bh=XgXhhqNbFi+/MFXpR0K7xhzjAOkJwBg8lJjaklWSvcA=; b=itAlwsAQZXNYjX+H/2/t6ZfBld3O31tQQMTJbzz/MpxIyTpNGkKoU7fE+8lfqirKRh Yyi5U8aNwzIvfjYgmg3u992g5uUzuPmvly/0F6GTjik/hu+D8GYk/jXj23/ObH50I0gc si3L0z9MN8cjJ/FhboVLkDnyn/cnC7aApLkv8jBv/nHxQbdqzrUEe/WdsYlgJjD9WkKd ZeVaJ9MFcZ/gB/MsE2CZl/wldBaQIYEFATEwlt2YNQSNPmnVwhhsFxBF+XL+11PafdqY /NbJvLd5h+kjW2ggxclXxOhzbo1Xdr2qGGcsm4Vqtjk/PUWVZEm/C6AndHG5tF5H1YM/ DSXg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1687905861; x=1690497861; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=XgXhhqNbFi+/MFXpR0K7xhzjAOkJwBg8lJjaklWSvcA=; b=AZvZvaxX0UeSWypw1qLR0cziwjpebkV3BOkMHBVbQbaNbh7/cQwYtJIsmNkuY5tspL FZ8KjlfU2qxi+Z0n9arYL9thEETkaap7MdXiRNTiC0D2cXCO4+U7WWkBsedjvpFeTH/h WFps0c9TKkkTREMCxTRh+khO3PjMK6TugeXoU47aQ55iNPLlg/L2lhShBYT9y8+eixxS BMxw5Z9bRnacNM+Y1Fm2NXq0pEK+l9Dg2iixdrYJXvIOm4Yj9fMeM5Ur39uLMfK8I3oA GJFopMgAejwfGTyrD/UjTqpGuAm2WeINj9AiToBXQvYJLExvWBWowpJS4W+pkw7pyzHe CMaA== X-Gm-Message-State: AC+VfDwDIEY3zwzTcrX83JRRxi1XTbLmOmb2INkVJ5n2s0qW4vlttHCq bifWxVtVHZ/wwzkZjrpmqSZ/4C39bdUNBq22yIs= X-Google-Smtp-Source: ACHHUZ49aXbvIzsF9AeTj++Gor4TVgJelxbvNl1FVSZxXSSsfsnLLt8xpYOEGMdev0yrWeHWUQ/nsA== X-Received: by 2002:a05:6a00:99c:b0:673:8dfb:af32 with SMTP id u28-20020a056a00099c00b006738dfbaf32mr8611000pfg.26.1687905861294; Tue, 27 Jun 2023 15:44:21 -0700 (PDT) Received: from dread.disaster.area (pa49-186-94-37.pa.vic.optusnet.com.au. [49.186.94.37]) by smtp.gmail.com with ESMTPSA id p15-20020a63e64f000000b0051b0e564963sm6350434pgj.49.2023.06.27.15.44.18 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 27 Jun 2023 15:44:20 -0700 (PDT) Received: from [192.168.253.23] (helo=devoid.disaster.area) by dread.disaster.area with esmtp (Exim 4.96) (envelope-from ) id 1qEHPz-00Gzwk-1y for linux-xfs@vger.kernel.org; Wed, 28 Jun 2023 08:44:15 +1000 Received: from dave by devoid.disaster.area with local (Exim 4.96) (envelope-from ) id 1qEHPz-009Zma-0o for linux-xfs@vger.kernel.org; Wed, 28 Jun 2023 08:44:15 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 7/8] xfs: AGF length has never been bounds checked Date: Wed, 28 Jun 2023 08:44:11 +1000 Message-Id: <20230627224412.2242198-8-david@fromorbit.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230627224412.2242198-1-david@fromorbit.com> References: <20230627224412.2242198-1-david@fromorbit.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner The AGF verifier does not check that the AGF length field is within known good bounds. This has never been checked by runtime kernel code (i.e. the lack of verification goes back to 1993) yet we assume in many places that it is correct and verify other metdata against it. Add length verification to the AGF verifier. The length of the AGF must be equal to the size of the AG specified in the superblock, unless it is the last AG in the filesystem. In that case, it must be less than or equal to sb->sb_agblocks and greater than XFS_MIN_AG_BLOCKS, which is the smallest AG a growfs operation will allow to exist. This requires a bit of rework of the verifier function. We want to verify metadata before we use it to verify other metadata. Hence we need to verify the AGF sequence numbers before using them to verify the length of the AGF. Then we can verify the AGF length before we verify AGFL fields. Then we can verifier other fields that are bounds limited by the AGF length. And, finally, by calculating agf_length only once into a local variable, we can collapse repeated "if (xfs_has_foo() &&" conditionaly checks into single checks. This makes the code much easier to follow as all the checks for a given feature are obviously in the same place. Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig Reviewed-by: Darrick J. Wong --- fs/xfs/libxfs/xfs_alloc.c | 86 +++++++++++++++++++++++---------------- 1 file changed, 51 insertions(+), 35 deletions(-) diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index 1e72b91daff6..7c86a69354fb 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -2974,6 +2974,7 @@ xfs_agf_verify( { struct xfs_mount *mp = bp->b_mount; struct xfs_agf *agf = bp->b_addr; + uint32_t agf_length = be32_to_cpu(agf->agf_length); if (xfs_has_crc(mp)) { if (!uuid_equal(&agf->agf_uuid, &mp->m_sb.sb_meta_uuid)) @@ -2985,18 +2986,43 @@ xfs_agf_verify( if (!xfs_verify_magic(bp, agf->agf_magicnum)) return __this_address; - if (!(XFS_AGF_GOOD_VERSION(be32_to_cpu(agf->agf_versionnum)) && - be32_to_cpu(agf->agf_freeblks) <= be32_to_cpu(agf->agf_length) && - be32_to_cpu(agf->agf_flfirst) < xfs_agfl_size(mp) && - be32_to_cpu(agf->agf_fllast) < xfs_agfl_size(mp) && - be32_to_cpu(agf->agf_flcount) <= xfs_agfl_size(mp))) + if (!XFS_AGF_GOOD_VERSION(be32_to_cpu(agf->agf_versionnum))) return __this_address; - if (be32_to_cpu(agf->agf_length) > mp->m_sb.sb_dblocks) + /* + * Both agf_seqno and agf_length need to validated before anything else + * block number related in the AGF or AGFL can be checked. + * + * During growfs operations, the perag is not fully initialised, + * so we can't use it for any useful checking. growfs ensures we can't + * use it by using uncached buffers that don't have the perag attached + * so we can detect and avoid this problem. + */ + if (bp->b_pag && be32_to_cpu(agf->agf_seqno) != bp->b_pag->pag_agno) + return __this_address; + + /* + * Only the last AGF in the filesytsem is allowed to be shorter + * than the AG size recorded in the superblock. + */ + if (agf_length != mp->m_sb.sb_agblocks) { + if (be32_to_cpu(agf->agf_seqno) != mp->m_sb.sb_agcount - 1) + return __this_address; + if (agf_length < XFS_MIN_AG_BLOCKS) + return __this_address; + if (agf_length > mp->m_sb.sb_agblocks) + return __this_address; + } + + if (be32_to_cpu(agf->agf_flfirst) >= xfs_agfl_size(mp)) + return __this_address; + if (be32_to_cpu(agf->agf_fllast) >= xfs_agfl_size(mp)) + return __this_address; + if (be32_to_cpu(agf->agf_flcount) > xfs_agfl_size(mp)) return __this_address; if (be32_to_cpu(agf->agf_freeblks) < be32_to_cpu(agf->agf_longest) || - be32_to_cpu(agf->agf_freeblks) > be32_to_cpu(agf->agf_length)) + be32_to_cpu(agf->agf_freeblks) > agf_length) return __this_address; if (be32_to_cpu(agf->agf_levels[XFS_BTNUM_BNO]) < 1 || @@ -3007,38 +3033,28 @@ xfs_agf_verify( mp->m_alloc_maxlevels) return __this_address; - if (xfs_has_rmapbt(mp) && - (be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]) < 1 || - be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]) > - mp->m_rmap_maxlevels)) - return __this_address; - - if (xfs_has_rmapbt(mp) && - be32_to_cpu(agf->agf_rmap_blocks) > be32_to_cpu(agf->agf_length)) - return __this_address; - - /* - * during growfs operations, the perag is not fully initialised, - * so we can't use it for any useful checking. growfs ensures we can't - * use it by using uncached buffers that don't have the perag attached - * so we can detect and avoid this problem. - */ - if (bp->b_pag && be32_to_cpu(agf->agf_seqno) != bp->b_pag->pag_agno) - return __this_address; - if (xfs_has_lazysbcount(mp) && - be32_to_cpu(agf->agf_btreeblks) > be32_to_cpu(agf->agf_length)) + be32_to_cpu(agf->agf_btreeblks) > agf_length) return __this_address; - if (xfs_has_reflink(mp) && - be32_to_cpu(agf->agf_refcount_blocks) > - be32_to_cpu(agf->agf_length)) - return __this_address; + if (xfs_has_rmapbt(mp)) { + if (be32_to_cpu(agf->agf_rmap_blocks) > agf_length) + return __this_address; - if (xfs_has_reflink(mp) && - (be32_to_cpu(agf->agf_refcount_level) < 1 || - be32_to_cpu(agf->agf_refcount_level) > mp->m_refc_maxlevels)) - return __this_address; + if (be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]) < 1 || + be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]) > + mp->m_rmap_maxlevels) + return __this_address; + } + + if (xfs_has_reflink(mp)) { + if (be32_to_cpu(agf->agf_refcount_blocks) > agf_length) + return __this_address; + + if (be32_to_cpu(agf->agf_refcount_level) < 1 || + be32_to_cpu(agf->agf_refcount_level) > mp->m_refc_maxlevels) + return __this_address; + } return NULL; } From patchwork Tue Jun 27 22:44:12 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 13295072 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1A661C001B0 for ; Tue, 27 Jun 2023 22:44:30 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229944AbjF0Wo2 (ORCPT ); Tue, 27 Jun 2023 18:44:28 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60306 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229982AbjF0WoZ (ORCPT ); Tue, 27 Jun 2023 18:44:25 -0400 Received: from mail-pf1-x435.google.com (mail-pf1-x435.google.com [IPv6:2607:f8b0:4864:20::435]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3B99326A4 for ; Tue, 27 Jun 2023 15:44:21 -0700 (PDT) Received: by mail-pf1-x435.google.com with SMTP id d2e1a72fcca58-666eec46206so4589609b3a.3 for ; Tue, 27 Jun 2023 15:44:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fromorbit-com.20221208.gappssmtp.com; s=20221208; t=1687905860; x=1690497860; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:to:from:from:to:cc:subject:date:message-id :reply-to; bh=Y8HfHfWhYbh6r3O4pO/b8MI4sO9R7L0l1wq6+nMOuKU=; b=u4ZJ0WzqPl0HW5OMvFaacz90LZ8ardlFocLk93osQIX6YML5lcMSkNC6BK97kbnUT8 EU7p38p0TfqCSc3/zs59Arc9MlVpZiBZgUUFlkC3dyX0fAc8efe4kwS220BzvD5KmOej FC/HsDLUaJIgk0XiHVg4Ez7O/PZjMO38Csk/FxB4/ggYe+3Zf28qmmJit0mvOjMCSGIG 553iH3uI3heIyAU7Dd+Bwju9TKAhjjOFCZe4C38xVlcyKr7XXchGKgpyDqKd2lWyed1E vX9qvBR14O95/0Asi74Ljhpd0dgcN/dBynlgIx/7cd03biFNl2mW/nM97ai5hoSpj0y5 yE7w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1687905860; x=1690497860; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=Y8HfHfWhYbh6r3O4pO/b8MI4sO9R7L0l1wq6+nMOuKU=; b=U0IqsFJamRwwW6NRvuzcwXnapm/R8TX90sR1iBe6aayydCNvi8KuHU6HShOkUwIYXu +pjk/WHdiFIFME/z3rmgimu33QdaCsITsFll8mZPrTTd3zhHyrz4AXR3YKUgXyK2x8Km ft+iA5KN9VUzM/uHBimBHS1ND8SkqtvVWfXklBmbg5Z0kUx977hMeFlADvlAshJa6ORc D36YPZ/BotuxQPXQsqs/YJy6QQW4hB9+3PEGfEX3AwhP9/wMuQesCI7x/EYQKm+KZqhb I7EPJcF3/MRdA4isdH3SGLZwm0vLLIB7v3Yc6aWHgy+rbectJ7jqjoNtxuG6bd/p6+01 O7aw== X-Gm-Message-State: AC+VfDzz4tt+M37vbsbSZArMUZ4Hzkj4OabM1TAJVWEcdsmQQgmQpcf2 IRvKMioKPPbjf+JJPN/SYcwAl281erzsUQszPAE= X-Google-Smtp-Source: ACHHUZ5xEf5S3eUoKxKZ619s8YgkgDXaGKXpo+RNNC0Jd+6uP1F6w2es7l0XCFnUlzP8C8BfjT/odg== X-Received: by 2002:a05:6a20:138a:b0:129:3c9:ffab with SMTP id hn10-20020a056a20138a00b0012903c9ffabmr4994299pzc.45.1687905860719; Tue, 27 Jun 2023 15:44:20 -0700 (PDT) Received: from dread.disaster.area (pa49-186-94-37.pa.vic.optusnet.com.au. [49.186.94.37]) by smtp.gmail.com with ESMTPSA id h3-20020a635303000000b0051b9e82d6d6sm6195234pgb.40.2023.06.27.15.44.18 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 27 Jun 2023 15:44:20 -0700 (PDT) Received: from [192.168.253.23] (helo=devoid.disaster.area) by dread.disaster.area with esmtp (Exim 4.96) (envelope-from ) id 1qEHPz-00Gzwn-27 for linux-xfs@vger.kernel.org; Wed, 28 Jun 2023 08:44:15 +1000 Received: from dave by devoid.disaster.area with local (Exim 4.96) (envelope-from ) id 1qEHPz-009Zmf-0y for linux-xfs@vger.kernel.org; Wed, 28 Jun 2023 08:44:15 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 8/8] xfs: fix bounds check in xfs_defer_agfl_block() Date: Wed, 28 Jun 2023 08:44:12 +1000 Message-Id: <20230627224412.2242198-9-david@fromorbit.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230627224412.2242198-1-david@fromorbit.com> References: <20230627224412.2242198-1-david@fromorbit.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner Need to happen before we allocate and then leak the xefi. Found by coverity via an xfsprogs libxfs scan. Fixes: 7dfee17b13e5 ("xfs: validate block number being freed before adding to xefi") Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig Reviewed-by: Darrick J. Wong --- fs/xfs/libxfs/xfs_alloc.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index 7c86a69354fb..9919fdfe1d7e 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -2470,25 +2470,26 @@ static int xfs_defer_agfl_block( struct xfs_trans *tp, xfs_agnumber_t agno, - xfs_fsblock_t agbno, + xfs_agblock_t agbno, struct xfs_owner_info *oinfo) { struct xfs_mount *mp = tp->t_mountp; struct xfs_extent_free_item *xefi; + xfs_fsblock_t fsbno = XFS_AGB_TO_FSB(mp, agno, agbno); ASSERT(xfs_extfree_item_cache != NULL); ASSERT(oinfo != NULL); + if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbno(mp, fsbno))) + return -EFSCORRUPTED; + xefi = kmem_cache_zalloc(xfs_extfree_item_cache, GFP_KERNEL | __GFP_NOFAIL); - xefi->xefi_startblock = XFS_AGB_TO_FSB(mp, agno, agbno); + xefi->xefi_startblock = fsbno; xefi->xefi_blockcount = 1; xefi->xefi_owner = oinfo->oi_owner; xefi->xefi_type = XFS_AG_RESV_AGFL; - if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbno(mp, xefi->xefi_startblock))) - return -EFSCORRUPTED; - trace_xfs_agfl_free_defer(mp, agno, 0, agbno, 1); xfs_extent_free_get_group(mp, xefi); From patchwork Thu Jun 29 19:42:30 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Darrick J. Wong" X-Patchwork-Id: 13297252 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 89D83EB64D9 for ; Thu, 29 Jun 2023 19:42:35 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229742AbjF2Tme (ORCPT ); Thu, 29 Jun 2023 15:42:34 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57584 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232066AbjF2Tmd (ORCPT ); Thu, 29 Jun 2023 15:42:33 -0400 Received: from dfw.source.kernel.org (dfw.source.kernel.org [IPv6:2604:1380:4641:c500::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4307C2D4E for ; Thu, 29 Jun 2023 12:42:32 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 711E361615 for ; Thu, 29 Jun 2023 19:42:31 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id CE0BCC433C8; Thu, 29 Jun 2023 19:42:30 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1688067750; bh=IBN+6BTxGjXE1reY/HLR6yMBhAjZqYYLo4fs2n1tal8=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=Zw5w3MqQC0HBnP2iRBpUssn9Nm5OH9HcfOrNBmwRZ4dV8QU3yUL7l/zNpKzKxq3wT 20T15A/Idt8QX8hKzxyBZrS0558fJUQABXJTuowYVEoZGnDLRYpHGtgGH7P33L+Ns4 jLeXqTHm4mIDczpg8+TgBySeJaI52gU4XoSBXpfIQAE2t7hdiCQz5jGI9Lmm9ZMhXb 9vmY3snlbbNNs+L/SbEUIbwVqvNseV+GfHjuwf0N3vXkIxJTvq2J93qx4RWPKMwTqc cm2c3SOhWomaXNp1QFRHmgCmnPv17Lj1Hae0NuUn+BkzULGe3dQs1DrNuReigKdR1L g7yitsrJ9ogjg== Date: Thu, 29 Jun 2023 12:42:30 -0700 From: "Darrick J. Wong" To: Dave Chinner Cc: linux-xfs@vger.kernel.org Subject: [RFC PATCH 9/8] xfs: AGI length should be bounds checked Message-ID: <20230629194230.GH11441@frogsfrogsfrogs> References: <20230627224412.2242198-1-david@fromorbit.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20230627224412.2242198-1-david@fromorbit.com> Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Darrick J. Wong Similar to the recent patch strengthening the AGF agf_length verification, the AGI verifier does not check that the AGI length field is within known good bounds. This isn't currently checked by runtime kernel code, yet we assume in many places that it is correct and verify other metadata against it. Add length verification to the AGI verifier. Just like the AGF length checking, the length of the AGI must be equal to the size of the AG specified in the superblock, unless it is the last AG in the filesystem. In that case, it must be less than or equal to sb->sb_agblocks and greater than XFS_MIN_AG_BLOCKS, which is the smallest AG a growfs operation will allow to exist. There's only one place in the filesystem that actually uses agi_length, but let's not leave it vulnerable to the same weird nonsense that generates syzbot bugs, eh? Signed-off-by: Darrick J. Wong --- NOTE: Untested, but this patch builds... --- fs/xfs/libxfs/xfs_ialloc.c | 49 ++++++++++++++++++++++++++++++++------------ 1 file changed, 36 insertions(+), 13 deletions(-) diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c index 1e5fafbc0cdb..fec6713e1fa9 100644 --- a/fs/xfs/libxfs/xfs_ialloc.c +++ b/fs/xfs/libxfs/xfs_ialloc.c @@ -2486,11 +2486,12 @@ xfs_ialloc_log_agi( static xfs_failaddr_t xfs_agi_verify( - struct xfs_buf *bp) + struct xfs_buf *bp) { - struct xfs_mount *mp = bp->b_mount; - struct xfs_agi *agi = bp->b_addr; - int i; + struct xfs_mount *mp = bp->b_mount; + struct xfs_agi *agi = bp->b_addr; + uint32_t agi_length = be32_to_cpu(agi->agi_length); + int i; if (xfs_has_crc(mp)) { if (!uuid_equal(&agi->agi_uuid, &mp->m_sb.sb_meta_uuid)) @@ -2507,6 +2508,37 @@ xfs_agi_verify( if (!XFS_AGI_GOOD_VERSION(be32_to_cpu(agi->agi_versionnum))) return __this_address; + /* + * Both agi_seqno and agi_length need to validated before anything else + * block number related in the AGI can be checked. + * + * During growfs operations, the perag is not fully initialised, + * so we can't use it for any useful checking. growfs ensures we can't + * use it by using uncached buffers that don't have the perag attached + * so we can detect and avoid this problem. + */ + if (bp->b_pag && be32_to_cpu(agi->agi_seqno) != bp->b_pag->pag_agno) + return __this_address; + + /* + * Only the last AGI in the filesytsem is allowed to be shorter + * than the AG size recorded in the superblock. + */ + if (agi_length != mp->m_sb.sb_agblocks) { + /* + * During growfs, the new last AGI can get here before we + * have updated the superblock. Give it a pass on the seqno + * check. + */ + if (bp->b_pag && + be32_to_cpu(agi->agi_seqno) != mp->m_sb.sb_agcount - 1) + return __this_address; + if (agi_length < XFS_MIN_AG_BLOCKS) + return __this_address; + if (agi_length > mp->m_sb.sb_agblocks) + return __this_address; + } + if (be32_to_cpu(agi->agi_level) < 1 || be32_to_cpu(agi->agi_level) > M_IGEO(mp)->inobt_maxlevels) return __this_address; @@ -2516,15 +2548,6 @@ xfs_agi_verify( be32_to_cpu(agi->agi_free_level) > M_IGEO(mp)->inobt_maxlevels)) return __this_address; - /* - * during growfs operations, the perag is not fully initialised, - * so we can't use it for any useful checking. growfs ensures we can't - * use it by using uncached buffers that don't have the perag attached - * so we can detect and avoid this problem. - */ - if (bp->b_pag && be32_to_cpu(agi->agi_seqno) != bp->b_pag->pag_agno) - return __this_address; - for (i = 0; i < XFS_AGI_UNLINKED_BUCKETS; i++) { if (agi->agi_unlinked[i] == cpu_to_be32(NULLAGINO)) continue;