Message ID | 20221228133204.4021519-1-guoxuenan@huawei.com (mailing list archive) |
---|---|
State | New, archived |
Series | xfs: fix btree splitting failure when AG space about to be exhausted |
Hi Xuenan,

On Wed, Dec 28, 2022 at 09:32:04PM +0800, Guo Xuenan wrote:
> Recently, I noticed a peculiar problem on our products. The disk space
> is sufficient, yet we encounter btree split failures. After looking
> inside the disk, I found that one specific AG's space is about to be
> exhausted. More seriously, in this situation the btree split failure
> is always triggered and the XFS filesystem becomes unavailable.
>
> After analyzing the disk image and the AG, this seems to be the same
> as what Gao Xiang met before [1]. The slight difference is that my
> problem is triggered by creating a new inode. I read through your
> discussion on the mailing list [1], and I think it's probably the same
> root cause.
>
> As Dave pointed out, args->minleft has an *exact* meaning: both inode
> fork allocations and inode chunk extent allocation pre-calculate
> args->minleft to ensure inobt record insertion succeeds in all
> circumstances. But this guarantee doesn't seem to be reliable,
> especially when it happens to meet cnt&bno btree splitting. Gao Xiang
> proposed a solution that adds postalloc to make the current allocation
> reserve more space for inobt splitting. I think that is OK for solving
> their own problem, but it may not solve the problem completely, since
> inode chunk extent allocation may fail during inobt splitting too.
>
> Meanwhile, Gao Xiang also noticed that the stripe alignment
> requirement may increase the probability of the problem, which is
> totally true. I think the reason is that the alignment requirement may
> split one free extent into two. E.g.: we need an extent of length 4
> with alignment 4 and find a suitable free extent [3,10]
> ([start,length]); after this allocation, the leftover extents are
> [3,1] and [8,5]. Therefore, aligned allocation is more likely to
> increase the number of free extents, which may lead to cnt&bno btree
> splitting, which increases the likelihood of the problem.
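To make the arithmetic in the example above concrete, here is a minimal standalone sketch (plain userspace C, not kernel code; `struct extent`, `carve_aligned()` and `free_extents_after()` are hypothetical names made up for illustration). It carves an aligned allocation out of a single free extent in [start,length] notation and counts the free extents left behind. Working the numbers through for [3,10] with length 4 and alignment 4, the allocation lands at [4,4], the head remainder is [3,1] and the tail remainder comes out as [8,5].

```c
#include <assert.h>

/* A free extent in [start, length] form, as in the example above. */
struct extent {
	unsigned int start;
	unsigned int len;
};

/* Result of carving one aligned allocation out of one free extent. */
struct split {
	struct extent alloc;	/* the aligned allocation itself */
	struct extent left;	/* free space before it (len 0 if none) */
	struct extent right;	/* free space after it (len 0 if none) */
};

/* Round x up to the next multiple of align (align must be non-zero). */
static unsigned int round_up_to(unsigned int x, unsigned int align)
{
	return (x + align - 1) / align * align;
}

/*
 * Carve `want` blocks at the first aligned offset inside `avail`.
 * The caller must make sure the aligned range fits in the extent.
 */
static struct split carve_aligned(struct extent avail, unsigned int want,
				  unsigned int align)
{
	struct split s;
	unsigned int astart;
	unsigned int end = avail.start + avail.len;

	assert(align > 0);
	astart = round_up_to(avail.start, align);
	assert(astart + want <= end);

	s.alloc.start = astart;
	s.alloc.len = want;
	s.left.start = avail.start;
	s.left.len = astart - avail.start;
	s.right.start = astart + want;
	s.right.len = end - (astart + want);
	return s;
}

/* Number of non-empty free extents left behind by the split. */
static unsigned int free_extents_after(const struct split *s)
{
	return (s->left.len > 0) + (s->right.len > 0);
}
```

One free extent goes in and two come out, so an allocation like this adds a record to both the bnobt and the cntbt, which is what moves them closer to a split.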
>
> In my opinion, XFS has a variety of btrees. To ensure these btrees
> can grow, XFS uses args->minleft/agfl/perag reservation, which
> correspond as follows:
>
> perag reservation: for reverse map & freeinode & refcount btree
> args->minleft    : for inode btree & inode/attr fork btree
> agfl             : for block btree (bnobt) & count btree (cntbt)
> (rmapbt is an exception: it has a reservation but gets free blocks
> from the agfl. Since agfl blocks are considered free when calculating
> available space, and rmapbt allocates blocks from its reservation,
> *rmapbt growth* doesn't affect the available space calculation, so we
> don't care about it)
>
> Before each allocation, we need to calculate or prepare these
> reservations; more precisely, we call `xfs_alloc_space_available` to
> determine whether there is enough space to complete the current
> allocation, including any tree growth involved. If
> xfs_alloc_space_available returns true, tree growth can definitely
> succeed.
>
> I think the root cause of the current problem is that when AG space is
> about to be exhausted and we happen to encounter cnt&bno btree
> splitting, `xfs_alloc_space_available` doesn't work well.
>
> Because, considering btree splitting during "space allocation", we
> will meet block allocations many times for each "space allocation":
> 1st. allocation for the space required at the beginning, i.e. extent A1.
> 2nd. then we need to *insert* or *update* free extents in cntbt & bnobt,
>      which *may* lead to btree splitting and need allocation (as
>      explained above)
> 3rd. extent A1 needs to be inserted into the inode/attr fork btree or
>      inobt etc., which *may* also lead to splitting and allocation
>
> So, during block allocations, xfs_alloc_space_available will be called
> at least 2 times (the 2nd doesn't call it, because the bno&cnt btrees
> get blocks from the agfl).
> The 1st check of available space guarantees there is enough space to
> complete the 2nd and 3rd allocations, *BUT* after the 2nd allocation,
> if the height of the bno&cnt btrees increases, the min_freelist of the
> agfl will increase (more accurately, xfs_alloc_min_freelist will
> increase), which may cause the 3rd allocation to fail, and a 3rd
> allocation failure will make our xfs filesystem unavailable.
>
> According to the above description, for every space allocation we
> have guaranteed that the agfl min free list is enough for bno&cnt
> btree growth by calling `xfs_alloc_fix_freelist` to reserve enough
> agfl before we do 1st

Personally, I'm not sure this is the right way: I don't think we should
select _this AG_ at all for the 1st allocation (currently it can be
selected by mistake because the necessary reservation for bnobt/cntbt
splits is missing), rather than silently selecting _this AG_, which could
impact another reservation and cause even rarer corner cases for users.

Anyway, I have a low-confidence, unfinished AGFL reservation patchset
that has not been verified (I don't have a real reproducer; I need to ask
Zorro, and I've been down with COVID and a fever since last week..)

From 7aed129bbd23fa2cc67c1818f6e904d681b43858 Mon Sep 17 00:00:00 2001
From: Gao Xiang <hsiangkao@linux.alibaba.com>
Date: Mon, 19 Dec 2022 08:03:52 +0800
Subject: [PATCH 1/2] xfs: add AGFL refilling reservation

Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
---
 fs/xfs/libxfs/xfs_ag.h      |  4 +--
 fs/xfs/libxfs/xfs_ag_resv.c | 52 +++++++++++++++++++++++++++----------
 fs/xfs/libxfs/xfs_ag_resv.h |  3 ++-
 3 files changed, 42 insertions(+), 17 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
index 191b22b9a35b..1e46d3068afe 100644
--- a/fs/xfs/libxfs/xfs_ag.h
+++ b/fs/xfs/libxfs/xfs_ag.h
@@ -61,8 +61,8 @@ struct xfs_perag {

 	/* Blocks reserved for all kinds of metadata. */
 	struct xfs_ag_resv	pag_meta_resv;
-	/* Blocks reserved for the reverse mapping btree. */
-	struct xfs_ag_resv	pag_rmapbt_resv;
+	/* Blocks reserved for the reverse mapping btree and AGFL refilling. */
+	struct xfs_ag_resv	pag_agfl_resv;

 	/* for rcu-safe freeing */
 	struct rcu_head	rcu_head;
diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
index 5af123d13a63..f9bc190dd718 100644
--- a/fs/xfs/libxfs/xfs_ag_resv.c
+++ b/fs/xfs/libxfs/xfs_ag_resv.c
@@ -75,13 +75,13 @@ xfs_ag_resv_critical(

 	switch (type) {
 	case XFS_AG_RESV_METADATA:
-		avail = pag->pagf_freeblks - pag->pag_rmapbt_resv.ar_reserved;
+		avail = pag->pagf_freeblks - pag->pag_agfl_resv.ar_reserved;
 		orig = pag->pag_meta_resv.ar_asked;
 		break;
 	case XFS_AG_RESV_RMAPBT:
 		avail = pag->pagf_freeblks + pag->pagf_flcount -
 			pag->pag_meta_resv.ar_reserved;
-		orig = pag->pag_rmapbt_resv.ar_asked;
+		orig = pag->pag_agfl_resv.ar_asked;
 		break;
 	default:
 		ASSERT(0);
@@ -107,10 +107,11 @@ xfs_ag_resv_needed(
 {
 	xfs_extlen_t	len;

-	len = pag->pag_meta_resv.ar_reserved + pag->pag_rmapbt_resv.ar_reserved;
+	len = pag->pag_meta_resv.ar_reserved + pag->pag_agfl_resv.ar_reserved;
 	switch (type) {
 	case XFS_AG_RESV_METADATA:
 	case XFS_AG_RESV_RMAPBT:
+	case XFS_AG_RESV_AGFL:
 		len -= xfs_perag_resv(pag, type)->ar_reserved;
 		break;
 	case XFS_AG_RESV_NONE:
@@ -145,7 +146,7 @@ __xfs_ag_resv_free(
 	 * considered "free", so whatever was reserved at mount time must be
 	 * given back at umount.
 	 */
-	if (type == XFS_AG_RESV_RMAPBT)
+	if (type == XFS_AG_RESV_RMAPBT || type == XFS_AG_RESV_AGFL)
 		oldresv = resv->ar_orig_reserved;
 	else
 		oldresv = resv->ar_reserved;
@@ -168,7 +169,7 @@ xfs_ag_resv_free(
 	int	error;
 	int	err2;

-	error = __xfs_ag_resv_free(pag, XFS_AG_RESV_RMAPBT);
+	error = __xfs_ag_resv_free(pag, XFS_AG_RESV_AGFL);
 	err2 = __xfs_ag_resv_free(pag, XFS_AG_RESV_METADATA);
 	if (err2 && !error)
 		error = err2;
@@ -191,7 +192,7 @@ __xfs_ag_resv_init(
 		ask = used;

 	switch (type) {
-	case XFS_AG_RESV_RMAPBT:
+	case XFS_AG_RESV_AGFL:
 		/*
 		 * Space taken by the rmapbt is not subtracted from fdblocks
 		 * because the rmapbt lives in the free space. Here we must
@@ -244,6 +245,27 @@ __xfs_ag_resv_init(
 	return 0;
 }

+int
+xfs_agfl_calc_reserves(
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	struct xfs_perag	*pag,
+	xfs_extlen_t		*ask,
+	xfs_extlen_t		*used)
+{
+	xfs_extlen_t		len, allocbtres;
+
+	ASSERT(mp->m_alloc_maxlevels);
+
+	allocbtres = mp->m_alloc_maxlevels;
+	len = 2 * allocbtres *
+		max(mp->m_bm_maxlevels[0], mp->m_bm_maxlevels[1]);
+	len = max(len, allocbtres * M_IGEO(mp)->inobt_maxlevels);
+
+	*ask += len;
+	return 0;
+}
+
 /* Create a per-AG block reservation. */
 int
 xfs_ag_resv_init(
@@ -296,15 +318,19 @@ xfs_ag_resv_init(
 			has_resv = true;
 	}

-	/* Create the RMAPBT metadata reservation */
-	if (pag->pag_rmapbt_resv.ar_asked == 0) {
+	/* Create the RMAPBT metadata and AGFL refilling reservation */
+	if (pag->pag_agfl_resv.ar_asked == 0) {
 		ask = used = 0;

 		error = xfs_rmapbt_calc_reserves(mp, tp, pag, &ask, &used);
 		if (error)
 			goto out;

-		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_RMAPBT, ask, used);
+		error = xfs_agfl_calc_reserves(mp, tp, pag, &ask, &used);
+		if (error)
+			goto out;
+
+		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_AGFL, ask, used);
 		if (error)
 			goto out;
 		if (ask)
@@ -336,7 +362,7 @@ xfs_ag_resv_init(
 		 */
 		if (!error &&
 		    xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved +
-		    xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved >
+		    xfs_perag_resv(pag, XFS_AG_RESV_AGFL)->ar_reserved >
 		    pag->pagf_freeblks + pag->pagf_flcount)
 			error = -ENOSPC;
 	}
@@ -359,7 +385,6 @@ xfs_ag_resv_alloc_extent(

 	switch (type) {
 	case XFS_AG_RESV_AGFL:
-		return;
 	case XFS_AG_RESV_METADATA:
 	case XFS_AG_RESV_RMAPBT:
 		resv = xfs_perag_resv(pag, type);
@@ -376,7 +401,7 @@ xfs_ag_resv_alloc_extent(

 	len = min_t(xfs_extlen_t, args->len, resv->ar_reserved);
 	resv->ar_reserved -= len;
-	if (type == XFS_AG_RESV_RMAPBT)
+	if (type == XFS_AG_RESV_RMAPBT || type == XFS_AG_RESV_AGFL)
 		return;
 	/* Allocations of reserved blocks only need on-disk sb updates... */
 	xfs_trans_mod_sb(args->tp, XFS_TRANS_SB_RES_FDBLOCKS, -(int64_t)len);
@@ -401,7 +426,6 @@ xfs_ag_resv_free_extent(

 	switch (type) {
 	case XFS_AG_RESV_AGFL:
-		return;
 	case XFS_AG_RESV_METADATA:
 	case XFS_AG_RESV_RMAPBT:
 		resv = xfs_perag_resv(pag, type);
@@ -416,7 +440,7 @@ xfs_ag_resv_free_extent(

 	leftover = min_t(xfs_extlen_t, len, resv->ar_asked - resv->ar_reserved);
 	resv->ar_reserved += leftover;
-	if (type == XFS_AG_RESV_RMAPBT)
+	if (type == XFS_AG_RESV_RMAPBT || type == XFS_AG_RESV_AGFL)
 		return;
 	/* Freeing into the reserved pool only requires on-disk update... */
 	xfs_trans_mod_sb(tp, XFS_TRANS_SB_RES_FDBLOCKS, len);
diff --git a/fs/xfs/libxfs/xfs_ag_resv.h b/fs/xfs/libxfs/xfs_ag_resv.h
index b74b210008ea..e9e963c8aebb 100644
--- a/fs/xfs/libxfs/xfs_ag_resv.h
+++ b/fs/xfs/libxfs/xfs_ag_resv.h
@@ -27,7 +27,8 @@ xfs_perag_resv(
 	case XFS_AG_RESV_METADATA:
 		return &pag->pag_meta_resv;
 	case XFS_AG_RESV_RMAPBT:
-		return &pag->pag_rmapbt_resv;
+	case XFS_AG_RESV_AGFL:
+		return &pag->pag_agfl_resv;
 	default:
 		return NULL;
 	}
-- 
2.24.4

From 88e218d652f62d07a48510f9069a6747fed9ec0a Mon Sep 17 00:00:00 2001
From: Gao Xiang <hsiangkao@linux.alibaba.com>
Date: Thu, 3 Nov 2022 21:10:25 +0800
Subject: [PATCH 2/2] xfs: extend the freelist before available space check

Reported-by: Zirong Lang <zlang@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
---
 fs/xfs/libxfs/xfs_alloc.c | 189 ++++++++++++++++++++++++--------------
 1 file changed, 121 insertions(+), 68 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 989cf341779b..b72106bc9a94 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2362,7 +2362,6 @@ xfs_free_agfl_block(
 	if (error)
 		return error;
 	xfs_trans_binval(tp, bp);
-
 	return 0;
 }

@@ -2583,6 +2582,86 @@ xfs_exact_minlen_extent_available(
 }
 #endif

+/*
+ * The freelist has to be in a good state before the available space check
+ * since multiple allocations could be performed from a single AG and
+ * transaction under certain conditions. For example, A bmbt allocation
+ * request made for inode extent to bmap format conversion after an extent
+ * allocation is expected to be satisfied by the same AG. Such bmap conversion
+ * allocation can fail due to the available space check if allocbt required an
+ * extra btree block from the freelist in the previous allocation but without
+ * making the freelist longer.
+ */
+int
+xfs_fill_agfl(
+	struct xfs_alloc_arg	*args,
+	int			flags,
+	xfs_extlen_t		need,
+	struct xfs_buf		*agbp)
+{
+	struct xfs_trans	*tp = args->tp;
+	struct xfs_perag	*pag = agbp->b_pag;
+	struct xfs_alloc_arg	targs = {
+		.tp		= tp,
+		.mp		= tp->t_mountp,
+		.agbp		= agbp,
+		.agno		= args->agno,
+		.alignment	= 1,
+		.minlen		= 1,
+		.prod		= 1,
+		.type		= XFS_ALLOCTYPE_THIS_AG,
+		.pag		= pag,
+	};
+	struct xfs_buf		*agflbp = NULL;
+	xfs_agblock_t		bno;
+	int			error;
+
+	if (flags & XFS_ALLOC_FLAG_NORMAP)
+		targs.oinfo = XFS_RMAP_OINFO_SKIP_UPDATE;
+	else
+		targs.oinfo = XFS_RMAP_OINFO_AG;
+
+	error = xfs_alloc_read_agfl(pag, tp, &agflbp);
+	if (error)
+		return error;
+
+	/* Make the freelist longer if it's too short. */
+	while (pag->pagf_flcount < need) {
+		targs.agbno = 0;
+		targs.maxlen = need - pag->pagf_flcount;
+		targs.resv = XFS_AG_RESV_AGFL;
+
+		/* Allocate as many blocks as possible at once. */
+		error = xfs_alloc_ag_vextent(&targs);
+		if (error)
+			goto out_agflbp_relse;
+
+		/*
+		 * Stop if we run out. Won't happen if callers are obeying
+		 * the restrictions correctly. Can happen for free calls
+		 * on a completely full ag.
+		 */
+		if (targs.agbno == NULLAGBLOCK) {
+			if (flags & XFS_ALLOC_FLAG_FREEING)
+				break;
+			goto out_agflbp_relse;
+		}
+
+		/*
+		 * Put each allocated block on the list.
+		 */
+		for (bno = targs.agbno; bno < targs.agbno + targs.len; bno++) {
+			error = xfs_alloc_put_freelist(pag, tp, agbp,
+							agflbp, bno, 0);
+			if (error)
+				goto out_agflbp_relse;
+		}
+	}
+out_agflbp_relse:
+	xfs_trans_brelse(tp, agflbp);
+	return error;
+}
+
 /*
  * Decide whether to use this allocation group for this allocation.
  * If so, fix up the btree freelist's size.
@@ -2596,8 +2675,7 @@ xfs_alloc_fix_freelist(
 	struct xfs_perag	*pag = args->pag;
 	struct xfs_trans	*tp = args->tp;
 	struct xfs_buf		*agbp = NULL;
-	struct xfs_buf		*agflbp = NULL;
-	struct xfs_alloc_arg	targs;	/* local allocation arguments */
+	struct xfs_owner_info	oinfo;
 	xfs_agblock_t		bno;	/* freelist block */
 	xfs_extlen_t		need;	/* total blocks needed in freelist */
 	int			error = 0;
@@ -2626,22 +2704,45 @@ xfs_alloc_fix_freelist(
 		goto out_agbp_relse;
 	}

-	need = xfs_alloc_min_freelist(mp, pag);
-	if (!xfs_alloc_space_available(args, need, flags |
-			XFS_ALLOC_FLAG_CHECK))
-		goto out_agbp_relse;
-
 	/*
-	 * Get the a.g. freespace buffer.
-	 * Can fail if we're not blocking on locks, and it's held.
+	 * See the comment above xfs_fill_agfl() for the reason why we need to
+	 * make freelist longer here. Assumed that such case is quite rare, so
+	 * in order to simplify the code, let's take agbp unconditionally.
 	 */
-	if (!agbp) {
-		error = xfs_alloc_read_agf(pag, tp, flags, &agbp);
-		if (error) {
-			/* Couldn't lock the AGF so skip this AG. */
-			if (error == -EAGAIN)
-				error = 0;
-			goto out_no_agbp;
+	need = xfs_alloc_min_freelist(mp, pag);
+	if (pag->pagf_flcount < need) {
+		/*
+		 * Get the a.g. freespace buffer.
+		 * Can fail if we're not blocking on locks, and it's held.
+		 */
+		if (!agbp) {
+			error = xfs_alloc_read_agf(pag, tp, flags, &agbp);
+			if (error) {
+				/* Couldn't lock the AGF so skip this AG. */
+				if (error == -EAGAIN)
+					error = 0;
+				goto out_no_agbp;
+			}
+		}
+
+		need = xfs_alloc_min_freelist(mp, pag);
+		error = xfs_fill_agfl(args, flags, need, agbp);
+		if (error)
+			goto out_agbp_relse;
+	} else {
+		if (!xfs_alloc_space_available(args, need, flags |
+				XFS_ALLOC_FLAG_CHECK))
+			goto out_agbp_relse;
+
+		/* Get the a.g. freespace buffer. */
+		if (!agbp) {
+			error = xfs_alloc_read_agf(pag, tp, flags, &agbp);
+			if (error) {
+				/* Couldn't lock the AGF so skip this AG. */
+				if (error == -EAGAIN)
+					error = 0;
+				goto out_no_agbp;
+			}
 		}
 	}

@@ -2687,69 +2788,21 @@ xfs_alloc_fix_freelist(
 	 * regenerated AGFL, bnobt, and cntbt. See repair/phase5.c and
 	 * repair/rmap.c in xfsprogs for details.
 	 */
-	memset(&targs, 0, sizeof(targs));
-	/* struct copy below */
 	if (flags & XFS_ALLOC_FLAG_NORMAP)
-		targs.oinfo = XFS_RMAP_OINFO_SKIP_UPDATE;
+		oinfo = XFS_RMAP_OINFO_SKIP_UPDATE;
 	else
-		targs.oinfo = XFS_RMAP_OINFO_AG;
+		oinfo = XFS_RMAP_OINFO_AG;
 	while (!(flags & XFS_ALLOC_FLAG_NOSHRINK) && pag->pagf_flcount > need) {
 		error = xfs_alloc_get_freelist(pag, tp, agbp, &bno, 0);
 		if (error)
 			goto out_agbp_relse;

 		/* defer agfl frees */
-		xfs_defer_agfl_block(tp, args->agno, bno, &targs.oinfo);
-	}
-
-	targs.tp = tp;
-	targs.mp = mp;
-	targs.agbp = agbp;
-	targs.agno = args->agno;
-	targs.alignment = targs.minlen = targs.prod = 1;
-	targs.type = XFS_ALLOCTYPE_THIS_AG;
-	targs.pag = pag;
-	error = xfs_alloc_read_agfl(pag, tp, &agflbp);
-	if (error)
-		goto out_agbp_relse;
-
-	/* Make the freelist longer if it's too short. */
-	while (pag->pagf_flcount < need) {
-		targs.agbno = 0;
-		targs.maxlen = need - pag->pagf_flcount;
-		targs.resv = XFS_AG_RESV_AGFL;
-
-		/* Allocate as many blocks as possible at once. */
-		error = xfs_alloc_ag_vextent(&targs);
-		if (error)
-			goto out_agflbp_relse;
-
-		/*
-		 * Stop if we run out. Won't happen if callers are obeying
-		 * the restrictions correctly. Can happen for free calls
-		 * on a completely full ag.
-		 */
-		if (targs.agbno == NULLAGBLOCK) {
-			if (flags & XFS_ALLOC_FLAG_FREEING)
-				break;
-			goto out_agflbp_relse;
-		}
-		/*
-		 * Put each allocated block on the list.
-		 */
-		for (bno = targs.agbno; bno < targs.agbno + targs.len; bno++) {
-			error = xfs_alloc_put_freelist(pag, tp, agbp,
-							agflbp, bno, 0);
-			if (error)
-				goto out_agflbp_relse;
-		}
+		xfs_defer_agfl_block(tp, args->agno, bno, &oinfo);
 	}
-	xfs_trans_brelse(tp, agflbp);
 	args->agbp = agbp;
 	return 0;

-out_agflbp_relse:
-	xfs_trans_brelse(tp, agflbp);
 out_agbp_relse:
 	if (agbp)
 		xfs_trans_brelse(tp, agbp);
-- 
2.24.4
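For readers following the analysis quoted at the top of this mail, the allocation sequence can be sketched as a toy userspace model. This is not the kernel's code: `struct ag_model`, `model_min_freelist()`, `model_space_available()` and `model_split_both()` are hypothetical simplifications. The model ignores the rmapbt, per-AG reservations, alignment and args->total, and it assumes that a split which raises a btree's height by one takes level + 1 blocks from the AGFL. The min-freelist formula (current level + 1 per free space btree, capped at the maximum height) is meant to mirror xfs_alloc_min_freelist() and should be checked against the source.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified view of one AG's free space state. */
struct ag_model {
	unsigned int freeblks;	/* free blocks tracked by bnobt/cntbt */
	unsigned int flcount;	/* blocks currently sitting on the AGFL */
	unsigned int bno_level;	/* current bnobt height */
	unsigned int cnt_level;	/* current cntbt height */
	unsigned int maxlevels;	/* maximum height of either btree */
};

static unsigned int min_u(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* AGFL blocks needed so each free space btree can grow by one level. */
static unsigned int model_min_freelist(const struct ag_model *ag)
{
	return min_u(ag->bno_level + 1, ag->maxlevels) +
	       min_u(ag->cnt_level + 1, ag->maxlevels);
}

/*
 * Simplified space check: after setting aside the minimum freelist and
 * minleft, is there still room for an allocation of len blocks?
 */
static bool model_space_available(const struct ag_model *ag,
				  unsigned int len, unsigned int minleft)
{
	return ag->freeblks + ag->flcount >=
	       model_min_freelist(ag) + minleft + len;
}

/* Model assumption: both btrees split all the way up and gain a level. */
static void model_split_both(struct ag_model *ag)
{
	unsigned int used = (ag->bno_level + 1) + (ag->cnt_level + 1);

	assert(ag->flcount >= used);
	assert(ag->bno_level < ag->maxlevels);
	assert(ag->cnt_level < ag->maxlevels);
	ag->flcount -= used;
	ag->bno_level++;
	ag->cnt_level++;
}
```

With 10 free blocks, 4 AGFL blocks and both btrees at height 1, the minimum freelist is 4, so a check for 8 blocks with minleft 1 passes (14 >= 13). After those 8 blocks are taken, a follow-up one-block check still passes if no split happened (6 >= 5). If both btrees split first, the AGFL is drained to 0 and the minimum freelist rises to 6, so the same follow-up check fails (2 < 7), even though the first check had accounted for that block through minleft.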
Hi Gao,

I love your patch! Yet something to improve:

[auto build test ERROR on xfs-linux/for-next]
[also build test ERROR on linus/master v6.2-rc1 next-20221226]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Gao-Xiang/xfs-add-AGFL-refilling-reservation/20221228-234851
base:   https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git for-next
patch link:    https://lore.kernel.org/r/Y6xk2xwrkdF%2FBoXM%40B-P7TQMD6M-0146.local
patch subject: [PATCH 1/2] xfs: add AGFL refilling reservation
config: x86_64-rhel-8.3-rust
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/6d2b1213e868e4a8ac8031b7ac71abed64aa68fa
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Gao-Xiang/xfs-add-AGFL-refilling-reservation/20221228-234851
        git checkout 6d2b1213e868e4a8ac8031b7ac71abed64aa68fa
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash fs/xfs/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All error/warnings (new ones prefixed by >>):

>> fs/xfs/libxfs/xfs_alloc.c:2596:1: warning: no previous prototype for function 'xfs_fill_agfl' [-Wmissing-prototypes]
   xfs_fill_agfl(
   ^
   fs/xfs/libxfs/xfs_alloc.c:2595:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   int
   ^
   static
   1 warning generated.
--
>> fs/xfs/libxfs/xfs_ag_resv.c:249:1: warning: no previous prototype for function 'xfs_agfl_calc_reserves' [-Wmissing-prototypes]
   xfs_agfl_calc_reserves(
   ^
   fs/xfs/libxfs/xfs_ag_resv.c:248:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   int
   ^
   static
   1 warning generated.
--
>> fs/xfs/scrub/fscounters.c:248:25: error: no member named 'pag_rmapbt_resv' in 'struct xfs_perag'
                   fsc->fdblocks -= pag->pag_rmapbt_resv.ar_orig_reserved;
                                    ~~~  ^
   1 error generated.


vim +248 fs/xfs/scrub/fscounters.c

e147a756ab263f Darrick J. Wong 2021-04-26  194
75efa57d0bf5fc Darrick J. Wong 2019-04-25  195  /*
75efa57d0bf5fc Darrick J. Wong 2019-04-25  196   * Calculate what the global in-core counters ought to be from the incore
75efa57d0bf5fc Darrick J. Wong 2019-04-25  197   * per-AG structure. Callers can compare this to the actual in-core counters
75efa57d0bf5fc Darrick J. Wong 2019-04-25  198   * to estimate by how much both in-core and on-disk counters need to be
75efa57d0bf5fc Darrick J. Wong 2019-04-25  199   * adjusted.
75efa57d0bf5fc Darrick J. Wong 2019-04-25  200   */
75efa57d0bf5fc Darrick J. Wong 2019-04-25  201  STATIC int
75efa57d0bf5fc Darrick J. Wong 2019-04-25  202  xchk_fscount_aggregate_agcounts(
75efa57d0bf5fc Darrick J. Wong 2019-04-25  203  	struct xfs_scrub	*sc,
75efa57d0bf5fc Darrick J. Wong 2019-04-25  204  	struct xchk_fscounters	*fsc)
75efa57d0bf5fc Darrick J. Wong 2019-04-25  205  {
75efa57d0bf5fc Darrick J. Wong 2019-04-25  206  	struct xfs_mount	*mp = sc->mp;
75efa57d0bf5fc Darrick J. Wong 2019-04-25  207  	struct xfs_perag	*pag;
75efa57d0bf5fc Darrick J. Wong 2019-04-25  208  	uint64_t		delayed;
75efa57d0bf5fc Darrick J. Wong 2019-04-25  209  	xfs_agnumber_t		agno;
75efa57d0bf5fc Darrick J. Wong 2019-04-25  210  	int			tries = 8;
8ef34723eff088 Darrick J. Wong 2019-11-05  211  	int			error = 0;
75efa57d0bf5fc Darrick J. Wong 2019-04-25  212
75efa57d0bf5fc Darrick J. Wong 2019-04-25  213  retry:
75efa57d0bf5fc Darrick J. Wong 2019-04-25  214  	fsc->icount = 0;
75efa57d0bf5fc Darrick J. Wong 2019-04-25  215  	fsc->ifree = 0;
75efa57d0bf5fc Darrick J. Wong 2019-04-25  216  	fsc->fdblocks = 0;
75efa57d0bf5fc Darrick J. Wong 2019-04-25  217
f250eedcf7621b Dave Chinner    2021-06-02  218  	for_each_perag(mp, agno, pag) {
f250eedcf7621b Dave Chinner    2021-06-02  219  		if (xchk_should_terminate(sc, &error))
f250eedcf7621b Dave Chinner    2021-06-02  220  			break;
75efa57d0bf5fc Darrick J. Wong 2019-04-25  221
75efa57d0bf5fc Darrick J. Wong 2019-04-25  222  		/* This somehow got unset since the warmup? */
75efa57d0bf5fc Darrick J. Wong 2019-04-25  223  		if (!pag->pagi_init || !pag->pagf_init) {
f250eedcf7621b Dave Chinner    2021-06-02  224  			error = -EFSCORRUPTED;
f250eedcf7621b Dave Chinner    2021-06-02  225  			break;
75efa57d0bf5fc Darrick J. Wong 2019-04-25  226  		}
75efa57d0bf5fc Darrick J. Wong 2019-04-25  227
75efa57d0bf5fc Darrick J. Wong 2019-04-25  228  		/* Count all the inodes */
75efa57d0bf5fc Darrick J. Wong 2019-04-25  229  		fsc->icount += pag->pagi_count;
75efa57d0bf5fc Darrick J. Wong 2019-04-25  230  		fsc->ifree += pag->pagi_freecount;
75efa57d0bf5fc Darrick J. Wong 2019-04-25  231
75efa57d0bf5fc Darrick J. Wong 2019-04-25  232  		/* Add up the free/freelist/bnobt/cntbt blocks */
75efa57d0bf5fc Darrick J. Wong 2019-04-25  233  		fsc->fdblocks += pag->pagf_freeblks;
75efa57d0bf5fc Darrick J. Wong 2019-04-25  234  		fsc->fdblocks += pag->pagf_flcount;
ebd9027d088b3a Dave Chinner    2021-08-18  235  		if (xfs_has_lazysbcount(sc->mp)) {
75efa57d0bf5fc Darrick J. Wong 2019-04-25  236  			fsc->fdblocks += pag->pagf_btreeblks;
e147a756ab263f Darrick J. Wong 2021-04-26  237  		} else {
e147a756ab263f Darrick J. Wong 2021-04-26  238  			error = xchk_fscount_btreeblks(sc, fsc, agno);
f250eedcf7621b Dave Chinner    2021-06-02  239  			if (error)
e147a756ab263f Darrick J. Wong 2021-04-26  240  				break;
e147a756ab263f Darrick J. Wong 2021-04-26  241  		}
75efa57d0bf5fc Darrick J. Wong 2019-04-25  242
75efa57d0bf5fc Darrick J. Wong 2019-04-25  243  		/*
75efa57d0bf5fc Darrick J. Wong 2019-04-25  244  		 * Per-AG reservations are taken out of the incore counters,
75efa57d0bf5fc Darrick J. Wong 2019-04-25  245  		 * so they must be left out of the free blocks computation.
75efa57d0bf5fc Darrick J. Wong 2019-04-25  246  		 */
75efa57d0bf5fc Darrick J. Wong 2019-04-25  247  		fsc->fdblocks -= pag->pag_meta_resv.ar_reserved;
75efa57d0bf5fc Darrick J. Wong 2019-04-25 @248  		fsc->fdblocks -= pag->pag_rmapbt_resv.ar_orig_reserved;
75efa57d0bf5fc Darrick J. Wong 2019-04-25  249
75efa57d0bf5fc Darrick J. Wong 2019-04-25  250  	}
f250eedcf7621b Dave Chinner    2021-06-02  251  	if (pag)
f250eedcf7621b Dave Chinner    2021-06-02  252  		xfs_perag_put(pag);
11f97e68458346 Darrick J. Wong 2022-11-06  253  	if (error) {
11f97e68458346 Darrick J. Wong 2022-11-06  254  		xchk_set_incomplete(sc);
8ef34723eff088 Darrick J. Wong 2019-11-05  255  		return error;
11f97e68458346 Darrick J. Wong 2022-11-06  256  	}
8ef34723eff088 Darrick J. Wong 2019-11-05  257
75efa57d0bf5fc Darrick J. Wong 2019-04-25  258  	/*
75efa57d0bf5fc Darrick J. Wong 2019-04-25  259  	 * The global incore space reservation is taken from the incore
75efa57d0bf5fc Darrick J. Wong 2019-04-25  260  	 * counters, so leave that out of the computation.
75efa57d0bf5fc Darrick J. Wong 2019-04-25  261  	 */
75efa57d0bf5fc Darrick J. Wong 2019-04-25  262  	fsc->fdblocks -= mp->m_resblks_avail;
75efa57d0bf5fc Darrick J. Wong 2019-04-25  263
75efa57d0bf5fc Darrick J. Wong 2019-04-25  264  	/*
75efa57d0bf5fc Darrick J. Wong 2019-04-25  265  	 * Delayed allocation reservations are taken out of the incore counters
75efa57d0bf5fc Darrick J. Wong 2019-04-25  266  	 * but not recorded on disk, so leave them and their indlen blocks out
75efa57d0bf5fc Darrick J. Wong 2019-04-25  267  	 * of the computation.
75efa57d0bf5fc Darrick J. Wong 2019-04-25  268  	 */
75efa57d0bf5fc Darrick J. Wong 2019-04-25  269  	delayed = percpu_counter_sum(&mp->m_delalloc_blks);
75efa57d0bf5fc Darrick J. Wong 2019-04-25  270  	fsc->fdblocks -= delayed;
75efa57d0bf5fc Darrick J. Wong 2019-04-25  271
75efa57d0bf5fc Darrick J. Wong 2019-04-25  272  	trace_xchk_fscounters_calc(mp, fsc->icount, fsc->ifree, fsc->fdblocks,
75efa57d0bf5fc Darrick J. Wong 2019-04-25  273  			delayed);
75efa57d0bf5fc Darrick J. Wong 2019-04-25  274
75efa57d0bf5fc Darrick J. Wong 2019-04-25  275
75efa57d0bf5fc Darrick J. Wong 2019-04-25  276  	/* Bail out if the values we compute are totally nonsense. */
75efa57d0bf5fc Darrick J. Wong 2019-04-25  277  	if (fsc->icount < fsc->icount_min || fsc->icount > fsc->icount_max ||
75efa57d0bf5fc Darrick J. Wong 2019-04-25  278  	    fsc->fdblocks > mp->m_sb.sb_dblocks ||
75efa57d0bf5fc Darrick J. Wong 2019-04-25  279  	    fsc->ifree > fsc->icount_max)
75efa57d0bf5fc Darrick J. Wong 2019-04-25  280  		return -EFSCORRUPTED;
75efa57d0bf5fc Darrick J. Wong 2019-04-25  281
75efa57d0bf5fc Darrick J. Wong 2019-04-25  282  	/*
75efa57d0bf5fc Darrick J. Wong 2019-04-25  283  	 * If ifree > icount then we probably had some perturbation in the
75efa57d0bf5fc Darrick J. Wong 2019-04-25  284  	 * counters while we were calculating things. We'll try a few times
75efa57d0bf5fc Darrick J. Wong 2019-04-25  285  	 * to maintain ifree <= icount before giving up.
75efa57d0bf5fc Darrick J. Wong 2019-04-25  286  	 */
75efa57d0bf5fc Darrick J. Wong 2019-04-25  287  	if (fsc->ifree > fsc->icount) {
75efa57d0bf5fc Darrick J. Wong 2019-04-25  288  		if (tries--)
75efa57d0bf5fc Darrick J. Wong 2019-04-25  289  			goto retry;
75efa57d0bf5fc Darrick J. Wong 2019-04-25  290  		xchk_set_incomplete(sc);
75efa57d0bf5fc Darrick J. Wong 2019-04-25  291  		return 0;
75efa57d0bf5fc Darrick J. Wong 2019-04-25  292  	}
75efa57d0bf5fc Darrick J. Wong 2019-04-25  293
75efa57d0bf5fc Darrick J. Wong 2019-04-25  294  	return 0;
75efa57d0bf5fc Darrick J. Wong 2019-04-25  295  }
75efa57d0bf5fc Darrick J. Wong 2019-04-25  296
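A note on the two -Wmissing-prototypes warnings above: they generally mean that a non-static function is defined without a prior declaration, so either marking it static or declaring it in a header first should silence them. The standalone sketch below shows the prototype route, using the reservation arithmetic from patch 1 as the function body. `agfl_refill_reserve()` and its plain-integer parameters are hypothetical stand-ins for xfs_agfl_calc_reserves() and the mp->m_* fields, chosen only so that the snippet compiles outside the kernel.

```c
#include <assert.h>

/*
 * Prototype first. In the kernel this declaration would live in a
 * header (which one is an assumption here); having it visible before
 * the definition is what -Wmissing-prototypes asks for.
 */
unsigned int agfl_refill_reserve(unsigned int alloc_maxlevels,
				 unsigned int bm_maxlevels_data,
				 unsigned int bm_maxlevels_attr,
				 unsigned int inobt_maxlevels);

/* File-local helper: static, so it needs no separate prototype. */
static unsigned int max_u(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

/*
 * Same arithmetic as the patch: two allocbt worst-case splits per bmbt
 * level, or one per inobt level, whichever is larger.
 */
unsigned int agfl_refill_reserve(unsigned int alloc_maxlevels,
				 unsigned int bm_maxlevels_data,
				 unsigned int bm_maxlevels_attr,
				 unsigned int inobt_maxlevels)
{
	unsigned int len;

	assert(alloc_maxlevels > 0);

	len = 2 * alloc_maxlevels *
		max_u(bm_maxlevels_data, bm_maxlevels_attr);
	return max_u(len, alloc_maxlevels * inobt_maxlevels);
}
```

With made-up maximum heights of 5 (allocbt), 5 and 4 (bmbt data/attr forks) and 3 (inobt), this comes to 50 blocks per AG. The third report is different in kind: fs/xfs/scrub/fscounters.c still refers to pag_rmapbt_resv, which patch 1 renames to pag_agfl_resv, so that user would presumably need the same rename.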
Hi Gao,

I love your patch! Yet something to improve:

[auto build test ERROR on xfs-linux/for-next]
[also build test ERROR on linus/master v6.2-rc1 next-20221226]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Gao-Xiang/xfs-add-AGFL-refilling-reservation/20221228-234851
base:   https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git for-next
patch link:    https://lore.kernel.org/r/Y6xk2xwrkdF%2FBoXM%40B-P7TQMD6M-0146.local
patch subject: [PATCH 1/2] xfs: add AGFL refilling reservation
config: x86_64-rhel-8.3-func
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/6d2b1213e868e4a8ac8031b7ac71abed64aa68fa
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Gao-Xiang/xfs-add-AGFL-refilling-reservation/20221228-234851
        git checkout 6d2b1213e868e4a8ac8031b7ac71abed64aa68fa
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=x86_64 olddefconfig
        make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash fs/xfs/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All error/warnings (new ones prefixed by >>):

>> fs/xfs/libxfs/xfs_alloc.c:2596:1: warning: no previous prototype for 'xfs_fill_agfl' [-Wmissing-prototypes]
    2596 | xfs_fill_agfl(
         | ^~~~~~~~~~~~~
--
>> fs/xfs/libxfs/xfs_ag_resv.c:249:1: warning: no previous prototype for 'xfs_agfl_calc_reserves' [-Wmissing-prototypes]
     249 | xfs_agfl_calc_reserves(
         | ^~~~~~~~~~~~~~~~~~~~~~
--
   fs/xfs/scrub/fscounters.c: In function 'xchk_fscount_aggregate_agcounts':
>> fs/xfs/scrub/fscounters.c:248:39: error: 'struct xfs_perag' has no member named 'pag_rmapbt_resv'; did you mean 'pag_meta_resv'?
     248 |                 fsc->fdblocks -= pag->pag_rmapbt_resv.ar_orig_reserved;
         |                                       ^~~~~~~~~~~~~~~
         |                                       pag_meta_resv
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 989cf341779b..6d9ada93aec3 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2305,7 +2305,7 @@ xfs_alloc_space_available(
 	int			available;
 	xfs_extlen_t		agflcount;
 
-	if (flags & XFS_ALLOC_FLAG_FREEING)
+	if (flags & XFS_ALLOC_FLAG_FREEING || args->minleft == 0)
 		return true;
 
 	reservation = xfs_ag_resv_needed(pag, args->resv);
Recently, I noticed a peculiar problem on our products: the disk has
sufficient space overall, yet we encounter a btree split failure. After
looking inside the disk, I found that the space in one specific AG is about
to be exhausted. More seriously, in this situation the btree split failure
is always triggered and the XFS filesystem becomes unavailable.

After analyzing the disk image and the AG, the problem seems to be the same
one Gao Xiang met before [1]. The slight difference is that my problem is
triggered by creating a new inode. I read through your discussion on the
mailing list [1], and I think it is probably the same root cause.

As Dave pointed out, args->minleft has an *exact* meaning: both inode fork
allocations and inode chunk extent allocations pre-calculate args->minleft
to ensure that inobt record insertion succeeds in all circumstances.
However, this guarantee doesn't seem to be reliable, especially when the
allocation happens to coincide with a cnt&bno btree split. Gao Xiang
proposed a solution that adds postalloc to make the current allocation
reserve more space for inobt splitting. I think that is fine for solving
their own problem, but it may not solve the problem completely, since inode
chunk extent allocation may fail during inobt splitting too.

Meanwhile, Gao Xiang also noticed that the stripe alignment requirement may
increase the probability of the problem, which is totally true. I think the
reason is that an alignment requirement may cause one free extent to be
divided into two. E.g. we need an extent of length 4 with alignment 4 and
find a suitable free extent [3,10] ([start,length]); after this allocation,
the remaining extents are [3,1] and [8,5]. Therefore, aligned allocation is
more likely to increase the number of free extents, which may in turn lead
to cnt&bno btree splitting and so increases the likelihood of the problem.
In my opinion, XFS has a variety of btrees, and in order to ensure that
these btrees can grow, XFS uses args->minleft, the agfl and the perag
reservation, which correspond as follows:

perag reservation: for reverse map & free inode & refcount btrees
args->minleft    : for inode btree & inode/attr fork btrees
agfl             : for block btree (bnobt) & count btree (cntbt)

(rmapbt is an exception: it has a reservation but gets free blocks from the
agfl. Since agfl blocks are considered free when calculating available
space, and rmapbt allocates blocks from its reservation, *rmapbt growth*
doesn't affect the available space calculation, so we don't care about it
here.)

Before each allocation we need to calculate or prepare these reservations;
more precisely, we call `xfs_alloc_space_available` to determine whether
there is enough space to complete the current allocation, including the
tree growth it involves. If xfs_alloc_space_available returns true, tree
growth can definitely succeed.

I think the root cause of the current problem is that when AG space is about
to be exhausted and we happen to encounter a cnt&bno btree split,
`xfs_alloc_space_available` doesn't work well. Because of btree splitting
during "space allocation", we may perform several block allocations for
each "space allocation":

1st. the allocation for the space required at the beginning, i.e. extent A1.
2nd. then we need to *insert* or *update* free extents in cntbt & bnobt,
     which *may* lead to btree splitting and need an allocation (as
     explained above)
3rd. extent A1 needs to be inserted into the inode/attr fork btree or inobt
     etc., which *may* also lead to splitting and allocation

So, during these block allocations, xfs_alloc_space_available is called at
least 2 times (the 2nd doesn't call it, because the bno&cnt btrees get
blocks from the agfl).
The 1st check of available space is supposed to guarantee that there is
enough space to complete the 2nd and 3rd allocations, *BUT* after the 2nd
allocation, if the height of the bno&cnt btrees increases, the min_freelist
of the agfl increases; more accurately, the value returned by
xfs_alloc_min_freelist increases, which may cause the 3rd allocation to
fail, and a 3rd allocation failure makes our XFS filesystem unavailable.

According to the above description, for every space allocation we have
already guaranteed that the agfl min free list is enough for bno&cnt btree
growth by calling `xfs_alloc_fix_freelist` to reserve enough agfl blocks
before we do the 1st allocation, so the 2nd allocation will always succeed.
args->minleft guarantees that the 3rd allocation will succeed, so there is
no need to re-check available space in the 3rd allocation, and
xfs_alloc_space_available should always return true for it.

In summary, since btree alloc_block doesn't need any minleft, and both the
2nd and 3rd allocations are allocations for btrees, just treat these
allocations the same as freeing extents (callers with the
XFS_ALLOC_FLAG_FREEING flag set).

[1] https://lore.kernel.org/linux-xfs/20221109034802.40322-1-hsiangkao@linux.alibaba.com/

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
---
 fs/xfs/libxfs/xfs_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)