Message ID | 20220105071052.GD20464@templeofstupid.com (mailing list archive)
---|---
State | Deferred, archived
Series | xfs_bmap_extents_to_btree allocation warnings
On Tue, Jan 04, 2022 at 11:10:52PM -0800, Krister Johansen wrote:
> Hi,
> I've been running into occasional WARNs related to allocating a block to
> hold the new btree that XFS is attempting to create when calling this
> function. The problem is sporadic -- once every 10-40 days and a
> different system each time. The warning is:
>
> WARNING: CPU: 4 PID: 115756 at fs/xfs/libxfs/xfs_bmap.c:716 xfs_bmap_extents_to_btree+0x3dc/0x610 [xfs]
> RIP: 0010:xfs_bmap_extents_to_btree+0x3dc/0x610 [xfs]
> Call Trace:
>  xfs_bmap_add_extent_hole_real+0x7d9/0x8f0 [xfs]
>  xfs_bmapi_allocate+0x2a8/0x2d0 [xfs]
>  xfs_bmapi_write+0x3a9/0x5f0 [xfs]
>  xfs_iomap_write_direct+0x293/0x3c0 [xfs]
>  xfs_file_iomap_begin+0x4d2/0x5c0 [xfs]
>  iomap_apply+0x68/0x160
>  iomap_dio_rw+0x2c1/0x450
>  xfs_file_dio_aio_write+0x103/0x2e0 [xfs]
>  xfs_file_write_iter+0x99/0xe0 [xfs]
>  new_sync_write+0x125/0x1c0
>  __vfs_write+0x29/0x40
>  vfs_write+0xb9/0x1a0
>  ksys_write+0x67/0xe0
>  __x64_sys_write+0x1a/0x20
>  do_syscall_64+0x57/0x190
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Which is indicative of a multi-extent allocation (i.e. data extent +
BMBT indexing btree block allocations) not selecting an AG that has
enough space for both the data and BMBT blocks to be allocated.

That's likely because args.total was not set correctly before AG
selection was made. This exact sort of bug can be seen in the fix for
recent error injection debug code in commit 6e8bd39d7227 ("xfs:
Initialize xfs_alloc_arg->total correctly when allocating minlen
extents"). You're not running a CONFIG_XFS_DEBUG=y kernel, are you?

In this case:

> The process that's triggering the problem is dd punching a hole into a
> file via direct I/O. It's doing this as part of a watchdog process to
> ensure that the system remains able to issue read and write requests.
> The direct I/O is an attempt to avoid reading/writing cached data from
> this process.
>
> I'm hardly an expert; however, after some digging it appears that the
> direct I/O path for this particular workload is more susceptible to the
> problem because its tp->t_firstblock is always set to a block in an
> existing AG,

Which is always true for BMBT block allocation - we've already allocated
the data extent in the transaction in xfs_bmap_add_extent_hole_real(),
and now we are doing the BMBT block allocation needed to convert the
inline extent list in the inode literal area into external btree block
format.

> while the rest of the I/O on this filesystem goes through
> the page cache and uses the delayed allocation mechanism by default.
> (IOW, t_firstblock is NULLFSBLOCK most of the time.)

This has nothing to do with delayed allocation - the physical allocation
that occurs during writeback will do exactly the same thing as direct IO
allocation....

> It seemed like one reason for keeping the bmap and the inode in the same
> AG might be that with 32-bit block pointers in an inode there wouldn't
> be space to store the AG and the block if the btree were allocated in a
> different AG.

The BMBT is a 64 bit btree, so it can index blocks anywhere in the
filesystem. 32 bit btree indexes are only used for internal references
within an AG, not for user data.

> It also seemed like there were lock order concerns when
> iterating over multiple AGs.

Yes, that's what t_firstblock is used to avoid - AGFs must be locked in
ascending order only.
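A toy sketch of that ordering rule (illustrative only, not XFS code; the
struct and helper below are invented for the example): a transaction that
already holds an AGF may only take AGFs with equal or higher AG numbers,
otherwise two transactions locking in opposite orders can ABBA deadlock.

#include <stdbool.h>
#include <stdio.h>

struct tx_ctx {
	int highest_locked_agno;	/* -1: no AGF locked yet */
};

/* Locking a lower-numbered AGF after a higher one risks an ABBA
 * deadlock against a transaction locking in ascending order. */
static bool may_lock_agf(const struct tx_ctx *tx, int agno)
{
	return tx->highest_locked_agno < 0 ||
	       agno >= tx->highest_locked_agno;
}

int main(void)
{
	struct tx_ctx tx = { .highest_locked_agno = 2 };

	printf("lock AGF 3: %s\n", may_lock_agf(&tx, 3) ? "ok" : "refused");
	printf("lock AGF 1: %s\n", may_lock_agf(&tx, 1) ? "ok" : "refused");
	return 0;
}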
> However, linux is using 64-bit block
> pointers in the inode now and the XFS_ALLOCTYPE_START_BNO case in
> xfs_alloc_vextent() seems to try to ensure that it never considers an AG
> that's less than the agno for the fsbno passed in via args.

Because otherwise allocations ABBA deadlock on AG locking.

> Would something like this:
>
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 4dccd4d90622..5d949ac1ecae 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -664,6 +664,13 @@ xfs_bmap_extents_to_btree(
> 	if (error)
> 		goto out_root_realloc;
> 
> +	if (args.fsbno == NULLFSBLOCK && args.type == XFS_ALLOCTYPE_NEAR_BNO) {
> +		args.type = XFS_ALLOCTYPE_START_BNO;
> +		error = xfs_alloc_vextent(&args);
> +		if (error)
> +			goto out_root_realloc;
> +	}
> +
> 	if (WARN_ON_ONCE(args.fsbno == NULLFSBLOCK)) {
> 		error = -ENOSPC;
> 		goto out_root_realloc;

It might, but it doesn't fix the root cause, which is that we selected
an AG without enough space in it for the entire chain of allocations in
the first place.

> Or this:
>
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 4dccd4d90622..94e4ecb75561 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -647,14 +647,10 @@ xfs_bmap_extents_to_btree(
> 	args.tp = tp;
> 	args.mp = mp;
> 	xfs_rmap_ino_bmbt_owner(&args.oinfo, ip->i_ino, whichfork);
> +	args.type = XFS_ALLOCTYPE_START_BNO;
> 	if (tp->t_firstblock == NULLFSBLOCK) {
> -		args.type = XFS_ALLOCTYPE_START_BNO;
> 		args.fsbno = XFS_INO_TO_FSB(mp, ip->i_ino);
> -	} else if (tp->t_flags & XFS_TRANS_LOWMODE) {
> -		args.type = XFS_ALLOCTYPE_START_BNO;
> -		args.fsbno = tp->t_firstblock;
> 	} else {
> -		args.type = XFS_ALLOCTYPE_NEAR_BNO;
> 		args.fsbno = tp->t_firstblock;
> 	}
> 	args.minlen = args.maxlen = args.prod = 1;

Same again - it hides the symptom, but doesn't address the root cause.

> be a reasonable way to address the WARN? Or does this open a box of
> problems that are obvious to the experienced, but just subtle enough to
> elude the unfamiliar?

No, yes, and yes.

> I ask because these filesystems are pretty busy on a day to day basis
> and the path where t_firstblock is NULLFSBLOCK is never hitting this
> problem. The overall workload is a btree based database. Lots of
> random reads and writes to many files that all live in the same
> directory.

Which indicates that you are likely hitting AG-full conditions more
frequently than not.

> meta-data=/dev/mapper/db-vol isize=512    agcount=32, agsize=228849020 blks
>          =                   sectsz=512   attr=2, projid32bit=1
>          =                   crc=1        finobt=1, sparse=1, rmapbt=0
>          =                   reflink=1

....

> The xfs_db freesp report after the problem occurred. (N.B. it was a few
> hours before I was able to get to this machine to investigate)
>
> xfs_db -r -c 'freesp -a 47 -s' /dev/mapper/db-vol
>    from      to extents  blocks    pct
>       1       1      48      48   0.00
>       2       3     119     303   0.02
>       4       7      46     250   0.01
>       8      15      22     255   0.01
>      16      31      17     374   0.02
>      32      63      16     728   0.04
>      64     127       9     997   0.05
>     128     255     149   34271   1.83
>     256     511       7    2241   0.12
>     512    1023       4    2284   0.12
>    1024    2047       1    1280   0.07
>    2048    4095       1    3452   0.18
> 1048576 2097151       1 1825182  97.52
> total free extents 440
> total free blocks 1871665
> average free extent size 4253.78

So 1,871,665 of 228,849,020 blocks are free in the AG. That's 99.2%
full, so it's extremely likely you are hitting a full AG condition.

/me goes and looks at xfs_iomap_write_direct()....

.... and notices that it passes "0" as the total allocation block
count, which means it isn't reserving space in the AG for both the
data extent and the BMBT blocks...

... and several other xfs_bmapi_write() callers have the same issue...

Ok, let me spend a bit more time looking into this, but it looks like
the problem is at the xfs_bmapi_write() caller level, not deep in the
allocator itself.

Cheers,

Dave.
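To make the accounting failure concrete, here is a minimal sketch (not
the kernel code; all names below are invented for illustration) of why
the AG-selection check has to be told about the whole allocation chain
up front:

#include <stdbool.h>
#include <stdio.h>

/* AG qualifies only if it can hold the entire allocation chain. */
static bool ag_has_space(unsigned long freeblks, unsigned long total)
{
	return freeblks >= total;
}

int main(void)
{
	unsigned long freeblks = 4;	/* a nearly-full AG */
	unsigned long data = 4, bmbt = 1;

	/* With total == data blocks only, the AG is accepted... */
	printf("data only:   %d\n", ag_has_space(freeblks, data));
	/* ...but the extents-to-btree conversion then finds no space,
	 * which is the WARN being reported. */
	printf("data + bmbt: %d\n", ag_has_space(freeblks, data + bmbt));
	return 0;
}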
Hi David,

Thanks for the response.

On Thu, Jan 06, 2022 at 12:01:23PM +1100, Dave Chinner wrote:
> On Tue, Jan 04, 2022 at 11:10:52PM -0800, Krister Johansen wrote:
> > WARNING: CPU: 4 PID: 115756 at fs/xfs/libxfs/xfs_bmap.c:716 xfs_bmap_extents_to_btree+0x3dc/0x610 [xfs]
> > RIP: 0010:xfs_bmap_extents_to_btree+0x3dc/0x610 [xfs]
> > Call Trace:
> >  xfs_bmap_add_extent_hole_real+0x7d9/0x8f0 [xfs]
> >  xfs_bmapi_allocate+0x2a8/0x2d0 [xfs]
> >  xfs_bmapi_write+0x3a9/0x5f0 [xfs]
> >  xfs_iomap_write_direct+0x293/0x3c0 [xfs]
> >  xfs_file_iomap_begin+0x4d2/0x5c0 [xfs]
> >  iomap_apply+0x68/0x160
> >  iomap_dio_rw+0x2c1/0x450
> >  xfs_file_dio_aio_write+0x103/0x2e0 [xfs]
> >  xfs_file_write_iter+0x99/0xe0 [xfs]
> >  new_sync_write+0x125/0x1c0
> >  __vfs_write+0x29/0x40
> >  vfs_write+0xb9/0x1a0
> >  ksys_write+0x67/0xe0
> >  __x64_sys_write+0x1a/0x20
> >  do_syscall_64+0x57/0x190
> >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Which is indicative of a multi-extent allocation (i.e. data extent +
> BMBT indexing btree block allocations) not selecting an AG that has
> enough space for both the data and BMBT blocks to be allocated.
>
> That's likely because args.total was not set correctly before AG
> selection was made. This exact sort of bug can be seen in the fix for
> recent error injection debug code in commit 6e8bd39d7227 ("xfs:
> Initialize xfs_alloc_arg->total correctly when allocating minlen
> extents"). You're not running a CONFIG_XFS_DEBUG=y kernel, are you?

I am not running with CONFIG_XFS_DEBUG, no.

> > while the rest of the I/O on this filesystem goes through
> > the page cache and uses the delayed allocation mechanism by default.
> > (IOW, t_firstblock is NULLFSBLOCK most of the time.)
>
> This has nothing to do with delayed allocation - the physical allocation
> that occurs during writeback will do exactly the same thing as direct IO
> allocation....

It must be similar, but not identical. Though I think you're right that
I've confused myself about what's happening here. The dd process that is
triggering this warning is an incredibly small fraction of all the I/O
that occurs on this filesystem, and yet it accounts for all of the
instances of the warning that I can find.

I looked in xfs_bmapi_convert_delalloc and saw that xfs_trans_alloc
allocated a new transaction but didn't touch t_firstblock. What I
neglected to notice, though, was that xfs_bmap_btalloc and
xfs_bmbt_alloc_block could modify t_firstblock, and that this code path
could get there via the xfs_btree_insert calls in
xfs_bmap_add_extent_delay_real for the latter, and via the
xfs_bmap_alloc path for the former. I missed the difference in the
alloc_args accounting between the delayed and real allocation paths, and
assumed that t_firstblock was the thing making the difference between
the regular and direct I/O cases.

> > However, linux is using 64-bit block
> > pointers in the inode now and the XFS_ALLOCTYPE_START_BNO case in
> > xfs_alloc_vextent() seems to try to ensure that it never considers an AG
> > that's less than the agno for the fsbno passed in via args.
>
> Because otherwise allocations ABBA deadlock on AG locking.

Sorry, what I'm really trying to ask is: are there still cases in XFS
where using XFS_ALLOCTYPE_START_BNO can give you this kind of deadlock?
There's some deadlock avoidance in the START_BNO implementation, but
there are plenty of places where XFS_ALLOCTYPE_NEAR_BNO is still used if
t_firstblock is not NULLFSBLOCK.
I'm trying to work out if that code still exists because of concerns
about deadlocks, or if it's an attempt to limit the number of AGs
searched instead. (Or some other policy choice, even.)

> > be a reasonable way to address the WARN? Or does this open a box of
> > problems that are obvious to the experienced, but just subtle enough to
> > elude the unfamiliar?
>
> No, yes, and yes.

Thank you for humoring my questions nonetheless.

> > The xfs_db freesp report after the problem occurred. (N.B. it was a few
> > hours before I was able to get to this machine to investigate)
> >
> > xfs_db -r -c 'freesp -a 47 -s' /dev/mapper/db-vol
> >    from      to extents  blocks    pct
> >       1       1      48      48   0.00
> >       2       3     119     303   0.02
> >       4       7      46     250   0.01
> >       8      15      22     255   0.01
> >      16      31      17     374   0.02
> >      32      63      16     728   0.04
> >      64     127       9     997   0.05
> >     128     255     149   34271   1.83
> >     256     511       7    2241   0.12
> >     512    1023       4    2284   0.12
> >    1024    2047       1    1280   0.07
> >    2048    4095       1    3452   0.18
> > 1048576 2097151       1 1825182  97.52
> > total free extents 440
> > total free blocks 1871665
> > average free extent size 4253.78
>
> So 1,871,665 of 228,849,020 blocks are free in the AG. That's 99.2%
> full, so it's extremely likely you are hitting a full AG condition.
>
> /me goes and looks at xfs_iomap_write_direct()....
>
> .... and notices that it passes "0" as the total allocation block
> count, which means it isn't reserving space in the AG for both the
> data extent and the BMBT blocks...
>
> ... and several other xfs_bmapi_write() callers have the same issue...
>
> Ok, let me spend a bit more time looking into this, but it looks like
> the problem is at the xfs_bmapi_write() caller level, not deep in the
> allocator itself.

At least on 5.4, xfs_iomap_write_direct still passes resblks, which it
computes itself, to xfs_bmapi_write instead of zero.

Related to your comment about alloc_args.total, I did a bit of tracing
of the xfs_alloc tracepoints on my system and found that total seems to
be getting set in both cases, but that a) it's actually a larger value
for directio; and b) in the buffered write case the code is requesting
more blocks at one time, which causes a larger allocation to occur. I'm
not certain, but I wondered if this could be causing us to select an AG
with more space by luck.

directio:

dd-102229 [005] .... 4969662.383215: xfs_alloc_exact_done: dev 253:1 agno 0 agbno 14240 minlen 4 maxlen 4 mod 0 prod 1 minleft 2 total 8 alignment 1 minalignslop 3 len 4 type THIS_BNO otype THIS_BNO wasdel 0 wasfromfl 0 resv 0 datatype 0x9 firstblock 0xffffffffffffffff
dd-102229 [005] .... 4969662.383217: <stack trace>
 => xfs_alloc_ag_vextent_exact+0x2b4/0x2d0 [xfs]
 => xfs_alloc_ag_vextent+0x13b/0x150 [xfs]
 => xfs_alloc_vextent+0x2bc/0x550 [xfs]
 => xfs_bmap_btalloc+0x461/0x940 [xfs]
 => xfs_bmap_alloc+0x34/0x40 [xfs]
 => xfs_bmapi_allocate+0xdc/0x2d0 [xfs]
 => xfs_bmapi_write+0x3a9/0x5f0 [xfs]
 => xfs_iomap_write_direct+0x293/0x3c0 [xfs]
 => xfs_file_iomap_begin+0x4d2/0x5c0 [xfs]
 => iomap_apply+0x68/0x160
 => iomap_dio_rw+0x2c1/0x450
 => xfs_file_dio_aio_write+0x103/0x2e0 [xfs]
 => xfs_file_write_iter+0x99/0xe0 [xfs]
 => new_sync_write+0x125/0x1c0
 => __vfs_write+0x29/0x40
 => vfs_write+0xb9/0x1a0
 => ksys_write+0x67/0xe0
 => __x64_sys_write+0x1a/0x20
 => do_syscall_64+0x57/0x190
 => entry_SYSCALL_64_after_hwframe+0x44/0xa9

buffered write + fsync:

dd-109921 [010] .... 4972814.844429: xfs_alloc_exact_done: dev 253:1 agno 0 agbno 21280 minlen 16 maxlen 16 mod 0 prod 1 minleft 2 total 4 alignment 1 minalignslop 3 len 16 type THIS_BNO otype THIS_BNO wasdel 1 wasfromfl 0 resv 0 datatype 0x9 firstblock 0xffffffffffffffff
dd-109921 [010] .... 4972814.844433: <stack trace>
 => xfs_alloc_ag_vextent_exact+0x2b4/0x2d0 [xfs]
 => xfs_alloc_ag_vextent+0x13b/0x150 [xfs]
 => xfs_alloc_vextent+0x2bc/0x550 [xfs]
 => xfs_bmap_btalloc+0x461/0x940 [xfs]
 => xfs_bmap_alloc+0x34/0x40 [xfs]
 => xfs_bmapi_allocate+0xdc/0x2d0 [xfs]
 => xfs_bmapi_convert_delalloc+0x26f/0x4b0 [xfs]
 => xfs_map_blocks+0x15a/0x3f0 [xfs]
 => xfs_do_writepage+0x118/0x420 [xfs]
 => write_cache_pages+0x1ae/0x4b0
 => xfs_vm_writepages+0x6a/0xa0 [xfs]
 => do_writepages+0x43/0xd0
 => __filemap_fdatawrite_range+0xd5/0x110
 => file_write_and_wait_range+0x74/0xc0
 => xfs_file_fsync+0x5d/0x230 [xfs]
 => vfs_fsync_range+0x49/0x80
 => do_fsync+0x3d/0x70
 => __x64_sys_fsync+0x14/0x20
 => do_syscall_64+0x57/0x190
 => entry_SYSCALL_64_after_hwframe+0x44/0xa9

I'm not sure if this is useful, but if there's something else that you'd
like me to instrument that would be, let me know and I'll see what I can
pull together.

Thanks again,

-K
Hi Dave,

On Thu, Jan 06, 2022 at 12:01:23PM +1100, Dave Chinner wrote:
> On Tue, Jan 04, 2022 at 11:10:52PM -0800, Krister Johansen wrote:
> > Hi,
> > I've been running into occasional WARNs related to allocating a block to
> > hold the new btree that XFS is attempting to create when calling this
> > function. The problem is sporadic -- once every 10-40 days and a
> > different system each time. The warning is:
> >
> > WARNING: CPU: 4 PID: 115756 at fs/xfs/libxfs/xfs_bmap.c:716 xfs_bmap_extents_to_btree+0x3dc/0x610 [xfs]
> > RIP: 0010:xfs_bmap_extents_to_btree+0x3dc/0x610 [xfs]
> > Call Trace:
> >  xfs_bmap_add_extent_hole_real+0x7d9/0x8f0 [xfs]
> >  xfs_bmapi_allocate+0x2a8/0x2d0 [xfs]
> >  xfs_bmapi_write+0x3a9/0x5f0 [xfs]
> >  xfs_iomap_write_direct+0x293/0x3c0 [xfs]
> >  xfs_file_iomap_begin+0x4d2/0x5c0 [xfs]
> >  iomap_apply+0x68/0x160
> >  iomap_dio_rw+0x2c1/0x450
> >  xfs_file_dio_aio_write+0x103/0x2e0 [xfs]
> >  xfs_file_write_iter+0x99/0xe0 [xfs]
> >  new_sync_write+0x125/0x1c0
> >  __vfs_write+0x29/0x40
> >  vfs_write+0xb9/0x1a0
> >  ksys_write+0x67/0xe0
> >  __x64_sys_write+0x1a/0x20
> >  do_syscall_64+0x57/0x190
> >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> <snip>
> So 1,871,665 of 228,849,020 blocks are free in the AG. That's 99.2%
> full, so it's extremely likely you are hitting a full AG condition.
>
> /me goes and looks at xfs_iomap_write_direct()....
>
> .... and notices that it passes "0" as the total allocation block
> count, which means it isn't reserving space in the AG for both the
> data extent and the BMBT blocks...
>
> ... and several other xfs_bmapi_write() callers have the same issue...
>
> Ok, let me spend a bit more time looking into this, but it looks like
> the problem is at the xfs_bmapi_write() caller level, not deep in the
> allocator itself.

I noodled on this a bit more and have another hypothesis. Feel free to
tell me that this one is just as nuts (or more). However, after thinking
through your comments about the accounting, and reviewing some other
patches and threads for similar problems:

https://lore.kernel.org/linux-xfs/20171127202434.43125-4-bfoster@redhat.com/
https://lore.kernel.org/linux-xfs/20171207185810.48757-1-bfoster@redhat.com/
https://lore.kernel.org/linux-xfs/20190327145000.10756-1-bfoster@redhat.com/

I wondered if perhaps the problem was related to other problems in
xfs_alloc_fix_freelist. Taking inspiration from some of the fixes that
Brian made here, it looks like there's a possibility of the freelist
refill code grabbing blocks that were assumed to be available by
previous checks in that function.

For example, using some values from a successful trace of a directio
allocation:

dd-102227 [027] .... 4969662.381037: xfs_alloc_near_first: dev 253:1 agno 0 agbno 5924 minlen 4 maxlen 4 mod 0 prod 1 minleft 1 total 8 alignment 4 minalignslop 0 len 4 type NEAR_BNO otype START_BNO wasdel 0 wasfromfl 0 resv 0 datatype 0x9 firstblock 0xffffffffffffffff

dd-102227 [027] .... 4969662.381047: xfs_alloc_near_first: dev 253:1 agno 0 agbno 5921 minlen 1 maxlen 1 mod 0 prod 1 minleft 0 total 0 alignment 1 minalignslop 0 len 1 type NEAR_BNO otype NEAR_BNO wasdel 0 wasfromfl 0 resv 0 datatype 0x0 firstblock 0x1724

[first is the bmap alloc, second is the extents_to_btree alloc]

if	agflcount = min(pagf_flcount, min_free)
	agflcount = min(3, 8)

and	available = pagf_freeblks + agflcount - reservation - min_free - minleft
	available = 14 + 3 - 0 - 8 - 1
	available = 8

which satisfies the total from the first allocation request; however, if
this code path needs to refill the freelists and the AG btree is full
because a lot of space is allocated and not much is free, then inserts
here may trigger rebalances. Usage might look something like this:

	pagf_freeblks = 14
	allocate 5 blocks to fill freelist
	pagf_freeblks = 9
	fill of freelist triggers split that requires 4 nodes
	next iteration allocates 4 blocks to refill freelist
	pagf_freeblks = 5
	refill requires rebalance and another node
	next iteration allocates 1 block to refill freelist
	pagf_freeblks = 4
	freelist filled; return to caller

	caller consumes remaining 4 blocks for bmap allocation

	pagf_freeblks = 0

	no blocks available for xfs_bmap_extents_to_btree

I'm not sure if this is possible, but I thought I'd mention it since
Brian's prior work here got me thinking about it. If this does sound
plausible, what do you think about re-validating the space_available
conditions after refilling the freelist? Something like:

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 353e53b..d235744 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2730,6 +2730,16 @@ xfs_alloc_fix_freelist(
 		}
 	}
 	xfs_trans_brelse(tp, agflbp);
+
+	/*
+	 * Freelist refill may have consumed blocks from pagf_freeblks. Ensure
+	 * that this allocation still meets its requested constraints by
+	 * revalidating the min_freelist and space_available checks.
+	 */
+	need = xfs_alloc_min_freelist(mp, pag);
+	if (!xfs_alloc_space_available(args, need, flags))
+		goto out_agbp_relse;
+
 	args->agbp = agbp;
 	return 0;

perhaps?

Thanks again,

-K
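For reference, the availability arithmetic from the mail above as a
standalone program (the formula mirrors the quoted logic and the numbers
come from the directio trace; the program itself is only an
illustration, not the kernel implementation):

#include <stdio.h>

static long min_long(long a, long b)
{
	return a < b ? a : b;
}

int main(void)
{
	long pagf_freeblks = 14;	/* free blocks in the AG */
	long pagf_flcount  = 3;		/* blocks already on the AGFL */
	long min_free      = 8;		/* required freelist length */
	long reservation   = 0;
	long minleft       = 1;

	long agflcount = min_long(pagf_flcount, min_free);
	long available = pagf_freeblks + agflcount - reservation -
			 min_free - minleft;

	/* 14 + 3 - 0 - 8 - 1 = 8, which satisfies total = 8 above */
	printf("available = %ld\n", available);
	return 0;
}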
On Thu, Jan 06, 2022 at 12:52:28AM -0800, Krister Johansen wrote:
> On Thu, Jan 06, 2022 at 12:01:23PM +1100, Dave Chinner wrote:
> > On Tue, Jan 04, 2022 at 11:10:52PM -0800, Krister Johansen wrote:
> > > However, linux is using 64-bit block
> > > pointers in the inode now and the XFS_ALLOCTYPE_START_BNO case in
> > > xfs_alloc_vextent() seems to try to ensure that it never considers an AG
> > > that's less than the agno for the fsbno passed in via args.
> >
> > Because otherwise allocations ABBA deadlock on AG locking.
>
> Sorry, what I'm really trying to ask is: are there still cases in XFS
> where using XFS_ALLOCTYPE_START_BNO can give you this kind of deadlock?

I'd say yes - I can think of several scenarios where we make multiple
allocations per transaction (directory and attribute code) and I think
that they don't actually use t_firstblock correctly to avoid AGF
locking issues. i.e. I think that tp->t_firstblock should actually be
tracking the highest locked AGF in the transaction, not the first AGF
we locked...

> There's some deadlock avoidance in the START_BNO implementation, but
> there are plenty of places where XFS_ALLOCTYPE_NEAR_BNO is still used if
> t_firstblock is not NULLFSBLOCK.

Right, NEAR_BNO is used because we likely have physical locality
constraints (e.g. seek minimisation by placing related blocks as close
to each other as possible) and potentially other limits, like the
allocations are for per-AG data and therefore must be placed within the
specified AG...

> I'm trying to work out if that code
> still exists because of concerns about deadlocks, or if it's an attempt
> to limit the number of AGs searched instead. (Or some other policy
> choice, even.)

All of the above, and more... :(

> > > be a reasonable way to address the WARN? Or does this open a box of
> > > problems that are obvious to the experienced, but just subtle enough to
> > > elude the unfamiliar?
> >
> > No, yes, and yes.
>
> Thank you for humoring my questions nonetheless.
>
> > > The xfs_db freesp report after the problem occurred. (N.B. it was a few
> > > hours before I was able to get to this machine to investigate)
> > >
> > > xfs_db -r -c 'freesp -a 47 -s' /dev/mapper/db-vol
> > >    from      to extents  blocks    pct
> > >       1       1      48      48   0.00
> > >       2       3     119     303   0.02
> > >       4       7      46     250   0.01
> > >       8      15      22     255   0.01
> > >      16      31      17     374   0.02
> > >      32      63      16     728   0.04
> > >      64     127       9     997   0.05
> > >     128     255     149   34271   1.83
> > >     256     511       7    2241   0.12
> > >     512    1023       4    2284   0.12
> > >    1024    2047       1    1280   0.07
> > >    2048    4095       1    3452   0.18
> > > 1048576 2097151       1 1825182  97.52
> > > total free extents 440
> > > total free blocks 1871665
> > > average free extent size 4253.78
> >
> > So 1,871,665 of 228,849,020 blocks are free in the AG. That's 99.2%
> > full, so it's extremely likely you are hitting a full AG condition.
> >
> > /me goes and looks at xfs_iomap_write_direct()....
> >
> > .... and notices that it passes "0" as the total allocation block
> > count, which means it isn't reserving space in the AG for both the
> > data extent and the BMBT blocks...
> >
> > ... and several other xfs_bmapi_write() callers have the same issue...
> >
> > Ok, let me spend a bit more time looking into this, but it looks like
> > the problem is at the xfs_bmapi_write() caller level, not deep in the
> > allocator itself.
>
> At least on 5.4, xfs_iomap_write_direct still passes resblks, which it
> computes itself, to xfs_bmapi_write instead of zero.
Yup, I missed commit da781e64b28c ("xfs: don't set bmapi total block
req where minleft is") back in 2019, where that behaviour was changed.
Instead, it changed xfs_bmapi_write() to implicitly manage space for
BMBT blocks via args->minleft, whilst still explicitly requiring the
caller to reserve those blocks at transaction allocation time.

Bit of a mess, really, because multi-allocation transactions are still
required to pass both the data blocks + the possible BMBT blocks that
might be needed to xfs_bmapi_write(). I suspect that for this case the
implicit args->minleft reservation is double accounted...

> Related to your comment about alloc_args.total, I did a bit of tracing
> of the xfs_alloc tracepoints on my system and found that total seems to
> be getting set in both cases, but that a) it's actually a larger value
> for directio; and b) in the buffered write case the code is requesting
> more blocks at one time, which causes a larger allocation to occur. I'm
> not certain, but I wondered if this could be causing us to select an AG
> with more space by luck.
>
> directio:
>
> dd-102229 [005] .... 4969662.383215: xfs_alloc_exact_done: dev 253:1 agno 0 agbno 14240 minlen 4 maxlen 4 mod 0 prod 1 minleft 2 total 8 alignment 1 minalignslop 3 len 4 type THIS_BNO otype THIS_BNO wasdel 0 wasfromfl 0 resv 0 datatype 0x9 firstblock 0xffffffffffffffff

That one is correct - 4 extra blocks on top of the data extent...

> buffered write + fsync:
>
> dd-109921 [010] .... 4972814.844429: xfs_alloc_exact_done: dev 253:1 agno 0 agbno 21280 minlen 16 maxlen 16 mod 0 prod 1 minleft 2 total 4 alignment 1 minalignslop 3 len 16 type THIS_BNO otype THIS_BNO wasdel 1 wasfromfl 0 resv 0 datatype 0x9 firstblock 0xffffffffffffffff

But that one is clearly wrong - we're asking for 16 blocks for the data
extent, with a total block count for the allocation of only 4 blocks.
Even though we've reserved 16 blocks during the initial delayed
allocation, we've still got to have 16 + 4 blocks free in the AG for
the allocation to succeed. That's one of the bugs the commit I
mentioned above fixed...

Cheers,

Dave.
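To restate the two constraints from that analysis as code (a simplified
sketch of the semantics only, not the XFS allocator): total is checked
against the AG's free space up front for the whole chain, while minleft
demands that space remain after the data extent is taken. With the
buffered-write trace values (len 16, total 4), an AG that can fit the
data extent alone passes the total check but has nothing left for the
BMBT.

#include <stdbool.h>
#include <stdio.h>

struct alloc_args {
	long len;	/* data blocks requested */
	long total;	/* whole allocation chain, data + BMBT */
	long minleft;	/* blocks that must remain after the data alloc */
};

static bool ag_can_satisfy(long freeblks, const struct alloc_args *a)
{
	if (freeblks < a->total)
		return false;
	return freeblks - a->len >= a->minleft;
}

int main(void)
{
	/* Values from the buffered-write trace above. */
	struct alloc_args buffered = { .len = 16, .total = 4, .minleft = 2 };

	printf("17 free: %d\n", ag_can_satisfy(17, &buffered));	/* refused */
	printf("20 free: %d\n", ag_can_satisfy(20, &buffered));	/* ok */
	return 0;
}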
On Fri, Jan 07, 2022 at 09:40:14PM -0800, Krister Johansen wrote:
> On Thu, Jan 06, 2022 at 12:01:23PM +1100, Dave Chinner wrote:
> > On Tue, Jan 04, 2022 at 11:10:52PM -0800, Krister Johansen wrote:
> I wondered if perhaps the problem was related to other problems in
> xfs_alloc_fix_freelist. Taking inspiration from some of the fixes that
> Brian made here, it looks like there's a possibility of the freelist
> refill code grabbing blocks that were assumed to be available by
> previous checks in that function.
>
> For example, using some values from a successful trace of a directio
> allocation:
>
> dd-102227 [027] .... 4969662.381037: xfs_alloc_near_first: dev 253:1 agno 0 agbno 5924 minlen 4 maxlen 4 mod 0 prod 1 minleft 1 total 8 alignment 4 minalignslop 0 len 4 type NEAR_BNO otype START_BNO wasdel 0 wasfromfl 0 resv 0 datatype 0x9 firstblock 0xffffffffffffffff
>
> dd-102227 [027] .... 4969662.381047: xfs_alloc_near_first: dev 253:1 agno 0 agbno 5921 minlen 1 maxlen 1 mod 0 prod 1 minleft 0 total 0 alignment 1 minalignslop 0 len 1 type NEAR_BNO otype NEAR_BNO wasdel 0 wasfromfl 0 resv 0 datatype 0x0 firstblock 0x1724
>
> [first is the bmap alloc, second is the extents_to_btree alloc]
>
> if	agflcount = min(pagf_flcount, min_free)
> 	agflcount = min(3, 8)
>
> and	available = pagf_freeblks + agflcount - reservation - min_free - minleft
> 	available = 14 + 3 - 0 - 8 - 1
> 	available = 8
>
> which satisfies the total from the first allocation request; however, if
> this code path needs to refill the freelists and the AG btree is full
> because a lot of space is allocated and not much is free, then inserts
> here may trigger rebalances. Usage might look something like this:
>
> 	pagf_freeblks = 14
> 	allocate 5 blocks to fill freelist
> 	pagf_freeblks = 9
> 	fill of freelist triggers split that requires 4 nodes
> 	next iteration allocates 4 blocks to refill freelist
> 	pagf_freeblks = 5
> 	refill requires rebalance and another node
> 	next iteration allocates 1 block to refill freelist
> 	pagf_freeblks = 4
> 	freelist filled; return to caller
>
> 	caller consumes remaining 4 blocks for bmap allocation
>
> 	pagf_freeblks = 0
>
> 	no blocks available for xfs_bmap_extents_to_btree

No, can't happen. When the AG is nearly full, there is only one block in
each AG freespace tree - the root block. Those root blocks can hold ~500
free space extents. Hence if there are only 14 free blocks left in the
AG, then by definition we have a single level tree and a split cannot
ever occur.

Remember, the AGFL is for space management btree splits and merges, not
for BMBT splits/merges. BMBT blocks are user metadata and have to be
reserved up front from the AG; they are not allocated from/accounted via
the free lists that the AGF freespace and rmap btrees use for their
block management.

> I'm not sure if this is possible, but I thought I'd mention it since
> Brian's prior work here got me thinking about it. If this does sound
> plausible, what do you think about re-validating the space_available
> conditions after refilling the freelist? Something like:
>
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 353e53b..d235744 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -2730,6 +2730,16 @@ xfs_alloc_fix_freelist(
> 		}
> 	}
> 	xfs_trans_brelse(tp, agflbp);
> +
> +	/*
> +	 * Freelist refill may have consumed blocks from pagf_freeblks. Ensure
> +	 * that this allocation still meets its requested constraints by
> +	 * revalidating the min_freelist and space_available checks.
> +	 */
> +	need = xfs_alloc_min_freelist(mp, pag);
> +	if (!xfs_alloc_space_available(args, need, flags))
> +		goto out_agbp_relse;

No, that will just lead to a filesystem shutdown, because at ENOSPC
we'll have a transaction containing a dirty AGF/AGFL and freespace
btrees. At that point, the transaction cancellation will see that it
can't cleanly back out, and at that point it's all over...

This is also an AGF ABBA deadlock vector, because now we have a dirty
AGF locked in the transaction that we haven't tracked via
t_firstblock....

Cheers,

Dave.
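As a rough sanity check of the "~500 free space extents" figure for a
root block (the 4096-byte block size and the ~56-byte CRC-enabled
short-form btree header below are assumptions for illustration; a
by-block freespace record is two 32-bit fields, startblock and
blockcount, i.e. 8 bytes):

#include <stdio.h>

int main(void)
{
	unsigned int blocksize = 4096;	/* assumed fs block size */
	unsigned int hdr = 56;		/* assumed short-form CRC header */
	unsigned int rec = 8;		/* freespace record: 2 x 32 bits */

	/* (4096 - 56) / 8 = 505, consistent with the ~500 figure */
	printf("records per block ~= %u\n", (blocksize - hdr) / rec);
	return 0;
}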
On Mon, Jan 10, 2022 at 07:41:52PM +1100, Dave Chinner wrote:
> On Thu, Jan 06, 2022 at 12:52:28AM -0800, Krister Johansen wrote:
> > On Thu, Jan 06, 2022 at 12:01:23PM +1100, Dave Chinner wrote:
> > > On Tue, Jan 04, 2022 at 11:10:52PM -0800, Krister Johansen wrote:
> > > So 1,871,665 of 228,849,020 blocks are free in the AG. That's 99.2%
> > > full, so it's extremely likely you are hitting a full AG condition.
> > >
> > > /me goes and looks at xfs_iomap_write_direct()....
> > >
> > > .... and notices that it passes "0" as the total allocation block
> > > count, which means it isn't reserving space in the AG for both the
> > > data extent and the BMBT blocks...
> > >
> > > ... and several other xfs_bmapi_write() callers have the same issue...
> > >
> > > Ok, let me spend a bit more time looking into this, but it looks like
> > > the problem is at the xfs_bmapi_write() caller level, not deep in the
> > > allocator itself.
> >
> > At least on 5.4, xfs_iomap_write_direct still passes resblks, which it
> > computes itself, to xfs_bmapi_write instead of zero.
>
> Yup, I missed commit da781e64b28c ("xfs: don't set bmapi total block
> req where minleft is") back in 2019, where that behaviour was changed.
> Instead, it changed xfs_bmapi_write() to implicitly manage space for
> BMBT blocks via args->minleft, whilst still explicitly requiring the
> caller to reserve those blocks at transaction allocation time.

FWIW, I should have said here that you should really add that commit to
your current tree and see if that fixes the problem...

Cheers,

Dave.
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 4dccd4d90622..5d949ac1ecae 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -664,6 +664,13 @@ xfs_bmap_extents_to_btree(
 	if (error)
 		goto out_root_realloc;
 
+	if (args.fsbno == NULLFSBLOCK && args.type == XFS_ALLOCTYPE_NEAR_BNO) {
+		args.type = XFS_ALLOCTYPE_START_BNO;
+		error = xfs_alloc_vextent(&args);
+		if (error)
+			goto out_root_realloc;
+	}
+
 	if (WARN_ON_ONCE(args.fsbno == NULLFSBLOCK)) {
 		error = -ENOSPC;
 		goto out_root_realloc;

Or this:

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 4dccd4d90622..94e4ecb75561 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -647,14 +647,10 @@ xfs_bmap_extents_to_btree(
 	args.tp = tp;
 	args.mp = mp;
 	xfs_rmap_ino_bmbt_owner(&args.oinfo, ip->i_ino, whichfork);
+	args.type = XFS_ALLOCTYPE_START_BNO;
 	if (tp->t_firstblock == NULLFSBLOCK) {
-		args.type = XFS_ALLOCTYPE_START_BNO;
 		args.fsbno = XFS_INO_TO_FSB(mp, ip->i_ino);
-	} else if (tp->t_flags & XFS_TRANS_LOWMODE) {
-		args.type = XFS_ALLOCTYPE_START_BNO;
-		args.fsbno = tp->t_firstblock;
 	} else {
-		args.type = XFS_ALLOCTYPE_NEAR_BNO;
 		args.fsbno = tp->t_firstblock;
 	}
 	args.minlen = args.maxlen = args.prod = 1;