xfs: make sure sb_fdblocks is non-negative

Message ID 20240511003426.13858-1-wen.gang.wang@oracle.com (mailing list archive)
State New
Series: xfs: make sure sb_fdblocks is non-negative

Commit Message

Wengang Wang May 11, 2024, 12:34 a.m. UTC
When writing the superblock to disk (in xfs_log_sb), sb_fdblocks is fetched from
m_fdblocks without any lock. Since m_fdblocks can swing positive -> negative
-> positive as the filesystem approaches fullness (see xfs_mod_fdblocks),
there is a chance that sb_fdblocks is set from a negative value. Because
sb_fdblocks is an unsigned long long, that negative value reads as a huge
positive number, and sb_fdblocks being bigger than sb_dblocks is a problem
during log recovery: xfs_validate_sb_write() complains.

Fix:
Since sb_fdblocks is recalculated during mount when lazysbcount is enabled,
we just need to make xfs_validate_sb_write() happy -- make sure sb_fdblocks
is not negative.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
---
 fs/xfs/libxfs/xfs_sb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
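
To see the conversion in isolation, here is a minimal standalone C sketch
(not XFS code; it only demonstrates the C rule that assigning a negative
s64 to an unsigned 64-bit field wraps modulo 2^64):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* percpu_counter_sum() returns a signed 64-bit value, which can be
	 * transiently negative near ENOSPC.
	 */
	int64_t sum = -12;

	/* sb_fdblocks is unsigned 64-bit, so the assignment wraps */
	uint64_t sb_fdblocks = sum;

	/* prints 18446744073709551604, i.e. 0xfffffffffffffff4 */
	printf("%llu\n", (unsigned long long)sb_fdblocks);
	return 0;
}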

Comments

Dave Chinner May 11, 2024, 1:17 a.m. UTC | #1
On Fri, May 10, 2024 at 05:34:26PM -0700, Wengang Wang wrote:
> When writing the superblock to disk (in xfs_log_sb), sb_fdblocks is fetched from
> m_fdblocks without any lock. Since m_fdblocks can swing positive -> negative
> -> positive as the filesystem approaches fullness (see xfs_mod_fdblocks),
> there is a chance that sb_fdblocks is set from a negative value. Because
> sb_fdblocks is an unsigned long long, that negative value reads as a huge
> positive number, and sb_fdblocks being bigger than sb_dblocks is a problem
> during log recovery: xfs_validate_sb_write() complains.
> 
> Fix:
> Since sb_fdblocks is recalculated during mount when lazysbcount is enabled,
> we just need to make xfs_validate_sb_write() happy -- make sure sb_fdblocks
> is not negative.

Ok, I have no problems with the change being made, but I'm unclear
on why we care if these values get logged as large positive numbers?

The comment above this code explains that these counts are known to
be inaccurate and so are not trusted. i.e. they will be corrected
post-log recovery if they are recovered from the log:

	 * Lazy sb counters don't update the in-core superblock so do that now.
	 * If this is at unmount, the counters will be exactly correct, but at
	 * any other time they will only be ballpark correct because of
	 * reservations that have been taken out percpu counters. If we have an
	 * unclean shutdown, this will be corrected by log recovery rebuilding
	 * the counters from the AGF block counts.

IOWs journal recovery doesn't actually care what these values are,
so what actually goes wrong if this sum returns a negative value?

Cheers,

Dave.
Wengang Wang May 13, 2024, 5:06 p.m. UTC | #2
Hi Dave,
Please see inlines,

> On May 10, 2024, at 6:17 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Fri, May 10, 2024 at 05:34:26PM -0700, Wengang Wang wrote:
>> When writing the superblock to disk (in xfs_log_sb), sb_fdblocks is fetched from
>> m_fdblocks without any lock. Since m_fdblocks can swing positive -> negative
>> -> positive as the filesystem approaches fullness (see xfs_mod_fdblocks),
>> there is a chance that sb_fdblocks is set from a negative value. Because
>> sb_fdblocks is an unsigned long long, that negative value reads as a huge
>> positive number, and sb_fdblocks being bigger than sb_dblocks is a problem
>> during log recovery: xfs_validate_sb_write() complains.
>> 
>> Fix:
>> Since sb_fdblocks is recalculated during mount when lazysbcount is enabled,
>> we just need to make xfs_validate_sb_write() happy -- make sure sb_fdblocks
>> is not negative.
> 
> Ok, I have no problems with the change being made, but I'm unclear
> on why we care if these values get logged as large positive numbers?
> 
> The comment above this code explains that these counts are known to
> be inaccurate and so are not trusted. i.e. they will be corrected
> post-log recovery if they are recovered from the log:
> 
> 	 * Lazy sb counters don't update the in-core superblock so do that now.
> 	 * If this is at unmount, the counters will be exactly correct, but at
> 	 * any other time they will only be ballpark correct because of
> 	 * reservations that have been taken out percpu counters. If we have an
> 	 * unclean shutdown, this will be corrected by log recovery rebuilding
> 	 * the counters from the AGF block counts.
> 

The thing is, we have a metadump. Looking at fdblocks in superblock 0, it is good:

$ xfs_db -c "sb 0" -c "p" cust.img |egrep "dblocks|ifree|icount"
dblocks = 26214400
icount = 512
ifree = 337
fdblocks = 25997100

And when looking at the log, we have the following:

$ egrep -a "fdblocks|icount|ifree" cust.log |tail
sb_fdblocks 37
sb_icount 1056
sb_ifree 87
sb_fdblocks 37
sb_icount 1056
sb_ifree 87
sb_fdblocks 37
sb_icount 1056
sb_ifree 87
sb_fdblocks 18446744073709551604

# cust.log is the output of my script, which tries to parse the log buffer.

18446744073709551604ULL == 0xfffffffffffffff4 or -12LL 
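As a quick check of that conversion: 2^64 - 12 = 18446744073709551616 - 12
= 18446744073709551604, and the low byte of 0xfffffffffffffff4 is
0xf4 = 244 = 256 - 12, as expected for -12 in two's complement.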

With an upstream kernel (6.7.0-rc3), when I tried to mount (log recovery) the metadump,
I got the following in dmesg:

[   52.927796] XFS (loop0): SB summary counter sanity check failed
[   52.928889] XFS (loop0): Metadata corruption detected at xfs_sb_write_verify+0x60/0x110 [xfs], xfs_sb block 0x0
[   52.930890] XFS (loop0): Unmount and run xfs_repair
[   52.931797] XFS (loop0): First 128 bytes of corrupted metadata buffer:
[   52.932954] 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 01 90 00 00  XFSB............
[   52.934333] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   52.935733] 00000020: c9 c1 ed ae 84 ed 46 b9 a1 f0 09 57 4a a9 98 42  ......F....WJ..B
[   52.937120] 00000030: 00 00 00 00 01 00 00 06 00 00 00 00 00 00 00 80  ................
[   52.938515] 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
[   52.939919] 00000050: 00 00 00 01 00 64 00 00 00 00 00 04 00 00 00 00  .....d..........
[   52.941293] 00000060: 00 00 64 00 b4 a5 02 00 02 00 00 08 00 00 00 00  ..d.............
[   52.942661] 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 17 00 00 19  ................
[   52.944046] XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply+0x38b/0x3a0 [xfs] (fs/xfs/xfs_buf.c:1559).  Shutting down filesystem.
[   52.946710] XFS (loop0): Please unmount the filesystem and rectify the problem(s)
[   52.948099] XFS (loop0): log mount/recovery failed: error -117
[   52.949810] XFS (loop0): log mount failed

Looking at the corresponding code:

xfs_validate_sb_write(
	struct xfs_mount        *mp,
	struct xfs_buf          *bp,
	struct xfs_sb           *sbp)
{
	/*
	 * Carry out additional sb summary counter sanity checks when we write
	 * the superblock.  We skip this in the read validator because there
	 * could be newer superblocks in the log and if the values are garbage
	 * even after replay we'll recalculate them at the end of log mount.
	 *
	 * mkfs has traditionally written zeroed counters to inprogress and
	 * secondary superblocks, so allow this usage to continue because
	 * we never read counters from such superblocks.
	 */
	if (xfs_buf_daddr(bp) == XFS_SB_DADDR && !sbp->sb_inprogress &&
	    (sbp->sb_fdblocks > sbp->sb_dblocks ||
	     !xfs_verify_icount(mp, sbp->sb_icount) ||
	     sbp->sb_ifree > sbp->sb_icount)) {
		xfs_warn(mp, "SB summary counter sanity check failed");
		return -EFSCORRUPTED;
	}

From the dmesg output and the code, we know the check failure was due to either a bad sb_ifree vs sb_icount pair or a bad sb_fdblocks vs sb_dblocks pair.

Looking at the superblock dump and the log dump, we know ifree and icount
are good; what's bad is sb_fdblocks, and that sb_fdblocks comes from the log
(18446744073709551604 > sb_dblocks = 26214400, so the first condition fires).
# I verified that sb_fdblocks is 0xfffffffffffffff4 with a UEK debug kernel (though not 6.7.0-rc3)

So the bad sb_fdblocks is carried from the log into the in-core superblock
and then trips the xfs_log_sb() -> xfs_validate_sb_write() path, even though
it should be (and later is) recalculated from the AGs.

The fix aims to make xfs_validate_sb_write() happy.

Thanks,
Wengang

> IOWs journal recovery doesn't actually care what these values are,
> so what actually goes wrong if this sum returns a negative value?
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Patch

diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 73a4b895de67..199756970383 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -1037,7 +1037,7 @@  xfs_log_sb(
 		mp->m_sb.sb_ifree = min_t(uint64_t,
 				percpu_counter_sum(&mp->m_ifree),
 				mp->m_sb.sb_icount);
-		mp->m_sb.sb_fdblocks = percpu_counter_sum(&mp->m_fdblocks);
+		mp->m_sb.sb_fdblocks = percpu_counter_sum_positive(&mp->m_fdblocks);
 	}
 
 	xfs_sb_to_disk(bp->b_addr, &mp->m_sb);
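
For context, percpu_counter_sum_positive() sums the counter the same way
percpu_counter_sum() does but clamps the result at zero. A rough sketch of
the idea (see include/linux/percpu_counter.h for the real helper):

static inline s64 percpu_counter_sum_positive(struct percpu_counter *fbc)
{
	s64 ret = percpu_counter_sum(fbc);

	/* A transiently negative sum is reported as 0 rather than
	 * wrapping to a huge unsigned value when stored in sb_fdblocks.
	 */
	return ret < 0 ? 0 : ret;
}

With the clamped sum, a transiently negative m_fdblocks can no longer
produce a sb_fdblocks larger than sb_dblocks, so xfs_validate_sb_write()
no longer trips.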