
xfs: make sure sb_fdblocks is non-negative

Message ID 20240511003426.13858-1-wen.gang.wang@oracle.com (mailing list archive)
State Superseded
Series xfs: make sure sb_fdblocks is non-negative

Commit Message

Wengang Wang May 11, 2024, 12:34 a.m. UTC
When writing the superblock to disk (in xfs_log_sb), sb_fdblocks is fetched from
m_fdblocks without any lock. Since m_fdblocks can go through a positive -> negative
-> positive transition when the FS reaches fullness (see xfs_mod_fdblocks),
there is a chance that sb_fdblocks is negative, and because sb_fdblocks is of
type unsigned long long, it reads as a huge value. sb_fdblocks being bigger
than sb_dblocks is then a problem during log recovery: xfs_validate_sb_write()
complains.

Fix:
Since sb_fdblocks will be re-calculated during mount when lazysbcount is enabled,
we just need to make xfs_validate_sb_write() happy -- make sure sb_fdblocks is
not negative.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
---
 fs/xfs/libxfs/xfs_sb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Dave Chinner May 11, 2024, 1:17 a.m. UTC | #1
On Fri, May 10, 2024 at 05:34:26PM -0700, Wengang Wang wrote:
> when writting super block to disk (in xfs_log_sb), sb_fdblocks is fetched from
> m_fdblocks without any lock. As m_fdblocks can experience a positive -> negativ
>  -> positive changing when the FS reaches fullness (see xfs_mod_fdblocks)
> So there is a chance that sb_fdblocks is negative, and because sb_fdblocks is
> type of unsigned long long, it reads super big. And sb_fdblocks being bigger
> than sb_dblocks is a problem during log recovery, xfs_validate_sb_write()
> complains.
> 
> Fix:
> As sb_fdblocks will be re-calculated during mount when lazysbcount is enabled,
> We just need to make xfs_validate_sb_write() happy -- make sure sb_fdblocks is
> not genative.

Ok, I have no problems with the change being made, but I'm unclear
on why we care if these values get logged as large positive numbers?

The comment above this code explains that these counts are known to
be inaccurate and so are not trusted. i.e. they will be corrected
post-log recovery if they are recovered from the log:

	 * Lazy sb counters don't update the in-core superblock so do that now.
         * If this is at unmount, the counters will be exactly correct, but at
         * any other time they will only be ballpark correct because of
         * reservations that have been taken out percpu counters. If we have an
         * unclean shutdown, this will be corrected by log recovery rebuilding
         * the counters from the AGF block counts.

IOWs journal recovery doesn't actually care what these values are,
so what actually goes wrong if this sum returns a negative value?

Cheers,

Dave.
Wengang Wang May 13, 2024, 5:06 p.m. UTC | #2
Hi Dave,
Please see inlines,

> On May 10, 2024, at 6:17 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Fri, May 10, 2024 at 05:34:26PM -0700, Wengang Wang wrote:
>> when writting super block to disk (in xfs_log_sb), sb_fdblocks is fetched from
>> m_fdblocks without any lock. As m_fdblocks can experience a positive -> negativ
>> -> positive changing when the FS reaches fullness (see xfs_mod_fdblocks)
>> So there is a chance that sb_fdblocks is negative, and because sb_fdblocks is
>> type of unsigned long long, it reads super big. And sb_fdblocks being bigger
>> than sb_dblocks is a problem during log recovery, xfs_validate_sb_write()
>> complains.
>> 
>> Fix:
>> As sb_fdblocks will be re-calculated during mount when lazysbcount is enabled,
>> We just need to make xfs_validate_sb_write() happy -- make sure sb_fdblocks is
>> not genative.
> 
> Ok, I have no problems with the change being made, but I'm unclear
> on why we care if these values get logged as large positive numbers?
> 
> The comment above this code explains that these counts are known to
> be inaccurate and so are not trusted. i.e. they will be corrected
> post-log recovery if they are recovered from the log:
> 
> * Lazy sb counters don't update the in-core superblock so do that now.
>         * If this is at unmount, the counters will be exactly correct, but at
>         * any other time they will only be ballpark correct because of
>         * reservations that have been taken out percpu counters. If we have an
>         * unclean shutdown, this will be corrected by log recovery rebuilding
>         * the counters from the AGF block counts.
> 

The thing is that we have a metadump; looking at the fdblocks from superblock 0, it is good:

$ xfs_db -c "sb 0" -c "p" cust.img |egrep "dblocks|ifree|icount"
dblocks = 26214400
icount = 512
ifree = 337
fdblocks = 25997100

And when looking at the log, we have the following:

$ egrep -a "fdblocks|icount|ifree" cust.log |tail
sb_fdblocks 37
sb_icount 1056
sb_ifree 87
sb_fdblocks 37
sb_icount 1056
sb_ifree 87
sb_fdblocks 37
sb_icount 1056
sb_ifree 87
sb_fdblocks 18446744073709551604

# cust.log is the output of my script which tries to parse the log buffer.

18446744073709551604ULL == 0xfffffffffffffff4 or -12LL 
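
A quick stand-alone C snippet (not kernel code, just an illustration of the reinterpretation) shows how a transient negative 64-bit sum reads as that huge value once it lands in an unsigned field like sb_fdblocks:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	int64_t sum = -12;			/* transient negative counter sum */
	uint64_t sb_fdblocks = (uint64_t)sum;	/* on-disk field is unsigned */

	/* prints 18446744073709551604 (0xfffffffffffffff4) */
	printf("%llu (0x%llx)\n",
	       (unsigned long long)sb_fdblocks,
	       (unsigned long long)sb_fdblocks);
	return 0;
}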

With upstream kernel (6.7.0-rc3), when I tried to mount (log recover) the metadump,
I got the following in dmesg:

[   52.927796] XFS (loop0): SB summary counter sanity check failed
[   52.928889] XFS (loop0): Metadata corruption detected at xfs_sb_write_verify+0x60/0x110 [xfs], xfs_sb block 0x0
[   52.930890] XFS (loop0): Unmount and run xfs_repair
[   52.931797] XFS (loop0): First 128 bytes of corrupted metadata buffer:
[   52.932954] 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 01 90 00 00  XFSB............
[   52.934333] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   52.935733] 00000020: c9 c1 ed ae 84 ed 46 b9 a1 f0 09 57 4a a9 98 42  ......F....WJ..B
[   52.937120] 00000030: 00 00 00 00 01 00 00 06 00 00 00 00 00 00 00 80  ................
[   52.938515] 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
[   52.939919] 00000050: 00 00 00 01 00 64 00 00 00 00 00 04 00 00 00 00  .....d..........
[   52.941293] 00000060: 00 00 64 00 b4 a5 02 00 02 00 00 08 00 00 00 00  ..d.............
[   52.942661] 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 17 00 00 19  ................
[   52.944046] XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply+0x38b/0x3a0 [xfs] (fs/xfs/xfs_buf.c:1559).  Shutting down filesystem.
[   52.946710] XFS (loop0): Please unmount the filesystem and rectify the problem(s)
[   52.948099] XFS (loop0): log mount/recovery failed: error -117
[   52.949810] XFS (loop0): log mount failed

Looking at corresponding code:
 231 xfs_validate_sb_write(
 232         struct xfs_mount        *mp,
 233         struct xfs_buf          *bp,
 234         struct xfs_sb           *sbp)
 235 {
 236         /*
 237          * Carry out additional sb summary counter sanity checks when we write
 238          * the superblock.  We skip this in the read validator because there
 239          * could be newer superblocks in the log and if the values are garbage
 240          * even after replay we'll recalculate them at the end of log mount.
 241          *
 242          * mkfs has traditionally written zeroed counters to inprogress and
 243          * secondary superblocks, so allow this usage to continue because
 244          * we never read counters from such superblocks.
 245          */
 246         if (xfs_buf_daddr(bp) == XFS_SB_DADDR && !sbp->sb_inprogress &&
 247             (sbp->sb_fdblocks > sbp->sb_dblocks ||
 248              !xfs_verify_icount(mp, sbp->sb_icount) ||
 249              sbp->sb_ifree > sbp->sb_icount)) {
 250                 xfs_warn(mp, "SB summary counter sanity check failed");
 251                 return -EFSCORRUPTED;
 252         }

From dmesg and code, we know the check failure was due to bad sb_ifree vs sb_icount or bad sb_fdblocks vs sb_dblocks.

Looking at the superblock dump and the log dump,
we know ifree and icount are good; what's bad is sb_fdblocks, and that sb_fdblocks comes from the log.
# I verified that sb_fdblocks is 0xfffffffffffffff4 with a UEK debug kernel (though not 6.7.0-rc3)

So sb_fdblocks is updated from the log to the in-core superblock and goes through the
xfs_log_sb() -> xfs_validate_sb_write() path, even though it should/may be re-calculated from the AGs.

The fix aims to make xfs_validate_sb_write() happy.

Thanks,
Wengang

> IOWs journal recovery doesn't actually care what these values are,
> so what actually goes wrong if this sum returns a negative value?
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Wengang Wang May 31, 2024, 3:44 p.m. UTC | #3
Hi Dave,

Do you have further comments and/or suggestions? Or give a RB pls :D

Thanks,
Wengang

> On May 13, 2024, at 10:06 AM, Wengang Wang <wen.gang.wang@oracle.com> wrote:
> 
> Hi Dave,
> Please see inlines,
> 
>> On May 10, 2024, at 6:17 PM, Dave Chinner <david@fromorbit.com> wrote:
>> 
>> On Fri, May 10, 2024 at 05:34:26PM -0700, Wengang Wang wrote:
>>> when writting super block to disk (in xfs_log_sb), sb_fdblocks is fetched from
>>> m_fdblocks without any lock. As m_fdblocks can experience a positive -> negativ
>>> -> positive changing when the FS reaches fullness (see xfs_mod_fdblocks)
>>> So there is a chance that sb_fdblocks is negative, and because sb_fdblocks is
>>> type of unsigned long long, it reads super big. And sb_fdblocks being bigger
>>> than sb_dblocks is a problem during log recovery, xfs_validate_sb_write()
>>> complains.
>>> 
>>> Fix:
>>> As sb_fdblocks will be re-calculated during mount when lazysbcount is enabled,
>>> We just need to make xfs_validate_sb_write() happy -- make sure sb_fdblocks is
>>> not genative.
>> 
>> Ok, I have no problems with the change being made, but I'm unclear
>> on why we care if these values get logged as large positive numbers?
>> 
>> The comment above this code explains that these counts are known to
>> be inaccurate and so are not trusted. i.e. they will be corrected
>> post-log recovery if they are recovered from the log:
>> 
>> * Lazy sb counters don't update the in-core superblock so do that now.
>>        * If this is at unmount, the counters will be exactly correct, but at
>>        * any other time they will only be ballpark correct because of
>>        * reservations that have been taken out percpu counters. If we have an
>>        * unclean shutdown, this will be corrected by log recovery rebuilding
>>        * the counters from the AGF block counts.
>> 
> 
> Things is that we have a metadump, looking at the fdblocks from super block 0, it is good.
> 
> $ xfs_db -c "sb 0" -c "p" cust.img |egrep "dblocks|ifree|icount"
> dblocks = 26214400
> icount = 512
> ifree = 337
> fdblocks = 25997100
> 
> And when looking at the log, we have the following:
> 
> $ egrep -a "fdblocks|icount|ifree" cust.log |tail
> sb_fdblocks 37
> sb_icount 1056
> sb_ifree 87
> sb_fdblocks 37
> sb_icount 1056
> sb_ifree 87
> sb_fdblocks 37
> sb_icount 1056
> sb_ifree 87
> sb_fdblocks 18446744073709551604
> 
> # cust.log is output of my script which tries to parse the log buffer.
> 
> 18446744073709551604ULL == 0xfffffffffffffff4 or -12LL 
> 
> With upstream kernel (6.7.0-rc3), when I tried to mount (log recover) the metadump,
> I got the following in dmesg:
> 
> [   52.927796] XFS (loop0): SB summary counter sanity check failed
> [   52.928889] XFS (loop0): Metadata corruption detected at xfs_sb_write_verify+0x60/0x110 [xfs], xfs_sb block 0x0
> [   52.930890] XFS (loop0): Unmount and run xfs_repair
> [   52.931797] XFS (loop0): First 128 bytes of corrupted metadata buffer:
> [   52.932954] 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 01 90 00 00  XFSB............
> [   52.934333] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> [   52.935733] 00000020: c9 c1 ed ae 84 ed 46 b9 a1 f0 09 57 4a a9 98 42  ......F....WJ..B
> [   52.937120] 00000030: 00 00 00 00 01 00 00 06 00 00 00 00 00 00 00 80  ................
> [   52.938515] 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
> [   52.939919] 00000050: 00 00 00 01 00 64 00 00 00 00 00 04 00 00 00 00  .....d..........
> [   52.941293] 00000060: 00 00 64 00 b4 a5 02 00 02 00 00 08 00 00 00 00  ..d.............
> [   52.942661] 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 17 00 00 19  ................
> [   52.944046] XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply+0x38b/0x3a0 [xfs] (fs/xfs/xfs_buf.c:1559).  Shutting down filesystem.
> [   52.946710] XFS (loop0): Please unmount the filesystem and rectify the problem(s)
> [   52.948099] XFS (loop0): log mount/recovery failed: error -117
> [   52.949810] XFS (loop0): log mount failed
> 
> Looking at corresponding code:
> 231 xfs_validate_sb_write(
> 232         struct xfs_mount        *mp,
> 233         struct xfs_buf          *bp,
> 234         struct xfs_sb           *sbp)
> 235 {
> 236         /*
> 237          * Carry out additional sb summary counter sanity checks when we write
> 238          * the superblock.  We skip this in the read validator because there
> 239          * could be newer superblocks in the log and if the values are garbage
> 240          * even after replay we'll recalculate them at the end of log mount.
> 241          *
> 242          * mkfs has traditionally written zeroed counters to inprogress and
> 243          * secondary superblocks, so allow this usage to continue because
> 244          * we never read counters from such superblocks.
> 245          */
> 246         if (xfs_buf_daddr(bp) == XFS_SB_DADDR && !sbp->sb_inprogress &&
> 247             (sbp->sb_fdblocks > sbp->sb_dblocks ||
> 248              !xfs_verify_icount(mp, sbp->sb_icount) ||
> 249              sbp->sb_ifree > sbp->sb_icount)) {
> 250                 xfs_warn(mp, "SB summary counter sanity check failed");
> 251                 return -EFSCORRUPTED;
> 252         }
> 
> From dmesg and code, we know the check failure was due to bad sb_ifree vs sb_icount or bad sb_fdblocks vs sb_dblocks.
> 
> Looking at the super block dump and log dump,
> We know ifree and icount are good, what’s bad is sb_fdblocks. And that sb_fdblocks is from log.
> # I verified that sb_fdblocks is 0xfffffffffffffff4 with a UEK debug kernel (though not 6.7.0-rc3)
> 
> So the sb_fdblocks is updated from log to incore at xfs_log_sb() -> xfs_validate_sb_write() path though
> Should be may re-calculated from AGs.
> 
> The fix aims to make xfs_validate_sb_write() happy.
> 
> Thanks,
> Wengang
> 
>> IOWs journal recovery doesn't actually care what these values are,
>> so what actually goes wrong if this sum returns a negative value?
>> 
>> Cheers,
>> 
>> Dave.
>> -- 
>> Dave Chinner
>> david@fromorbit.com
Darrick J. Wong May 31, 2024, 3:52 p.m. UTC | #4
On Fri, May 10, 2024 at 05:34:26PM -0700, Wengang Wang wrote:

You might want to lead off with the origins of this fixpatch:

"A user with a completely full filesystem experienced an unexpected
shutdown when the filesystem tried to write the superblock during
runtime:"

<dmesg dump here>

"When xfs_log_sb writes a superblock to disk, sb_fdblocks is fetched..."

(or so I'm guessing from the other replies in this thread?)

((What was it doing?  Adding the ATTR/ATTR2 feature to the filesystem?))

> when writting super block to disk (in xfs_log_sb), sb_fdblocks is fetched from
> m_fdblocks without any lock. As m_fdblocks can experience a positive -> negativ

"negative"

>  -> positive changing when the FS reaches fullness (see xfs_mod_fdblocks)
> So there is a chance that sb_fdblocks is negative, and because sb_fdblocks is
> type of unsigned long long, it reads super big. And sb_fdblocks being bigger
> than sb_dblocks is a problem during log recovery, xfs_validate_sb_write()
> complains.
> 
> Fix:
> As sb_fdblocks will be re-calculated during mount when lazysbcount is enabled,
> We just need to make xfs_validate_sb_write() happy -- make sure sb_fdblocks is
> not genative.

"negative".

This otherwise looks good to me.

--D

> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_sb.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> index 73a4b895de67..199756970383 100644
> --- a/fs/xfs/libxfs/xfs_sb.c
> +++ b/fs/xfs/libxfs/xfs_sb.c
> @@ -1037,7 +1037,7 @@ xfs_log_sb(
>  		mp->m_sb.sb_ifree = min_t(uint64_t,
>  				percpu_counter_sum(&mp->m_ifree),
>  				mp->m_sb.sb_icount);
> -		mp->m_sb.sb_fdblocks = percpu_counter_sum(&mp->m_fdblocks);
> +		mp->m_sb.sb_fdblocks = percpu_counter_sum_positive(&mp->m_fdblocks);
>  	}
>  
>  	xfs_sb_to_disk(bp->b_addr, &mp->m_sb);
> -- 
> 2.39.3 (Apple Git-146)
> 
>
Wengang Wang May 31, 2024, 6:03 p.m. UTC | #5
Yes, I was trying to analyze the problem from the perspective of code review/analysis only.
Since you prefer starting from the symptoms of the problem, I will try that.

Thanks,
wengang

> On May 31, 2024, at 8:52 AM, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> On Fri, May 10, 2024 at 05:34:26PM -0700, Wengang Wang wrote:
> 
> You might want to lead off with the origins of this fixpatch:
> 
> "A user with a completely full filesystem experienced an unexpected
> shutdown when the filesystem tried to write the superblock during
> runtime:"
> 
> <dmesg dump here>
> 
> "When xfs_log_sb writes a superblock to disk, sb_fdblocks is fetched..."
> 
> (or so I'm guessing from the other replies in this thread?)
> 
> ((What was it doing?  Adding the ATTR/ATTR2 feature to the filesystem?))
> 
>> when writting super block to disk (in xfs_log_sb), sb_fdblocks is fetched from
>> m_fdblocks without any lock. As m_fdblocks can experience a positive -> negativ
> 
> "negative"
> 
>> -> positive changing when the FS reaches fullness (see xfs_mod_fdblocks)
>> So there is a chance that sb_fdblocks is negative, and because sb_fdblocks is
>> type of unsigned long long, it reads super big. And sb_fdblocks being bigger
>> than sb_dblocks is a problem during log recovery, xfs_validate_sb_write()
>> complains.
>> 
>> Fix:
>> As sb_fdblocks will be re-calculated during mount when lazysbcount is enabled,
>> We just need to make xfs_validate_sb_write() happy -- make sure sb_fdblocks is
>> not genative.
> 
> "negative".
> 
> This otherwise looks good to me.
> 
> --D
> 
>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
>> ---
>> fs/xfs/libxfs/xfs_sb.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
>> index 73a4b895de67..199756970383 100644
>> --- a/fs/xfs/libxfs/xfs_sb.c
>> +++ b/fs/xfs/libxfs/xfs_sb.c
>> @@ -1037,7 +1037,7 @@ xfs_log_sb(
>> mp->m_sb.sb_ifree = min_t(uint64_t,
>> percpu_counter_sum(&mp->m_ifree),
>> mp->m_sb.sb_icount);
>> - mp->m_sb.sb_fdblocks = percpu_counter_sum(&mp->m_fdblocks);
>> + mp->m_sb.sb_fdblocks = percpu_counter_sum_positive(&mp->m_fdblocks);
>> }
>> 
>> xfs_sb_to_disk(bp->b_addr, &mp->m_sb);
>> -- 
>> 2.39.3 (Apple Git-146)
>> 
>>
Dave Chinner June 2, 2024, 11:37 p.m. UTC | #6
On Fri, May 31, 2024 at 03:44:56PM +0000, Wengang Wang wrote:
> Hi Dave,
> 
> Do you have further comments and/or suggestions? Or give a RB pls :D

Sorry, LSFMM intervened and I didn't notice your comment until now.

> > On May 13, 2024, at 10:06 AM, Wengang Wang <wen.gang.wang@oracle.com> wrote:
> > 
> > Things is that we have a metadump, looking at the fdblocks from super block 0, it is good.
> > 
> > $ xfs_db -c "sb 0" -c "p" cust.img |egrep "dblocks|ifree|icount"
> > dblocks = 26214400
> > icount = 512
> > ifree = 337
> > fdblocks = 25997100
> > 
> > And when looking at the log, we have the following:
> > 
> > $ egrep -a "fdblocks|icount|ifree" cust.log |tail
> > sb_fdblocks 37
> > sb_icount 1056
> > sb_ifree 87
> > sb_fdblocks 37
> > sb_icount 1056
> > sb_ifree 87
> > sb_fdblocks 37
> > sb_icount 1056
> > sb_ifree 87
> > sb_fdblocks 18446744073709551604
> > 
> > # cust.log is output of my script which tries to parse the log buffer.
> > 
> > 18446744073709551604ULL == 0xfffffffffffffff4 or -12LL 
> > 
> > With upstream kernel (6.7.0-rc3), when I tried to mount (log recover) the metadump,
> > I got the following in dmesg:
> > 
> > [   52.927796] XFS (loop0): SB summary counter sanity check failed
> > [   52.928889] XFS (loop0): Metadata corruption detected at xfs_sb_write_verify+0x60/0x110 [xfs], xfs_sb block 0x0
> > [   52.930890] XFS (loop0): Unmount and run xfs_repair
> > [   52.931797] XFS (loop0): First 128 bytes of corrupted metadata buffer:
> > [   52.932954] 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 01 90 00 00  XFSB............
> > [   52.934333] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> > [   52.935733] 00000020: c9 c1 ed ae 84 ed 46 b9 a1 f0 09 57 4a a9 98 42  ......F....WJ..B
> > [   52.937120] 00000030: 00 00 00 00 01 00 00 06 00 00 00 00 00 00 00 80  ................
> > [   52.938515] 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
> > [   52.939919] 00000050: 00 00 00 01 00 64 00 00 00 00 00 04 00 00 00 00  .....d..........
> > [   52.941293] 00000060: 00 00 64 00 b4 a5 02 00 02 00 00 08 00 00 00 00  ..d.............
> > [   52.942661] 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 17 00 00 19  ................
> > [   52.944046] XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply+0x38b/0x3a0 [xfs] (fs/xfs/xfs_buf.c:1559).  Shutting down filesystem.
> > [   52.946710] XFS (loop0): Please unmount the filesystem and rectify the problem(s)
> > [   52.948099] XFS (loop0): log mount/recovery failed: error -117
> > [   52.949810] XFS (loop0): log mount failed

And that's what should be in the commit message, as it explains
exactly how the problem occurred, the symptom that was seen, and
why the change is necessary. It also means that anyone else who sees
a similar problem and is grepping the commit history will see this
and recognise it, thereby knowing that this is the fix they need...

> > Looking at corresponding code:
> > 231 xfs_validate_sb_write(
> > 232         struct xfs_mount        *mp,
> > 233         struct xfs_buf          *bp,
> > 234         struct xfs_sb           *sbp)
> > 235 {
> > 236         /*
> > 237          * Carry out additional sb summary counter sanity checks when we write
> > 238          * the superblock.  We skip this in the read validator because there
> > 239          * could be newer superblocks in the log and if the values are garbage
> > 240          * even after replay we'll recalculate them at the end of log mount.
> > 241          *
> > 242          * mkfs has traditionally written zeroed counters to inprogress and
> > 243          * secondary superblocks, so allow this usage to continue because
> > 244          * we never read counters from such superblocks.
> > 245          */
> > 246         if (xfs_buf_daddr(bp) == XFS_SB_DADDR && !sbp->sb_inprogress &&
> > 247             (sbp->sb_fdblocks > sbp->sb_dblocks ||
> > 248              !xfs_verify_icount(mp, sbp->sb_icount) ||
> > 249              sbp->sb_ifree > sbp->sb_icount)) {
> > 250                 xfs_warn(mp, "SB summary counter sanity check failed");
> > 251                 return -EFSCORRUPTED;
> > 252         }
> > 
> > From dmesg and code, we know the check failure was due to bad sb_ifree vs sb_icount or bad sb_fdblocks vs sb_dblocks.
> > 
> > Looking at the super block dump and log dump,
> > We know ifree and icount are good, what’s bad is sb_fdblocks. And that sb_fdblocks is from log.
> > # I verified that sb_fdblocks is 0xfffffffffffffff4 with a UEK debug kernel (though not 6.7.0-rc3)
> > 
> > So the sb_fdblocks is updated from log to incore at xfs_log_sb() -> xfs_validate_sb_write() path though
> > Should be may re-calculated from AGs.
> > 
> > The fix aims to make xfs_validate_sb_write() happy.

What about the sb_icount and sb_ifree counters? They are also percpu
counters, and they can return transient negative numbers, too,
right? If they end up in the log, the same as this transient
negative sb_fdblocks count, won't that also cause exactly the same
issue?

i.e. if we need to fix the sb_fdblocks sum to always be positive,
then we need to do the same thing with the other lazy superblock
per-cpu counters so they don't trip over the same transient
underflow issue...

-Dave.
Wengang Wang June 3, 2024, 5:23 p.m. UTC | #7
> On Jun 2, 2024, at 4:37 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Fri, May 31, 2024 at 03:44:56PM +0000, Wengang Wang wrote:
>> Hi Dave,
>> 
>> Do you have further comments and/or suggestions? Or give a RB pls :D
> 
> Sorry, LSFMM intervened and I didn't notice your comment until now.
> 
No worries!

>>> On May 13, 2024, at 10:06 AM, Wengang Wang <wen.gang.wang@oracle.com> wrote:
>>> 
>>> Things is that we have a metadump, looking at the fdblocks from super block 0, it is good.
>>> 
>>> $ xfs_db -c "sb 0" -c "p" cust.img |egrep "dblocks|ifree|icount"
>>> dblocks = 26214400
>>> icount = 512
>>> ifree = 337
>>> fdblocks = 25997100
>>> 
>>> And when looking at the log, we have the following:
>>> 
>>> $ egrep -a "fdblocks|icount|ifree" cust.log |tail
>>> sb_fdblocks 37
>>> sb_icount 1056
>>> sb_ifree 87
>>> sb_fdblocks 37
>>> sb_icount 1056
>>> sb_ifree 87
>>> sb_fdblocks 37
>>> sb_icount 1056
>>> sb_ifree 87
>>> sb_fdblocks 18446744073709551604
>>> 
>>> # cust.log is output of my script which tries to parse the log buffer.
>>> 
>>> 18446744073709551604ULL == 0xfffffffffffffff4 or -12LL 
>>> 
>>> With upstream kernel (6.7.0-rc3), when I tried to mount (log recover) the metadump,
>>> I got the following in dmesg:
>>> 
>>> [   52.927796] XFS (loop0): SB summary counter sanity check failed
>>> [   52.928889] XFS (loop0): Metadata corruption detected at xfs_sb_write_verify+0x60/0x110 [xfs], xfs_sb block 0x0
>>> [   52.930890] XFS (loop0): Unmount and run xfs_repair
>>> [   52.931797] XFS (loop0): First 128 bytes of corrupted metadata buffer:
>>> [   52.932954] 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 01 90 00 00  XFSB............
>>> [   52.934333] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>>> [   52.935733] 00000020: c9 c1 ed ae 84 ed 46 b9 a1 f0 09 57 4a a9 98 42  ......F....WJ..B
>>> [   52.937120] 00000030: 00 00 00 00 01 00 00 06 00 00 00 00 00 00 00 80  ................
>>> [   52.938515] 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
>>> [   52.939919] 00000050: 00 00 00 01 00 64 00 00 00 00 00 04 00 00 00 00  .....d..........
>>> [   52.941293] 00000060: 00 00 64 00 b4 a5 02 00 02 00 00 08 00 00 00 00  ..d.............
>>> [   52.942661] 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 17 00 00 19  ................
>>> [   52.944046] XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply+0x38b/0x3a0 [xfs] (fs/xfs/xfs_buf.c:1559).  Shutting down filesystem.
>>> [   52.946710] XFS (loop0): Please unmount the filesystem and rectify the problem(s)
>>> [   52.948099] XFS (loop0): log mount/recovery failed: error -117
>>> [   52.949810] XFS (loop0): log mount failed
> 
> And that's what should be in the commit message, as it explains
> exactly how the problem occurred, the symptom that was seen, and
> why the change is necessary. It also means that anyone else who sees
> a similar problem and is grepping the commit history will see this
> and recognise it, thereby knowing that this is the fix they need...
> 

OK, got it.

>>> Looking at corresponding code:
>>> 231 xfs_validate_sb_write(
>>> 232         struct xfs_mount        *mp,
>>> 233         struct xfs_buf          *bp,
>>> 234         struct xfs_sb           *sbp)
>>> 235 {
>>> 236         /*
>>> 237          * Carry out additional sb summary counter sanity checks when we write
>>> 238          * the superblock.  We skip this in the read validator because there
>>> 239          * could be newer superblocks in the log and if the values are garbage
>>> 240          * even after replay we'll recalculate them at the end of log mount.
>>> 241          *
>>> 242          * mkfs has traditionally written zeroed counters to inprogress and
>>> 243          * secondary superblocks, so allow this usage to continue because
>>> 244          * we never read counters from such superblocks.
>>> 245          */
>>> 246         if (xfs_buf_daddr(bp) == XFS_SB_DADDR && !sbp->sb_inprogress &&
>>> 247             (sbp->sb_fdblocks > sbp->sb_dblocks ||
>>> 248              !xfs_verify_icount(mp, sbp->sb_icount) ||
>>> 249              sbp->sb_ifree > sbp->sb_icount)) {
>>> 250                 xfs_warn(mp, "SB summary counter sanity check failed");
>>> 251                 return -EFSCORRUPTED;
>>> 252         }
>>> 
>>> From dmesg and code, we know the check failure was due to bad sb_ifree vs sb_icount or bad sb_fdblocks vs sb_dblocks.
>>> 
>>> Looking at the super block dump and log dump,
>>> We know ifree and icount are good, what’s bad is sb_fdblocks. And that sb_fdblocks is from log.
>>> # I verified that sb_fdblocks is 0xfffffffffffffff4 with a UEK debug kernel (though not 6.7.0-rc3)
>>> 
>>> So the sb_fdblocks is updated from log to incore at xfs_log_sb() -> xfs_validate_sb_write() path though
>>> Should be may re-calculated from AGs.
>>> 
>>> The fix aims to make xfs_validate_sb_write() happy.
> 
> What about the sb_icount and sb_ifree counters? They are also percpu
> counters, and they can return transient negative numbers, too,
> right? If they end up in the log, the same as this transient
> negative sb_fdblocks count, won't that also cause exactly the same
> issue?
> 

Yes, sb_icount and sb_ifree are also percpu counters. They have been addressed by
commit 59f6ab40fd8735c9a1a15401610a31cc06a0bbd6, right?

> i.e. if we need to fix the sb_fdblocks sum to always be positive,
> then we need to do the same thing with the other lazy superblock
> per-cpu counters so they don't trip the over the same transient
> underflow issue...
> 

Agreed. That said, I think we don’t have further percpu counter problems after this patch.

Will send a new patch with proper line breaks.

Thanks,
Wengang

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Darrick J. Wong June 3, 2024, 6:43 p.m. UTC | #8
On Mon, Jun 03, 2024 at 05:23:40PM +0000, Wengang Wang wrote:
> 
> 
> > On Jun 2, 2024, at 4:37 PM, Dave Chinner <david@fromorbit.com> wrote:
> > 
> > On Fri, May 31, 2024 at 03:44:56PM +0000, Wengang Wang wrote:
> >> Hi Dave,
> >> 
> >> Do you have further comments and/or suggestions? Or give a RB pls :D
> > 
> > Sorry, LSFMM intervened and I didn't notice your comment until now.
> > 
> No worries!
> 
> >>> On May 13, 2024, at 10:06 AM, Wengang Wang <wen.gang.wang@oracle.com> wrote:
> >>> 
> >>> Things is that we have a metadump, looking at the fdblocks from super block 0, it is good.
> >>> 
> >>> $ xfs_db -c "sb 0" -c "p" cust.img |egrep "dblocks|ifree|icount"
> >>> dblocks = 26214400
> >>> icount = 512
> >>> ifree = 337
> >>> fdblocks = 25997100
> >>> 
> >>> And when looking at the log, we have the following:
> >>> 
> >>> $ egrep -a "fdblocks|icount|ifree" cust.log |tail
> >>> sb_fdblocks 37
> >>> sb_icount 1056
> >>> sb_ifree 87
> >>> sb_fdblocks 37
> >>> sb_icount 1056
> >>> sb_ifree 87
> >>> sb_fdblocks 37
> >>> sb_icount 1056
> >>> sb_ifree 87
> >>> sb_fdblocks 18446744073709551604
> >>> 
> >>> # cust.log is output of my script which tries to parse the log buffer.
> >>> 
> >>> 18446744073709551604ULL == 0xfffffffffffffff4 or -12LL 
> >>> 
> >>> With upstream kernel (6.7.0-rc3), when I tried to mount (log recover) the metadump,
> >>> I got the following in dmesg:
> >>> 
> >>> [   52.927796] XFS (loop0): SB summary counter sanity check failed
> >>> [   52.928889] XFS (loop0): Metadata corruption detected at xfs_sb_write_verify+0x60/0x110 [xfs], xfs_sb block 0x0
> >>> [   52.930890] XFS (loop0): Unmount and run xfs_repair
> >>> [   52.931797] XFS (loop0): First 128 bytes of corrupted metadata buffer:
> >>> [   52.932954] 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 01 90 00 00  XFSB............
> >>> [   52.934333] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> >>> [   52.935733] 00000020: c9 c1 ed ae 84 ed 46 b9 a1 f0 09 57 4a a9 98 42  ......F....WJ..B
> >>> [   52.937120] 00000030: 00 00 00 00 01 00 00 06 00 00 00 00 00 00 00 80  ................
> >>> [   52.938515] 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
> >>> [   52.939919] 00000050: 00 00 00 01 00 64 00 00 00 00 00 04 00 00 00 00  .....d..........
> >>> [   52.941293] 00000060: 00 00 64 00 b4 a5 02 00 02 00 00 08 00 00 00 00  ..d.............
> >>> [   52.942661] 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 17 00 00 19  ................
> >>> [   52.944046] XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply+0x38b/0x3a0 [xfs] (fs/xfs/xfs_buf.c:1559).  Shutting down filesystem.
> >>> [   52.946710] XFS (loop0): Please unmount the filesystem and rectify the problem(s)
> >>> [   52.948099] XFS (loop0): log mount/recovery failed: error -117
> >>> [   52.949810] XFS (loop0): log mount failed
> > 
> > And that's what should be in the commit message, as it explains
> > exactly how the problem occurred, the symptom that was seen, and
> > why the change is necessary. It also means that anyone else who sees
> > a similar problem and is grepping the commit history will see this
> > and recognise it, thereby knowing that this is the fix they need...
> > 
> 
> OK, got it.
> 
> >>> Looking at corresponding code:
> >>> 231 xfs_validate_sb_write(
> >>> 232         struct xfs_mount        *mp,
> >>> 233         struct xfs_buf          *bp,
> >>> 234         struct xfs_sb           *sbp)
> >>> 235 {
> >>> 236         /*
> >>> 237          * Carry out additional sb summary counter sanity checks when we write
> >>> 238          * the superblock.  We skip this in the read validator because there
> >>> 239          * could be newer superblocks in the log and if the values are garbage
> >>> 240          * even after replay we'll recalculate them at the end of log mount.
> >>> 241          *
> >>> 242          * mkfs has traditionally written zeroed counters to inprogress and
> >>> 243          * secondary superblocks, so allow this usage to continue because
> >>> 244          * we never read counters from such superblocks.
> >>> 245          */
> >>> 246         if (xfs_buf_daddr(bp) == XFS_SB_DADDR && !sbp->sb_inprogress &&
> >>> 247             (sbp->sb_fdblocks > sbp->sb_dblocks ||
> >>> 248              !xfs_verify_icount(mp, sbp->sb_icount) ||
> >>> 249              sbp->sb_ifree > sbp->sb_icount)) {
> >>> 250                 xfs_warn(mp, "SB summary counter sanity check failed");
> >>> 251                 return -EFSCORRUPTED;
> >>> 252         }
> >>> 
> >>> From dmesg and code, we know the check failure was due to bad sb_ifree vs sb_icount or bad sb_fdblocks vs sb_dblocks.
> >>> 
> >>> Looking at the super block dump and log dump,
> >>> We know ifree and icount are good, what’s bad is sb_fdblocks. And that sb_fdblocks is from log.
> >>> # I verified that sb_fdblocks is 0xfffffffffffffff4 with a UEK debug kernel (though not 6.7.0-rc3)
> >>> 
> >>> So the sb_fdblocks is updated from log to incore at xfs_log_sb() -> xfs_validate_sb_write() path though
> >>> Should be may re-calculated from AGs.
> >>> 
> >>> The fix aims to make xfs_validate_sb_write() happy.
> > 
> > What about the sb_icount and sb_ifree counters? They are also percpu
> > counters, and they can return transient negative numbers, too,
> > right? If they end up in the log, the same as this transient
> > negative sb_fdblocks count, won't that also cause exactly the same
> > issue?
> > 
> 
> Yes, sb_icount and sb_ifree are also percpu counters. They have been addressed by
> commit 59f6ab40fd8735c9a1a15401610a31cc06a0bbd6, right?

icount and ifree don't go through xfs_mod_freecounter, which means that
they never do the "subtract and see if we went negative" dance that
fdblocks/frextents does.  percpu_counter_sum_positive isn't necessary.

--D

> > i.e. if we need to fix the sb_fdblocks sum to always be positive,
> > then we need to do the same thing with the other lazy superblock
> > per-cpu counters so they don't trip the over the same transient
> > underflow issue...
> > 
> 
> Agreed. While, I think we don’t have further percpu counters problems after this patch.  
> 
> Will send a new patch with line breakness.
> 
> Thanks,
> Wengang
> 
> > -Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
>
Dave Chinner June 4, 2024, 4:06 a.m. UTC | #9
On Mon, Jun 03, 2024 at 11:43:27AM -0700, Darrick J. Wong wrote:
> On Mon, Jun 03, 2024 at 05:23:40PM +0000, Wengang Wang wrote:
> > > What about the sb_icount and sb_ifree counters? They are also percpu
> > > counters, and they can return transient negative numbers, too,
> > > right? If they end up in the log, the same as this transient
> > > negative sb_fdblocks count, won't that also cause exactly the same
> > > issue?
> > > 
> > 
> > Yes, sb_icount and sb_ifree are also percpu counters. They have been addressed by
> > commit 59f6ab40fd8735c9a1a15401610a31cc06a0bbd6, right?
> 
> icount and ifree don't go through xfs_mod_freecounter, which means that
> they never do the "subtract and see if we went negative" dance that
> fdblocks/frextents does.  percpu_counter_sum_positive isn't necessary.

I'm pretty sure that what xfs_mod_freecounter() does has no bearing
on whether percpu_counter_sum() can return a negative number.
percpu_counter_sum() is intentionally designed to be racy w.r.t.
sub-batch percpu counter updates.

That is, percpu_counter_sum() takes the fbc->lock while summing only
to bound the maximum sum error to +/-(nr_cpus * batch size). It does
not serialise against sub-batch percpu count modifications - add/sub
only serialise against a sum when they go over the batch size and
need folding. Hence the sum will serialise against folding
operations, thereby bounding the maximum error across the sum to
+/-batch size on every CPU.

For example, if we have a counter with a batch of 128, 4 CPUs,
fbc->count = 0 and the current percpu count values are:

CPU 	value
 0	32
 1	0
 2	-32
 3	0


We have a sum of zero when everything is static and unchanging.

However, when we look at the dynamic nature of the sum, we can have
percpu counters change during the summing iteration. For example,
say between the sum starting and finishing on CPU 3, we have a
racing subtraction on CPU 0 of 64, and and add of 64 on CPU 1.
Neither trigger batch overruns so don't serialise against a sum in
progress.

If the running sum samples CPU 0 after it updates, and CPU 1 before
it updates, the sum will see:

fbc->count = 0

CPU 	value
 0	-32
 1	0
 2	-32
 3	0

and the sum will be -64. If we then rerun the sum and it doesn't
race with any updates, it will see:

fbc->count = 0

CPU 	value
 0	-32
 1	64
 2	-32
 3	0

and the sum will be zero.

The above example could actually happen with ifree/icount.
Modifications to ifree and icount in
xfs_trans_unreserve_and_mod_sb() aren't serialised against each
other (i.e. multiple alloc/free transactions can be committing and
updating the ifree/icount counters concurrently) and they can also
be running concurrently with the superblock update summing in
xfs_log_sb(). Hence the inherent raciness in percpu_counter_sum() vs
percpu_counter_add() is not mitigated in any way, and so ifree and
icount could have a transient sum that is less than zero....

So AFAICT, they need the _sum_positive() treatment as well.
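
To make the interleaving above concrete, here is a minimal user-space sketch of the same sequence (an illustration only -- a plain array stands in for the kernel's percpu_counter machinery, using the 4-CPU, sub-batch values from the example above):

#include <stdio.h>
#include <stdint.h>

#define NR_CPUS	4

int main(void)
{
	int64_t global_count = 0;			/* models fbc->count */
	int64_t pcpu[NR_CPUS] = { 32, 0, -32, 0 };	/* per-CPU deltas */
	int64_t sum;
	int cpu;

	/* CPU 0 subtracts 64 before the summation reaches it; neither this
	 * nor the later add on CPU 1 exceeds the batch, so neither folds
	 * into global_count or serialises against the sum in progress. */
	pcpu[0] -= 64;					/* CPU 0: 32 -> -32 */

	sum = global_count;
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		sum += pcpu[cpu];
		if (cpu == 1)
			pcpu[1] += 64;			/* CPU 1 adds 64 after being sampled */
	}
	printf("racy sum:   %lld\n", (long long)sum);	/* -64 */

	/* Re-run the sum with no racing updates: the same counters now sum
	 * back to zero. */
	sum = global_count;
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		sum += pcpu[cpu];
	printf("stable sum: %lld\n", (long long)sum);	/* 0 */

	return 0;
}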

-Dave.

Patch

diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 73a4b895de67..199756970383 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -1037,7 +1037,7 @@  xfs_log_sb(
 		mp->m_sb.sb_ifree = min_t(uint64_t,
 				percpu_counter_sum(&mp->m_ifree),
 				mp->m_sb.sb_icount);
-		mp->m_sb.sb_fdblocks = percpu_counter_sum(&mp->m_fdblocks);
+		mp->m_sb.sb_fdblocks = percpu_counter_sum_positive(&mp->m_fdblocks);
 	}
 
 	xfs_sb_to_disk(bp->b_addr, &mp->m_sb);
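
For reference, the behaviour the one-line change relies on can be modelled in user space as a clamp on the summed value (a sketch of the semantics only, not the kernel's percpu_counter_sum_positive() implementation):

#include <stdio.h>
#include <stdint.h>

/* Model of the clamp: the _positive() summation variant never returns a
 * negative total, so a transient negative sum is stored as 0 rather than
 * wrapping to a huge value in the unsigned sb_fdblocks field. */
static int64_t sum_positive_model(int64_t racy_sum)
{
	return racy_sum < 0 ? 0 : racy_sum;
}

int main(void)
{
	int64_t racy_sum = -12;		/* the transient sum seen in the log */

	/* prints 0 instead of 18446744073709551604 */
	printf("%llu\n",
	       (unsigned long long)(uint64_t)sum_positive_model(racy_sum));
	return 0;
}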