diff mbox series

[2/3] xfs: delalloc and quota softlimit timers are incoherent

Message ID 20241112221920.1105007-3-david@fromorbit.com (mailing list archive)
State Under Review
Headers show
Series xfs: miscellaneous bug fixes | expand

Commit Message

Dave Chinner Nov. 12, 2024, 10:05 p.m. UTC
From: Supriya Wickrematillake <sup@sgi.com>

I've been seeing this failure on during xfs/050 recently:

 XFS: Assertion failed: dst->d_spc_timer != 0, file: fs/xfs/xfs_qm_syscalls.c, line: 435
....
 Call Trace:
  <TASK>
  xfs_qm_scall_getquota_fill_qc+0x2a2/0x2b0
  xfs_qm_scall_getquota_next+0x69/0xa0
  xfs_fs_get_nextdqblk+0x62/0xf0
  quota_getnextxquota+0xbf/0x320
  do_quotactl+0x1a1/0x410
  __se_sys_quotactl+0x126/0x310
  __x64_sys_quotactl+0x21/0x30
  x64_sys_call+0x2819/0x2ee0
  do_syscall_64+0x68/0x130
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

It turns out that the _qmount call has silently been failing to
unmount and mount the filesystem, so when the softlimit is pushed
past with a buffered write, it is not getting synced to disk before
the next quota report is being run.

Hence when the quota report runs, we have 300 blocks of delalloc
data on an inode, with a soft limit of 200 blocks. XFS dquots
account delalloc reservations as used space, hence the dquot is over
the soft limit.

However, we don't update the soft limit timers until we do a
transactional update of the dquot. That is, the dquot sits over the
soft limit without a softlimit timer being started until writeback
occurs and the allocation modifies the dquot and we call
xfs_qm_adjust_dqtimers() from xfs_trans_apply_dquot_deltas() in
xfs_trans_commit() context.

This isn't really a problem, except for this debug code in
xfs_qm_scall_getquota_fill_qc():

#ifdef DEBUG
        if (xfs_dquot_is_enforced(dqp) && dqp->q_id != 0) {
                if ((dst->d_space > dst->d_spc_softlimit) &&
                    (dst->d_spc_softlimit > 0)) {
                        ASSERT(dst->d_spc_timer != 0);
                }
....

It asserts taht if the used block count is over the soft limit,
it *must* have a soft limit timer running. This is clearly not
the case, because we haven't committed the delalloc space to disk
yet. Hence the soft limit is only exceeded temporarily in memory
(which isn't an issue) and we start the timer the moment we exceed
the soft limit in journalled metadata.

This debug was introduced in:

commit 0d5ad8383061fbc0a9804fbb98218750000fe032
Author: Supriya Wickrematillake <sup@sgi.com>
Date:   Wed May 15 22:44:44 1996 +0000

    initial checkin
    quotactl syscall functions.

The very first quota support commit back in 1996. This is zero-day
debug for Irix and, as it turns out, a zero-day bug in the debug
code because the delalloc code on Irix didn't update the softlimit
timers, either.

IOWs, this issue has been in the code for 28 years.

We obviously don't care if soft limit timers are a bit rubbery when
we have delalloc reservations in memory. Production systems running
quota reports have been exposed to this situation for 28 years and
nobody has noticed it, so the debug code is essentially worthless at
this point in time.

We also have the on-disk dquot verifiers checking that the soft
limit timer is running whenever the dquot is over the soft limit
before we write it to disk and after we read it from disk. These
aren't firing, so it is clear the issue is purely a temporary
in-memory incoherency that I never would have noticed had the test
not silently failed to unmount the filesystem.

Hence I'm simply going to trash this runtime debug because it isn't
useful in the slightest for catching quota bugs.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_qm_syscalls.c | 13 -------------
 1 file changed, 13 deletions(-)

Comments

Darrick J. Wong Nov. 12, 2024, 11:48 p.m. UTC | #1
On Wed, Nov 13, 2024 at 09:05:15AM +1100, Dave Chinner wrote:
> From: Supriya Wickrematillake <sup@sgi.com>

Wow, there are still people working at SGI??

> I've been seeing this failure on during xfs/050 recently:
> 
>  XFS: Assertion failed: dst->d_spc_timer != 0, file: fs/xfs/xfs_qm_syscalls.c, line: 435
> ....
>  Call Trace:
>   <TASK>
>   xfs_qm_scall_getquota_fill_qc+0x2a2/0x2b0
>   xfs_qm_scall_getquota_next+0x69/0xa0
>   xfs_fs_get_nextdqblk+0x62/0xf0
>   quota_getnextxquota+0xbf/0x320
>   do_quotactl+0x1a1/0x410
>   __se_sys_quotactl+0x126/0x310
>   __x64_sys_quotactl+0x21/0x30
>   x64_sys_call+0x2819/0x2ee0
>   do_syscall_64+0x68/0x130
>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
> It turns out that the _qmount call has silently been failing to
> unmount and mount the filesystem, so when the softlimit is pushed
> past with a buffered write, it is not getting synced to disk before
> the next quota report is being run.
> 
> Hence when the quota report runs, we have 300 blocks of delalloc
> data on an inode, with a soft limit of 200 blocks. XFS dquots
> account delalloc reservations as used space, hence the dquot is over
> the soft limit.
> 
> However, we don't update the soft limit timers until we do a
> transactional update of the dquot. That is, the dquot sits over the
> soft limit without a softlimit timer being started until writeback
> occurs and the allocation modifies the dquot and we call
> xfs_qm_adjust_dqtimers() from xfs_trans_apply_dquot_deltas() in
> xfs_trans_commit() context.
> 
> This isn't really a problem, except for this debug code in
> xfs_qm_scall_getquota_fill_qc():
> 
> #ifdef DEBUG
>         if (xfs_dquot_is_enforced(dqp) && dqp->q_id != 0) {
>                 if ((dst->d_space > dst->d_spc_softlimit) &&
>                     (dst->d_spc_softlimit > 0)) {
>                         ASSERT(dst->d_spc_timer != 0);
>                 }
> ....
> 
> It asserts taht if the used block count is over the soft limit,
> it *must* have a soft limit timer running. This is clearly not
> the case, because we haven't committed the delalloc space to disk
> yet. Hence the soft limit is only exceeded temporarily in memory
> (which isn't an issue) and we start the timer the moment we exceed
> the soft limit in journalled metadata.
> 
> This debug was introduced in:
> 
> commit 0d5ad8383061fbc0a9804fbb98218750000fe032
> Author: Supriya Wickrematillake <sup@sgi.com>
> Date:   Wed May 15 22:44:44 1996 +0000
> 
>     initial checkin
>     quotactl syscall functions.
> 
> The very first quota support commit back in 1996. This is zero-day
> debug for Irix and, as it turns out, a zero-day bug in the debug
> code because the delalloc code on Irix didn't update the softlimit
> timers, either.
> 
> IOWs, this issue has been in the code for 28 years.
> 
> We obviously don't care if soft limit timers are a bit rubbery when
> we have delalloc reservations in memory. Production systems running
> quota reports have been exposed to this situation for 28 years and
> nobody has noticed it, so the debug code is essentially worthless at
> this point in time.
> 
> We also have the on-disk dquot verifiers checking that the soft
> limit timer is running whenever the dquot is over the soft limit
> before we write it to disk and after we read it from disk. These
> aren't firing, so it is clear the issue is purely a temporary
> in-memory incoherency that I never would have noticed had the test
> not silently failed to unmount the filesystem.
> 
> Hence I'm simply going to trash this runtime debug because it isn't
> useful in the slightest for catching quota bugs.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Agreed!  I've hit this once in a blue moon and didn't think it was
especially useful either.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_qm_syscalls.c | 13 -------------
>  1 file changed, 13 deletions(-)
> 
> diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
> index 4eda50ae2d1c..0c78f30fa4a3 100644
> --- a/fs/xfs/xfs_qm_syscalls.c
> +++ b/fs/xfs/xfs_qm_syscalls.c
> @@ -427,19 +427,6 @@ xfs_qm_scall_getquota_fill_qc(
>  		dst->d_ino_timer = 0;
>  		dst->d_rt_spc_timer = 0;
>  	}
> -
> -#ifdef DEBUG
> -	if (xfs_dquot_is_enforced(dqp) && dqp->q_id != 0) {
> -		if ((dst->d_space > dst->d_spc_softlimit) &&
> -		    (dst->d_spc_softlimit > 0)) {
> -			ASSERT(dst->d_spc_timer != 0);
> -		}
> -		if ((dst->d_ino_count > dqp->q_ino.softlimit) &&
> -		    (dqp->q_ino.softlimit > 0)) {
> -			ASSERT(dst->d_ino_timer != 0);
> -		}
> -	}
> -#endif
>  }
>  
>  /* Return the quota information for the dquot matching id. */
> -- 
> 2.45.2
> 
>
Dave Chinner Nov. 13, 2024, 12:14 a.m. UTC | #2
On Tue, Nov 12, 2024 at 03:48:35PM -0800, Darrick J. Wong wrote:
> On Wed, Nov 13, 2024 at 09:05:15AM +1100, Dave Chinner wrote:
> > From: Supriya Wickrematillake <sup@sgi.com>
> 
> Wow, there are still people working at SGI??

huh, that's a git-send-email screwup. It must have pulled that from
the quote of the initial quota commit:

> > This debug was introduced in:
> > 
> > commit 0d5ad8383061fbc0a9804fbb98218750000fe032
> > Author: Supriya Wickrematillake <sup@sgi.com>
> > Date:   Wed May 15 22:44:44 1996 +0000
> > 
> >     initial checkin
> >     quotactl syscall functions.

Here, and decided to ignore the actual "From: dchinner@redhat" tag
in the local commit message.

Great work, git!


> > The very first quota support commit back in 1996. This is zero-day
> > debug for Irix and, as it turns out, a zero-day bug in the debug
> > code because the delalloc code on Irix didn't update the softlimit
> > timers, either.
> > 
> > IOWs, this issue has been in the code for 28 years.
> > 
> > We obviously don't care if soft limit timers are a bit rubbery when
> > we have delalloc reservations in memory. Production systems running
> > quota reports have been exposed to this situation for 28 years and
> > nobody has noticed it, so the debug code is essentially worthless at
> > this point in time.
> > 
> > We also have the on-disk dquot verifiers checking that the soft
> > limit timer is running whenever the dquot is over the soft limit
> > before we write it to disk and after we read it from disk. These
> > aren't firing, so it is clear the issue is purely a temporary
> > in-memory incoherency that I never would have noticed had the test
> > not silently failed to unmount the filesystem.
> > 
> > Hence I'm simply going to trash this runtime debug because it isn't
> > useful in the slightest for catching quota bugs.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> Agreed!  I've hit this once in a blue moon and didn't think it was
> especially useful either.
> 
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Thanks!

-Dave.
Christoph Hellwig Nov. 13, 2024, 8:48 a.m. UTC | #3
Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
diff mbox series

Patch

diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
index 4eda50ae2d1c..0c78f30fa4a3 100644
--- a/fs/xfs/xfs_qm_syscalls.c
+++ b/fs/xfs/xfs_qm_syscalls.c
@@ -427,19 +427,6 @@  xfs_qm_scall_getquota_fill_qc(
 		dst->d_ino_timer = 0;
 		dst->d_rt_spc_timer = 0;
 	}
-
-#ifdef DEBUG
-	if (xfs_dquot_is_enforced(dqp) && dqp->q_id != 0) {
-		if ((dst->d_space > dst->d_spc_softlimit) &&
-		    (dst->d_spc_softlimit > 0)) {
-			ASSERT(dst->d_spc_timer != 0);
-		}
-		if ((dst->d_ino_count > dqp->q_ino.softlimit) &&
-		    (dqp->q_ino.softlimit > 0)) {
-			ASSERT(dst->d_ino_timer != 0);
-		}
-	}
-#endif
 }
 
 /* Return the quota information for the dquot matching id. */