[RFC,v2] xfs: run blockgc on freeze to avoid iget stalls after reclaim

Message ID 20240119193645.354214-1-bfoster@redhat.com (mailing list archive)
State New

Commit Message

Brian Foster Jan. 19, 2024, 7:36 p.m. UTC
We've had reports on distro (pre-deferred inactivation) kernels that
inode reclaim (i.e. via drop_caches) can deadlock on the s_umount
lock when invoked on a frozen XFS fs. This occurs because
drop_caches acquires the lock and then blocks in xfs_inactive() on
transaction alloc for an inode that requires an eofb trim. unfreeze
then blocks on the same lock and the fs is deadlocked.

With deferred inactivation, the deadlock problem is no longer
present because ->destroy_inode() no longer blocks whether the fs is
frozen or not. There is still unfortunate behavior in that lookups
of a pending inactive inode spin loop waiting for the pending
inactive state to clear, which won't happen until the fs is
unfrozen. This was always possible to some degree, but is
potentially amplified by the fact that reclaim no longer blocks on
the first inode that requires inactivation work. Instead, we
populate the inactivation queues indefinitely. The side effect can
be observed easily by invoking drop_caches on a frozen fs previously
populated with eofb and/or cowblocks inodes and then running
anything that relies on inode lookup (i.e., ls).
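
For reference, a minimal user-space sketch of that reproduction sequence
(my illustration, not part of the patch; it assumes an XFS mount point as
argv[1], a fs previously populated with eofb/cowblocks inodes, and a
caller with CAP_SYS_ADMIN):

	/* freeze + drop_caches reproducer sketch */
	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <unistd.h>
	#include <linux/fs.h>	/* FIFREEZE, FITHAW */

	int main(int argc, char **argv)
	{
		int fd, dc;

		if (argc < 2) {
			fprintf(stderr, "usage: %s <xfs mountpoint>\n", argv[0]);
			return 1;
		}

		fd = open(argv[1], O_RDONLY);
		if (fd < 0 || ioctl(fd, FIFREEZE, 0) < 0) {
			perror("freeze");
			return 1;
		}

		/*
		 * Evict clean inodes. On pre-deferred-inactivation kernels
		 * this write blocks in xfs_inactive() with s_umount held and
		 * a subsequent unfreeze deadlocks; on current kernels it
		 * returns, but the evicted inodes that need blockgc work sit
		 * on the inactivation queues until thaw.
		 */
		dc = open("/proc/sys/vm/drop_caches", O_WRONLY);
		if (dc >= 0) {
			if (write(dc, "2", 1) < 0)	/* 2 == reclaim slab */
				perror("drop_caches");
			close(dc);
		}

		/* From another shell, "ls <mnt>" now stalls until the thaw. */
		printf("frozen; press enter to thaw\n");
		getchar();

		if (ioctl(fd, FITHAW, 0) < 0)
			perror("thaw");
		close(fd);
		return 0;
	}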

To mitigate this behavior, invoke a non-sync blockgc scan during the
freeze sequence to minimize the chance that inode evictions require
inactivation while the fs is frozen. A synchronous scan would
provide more of a guarantee, but is potentially unsafe from
partially frozen context. This is because a file read task may be
blocked on a write fault while holding iolock (such as when reading
into a mapped buffer) and a sync scan retries indefinitely on iolock
failure. Therefore, this adds risk of potential livelock during the
freeze sequence.
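
As an aside, the only functional difference between the two scan modes is
whether the inode walk is passed XFS_ICWALK_FLAG_SYNC. Roughly (illustration
only, wrapper name made up):

	/* contrast of the two blockgc scan modes discussed above */
	static int
	xfs_blockgc_scan_example(
		struct xfs_mount	*mp,
		bool			sync)
	{
		struct xfs_icwalk	icw = { 0 };

		if (!sync)
			/* best-effort (this patch): skip busy iolocks */
			return xfs_blockgc_free_space(mp, NULL);

		/* sync: retries on iolock failure, hence the livelock risk */
		icw.icw_flags = XFS_ICWALK_FLAG_SYNC;
		return xfs_blockgc_free_space(mp, &icw);
	}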

Finally, since the deadlock issue was present for such a long time,
also document the subtle ->destroy_inode() constraint to avoid
unintentional reintroduction of the deadlock problem in the future.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---

Hi all,

There was a good amount of discussion on the first version of this patch
[1] a couple or so years ago now. The main issue was that using a sync
scan is unsafe in certain cases (best described here [2]), so this
best-effort approach was considered as a fallback option to improve
behavior.

The reason I'm reposting this is that it is one of several options for
dealing with the aforementioned deadlock on stable/distro kernels, so it
seems to have mutual benefit. Looking back through the original
discussion, I think there are several ways this could be improved to
provide the benefit of a sync scan. For example, if the scan could be
made to run before faults are locked out (re [3]), that may be
sufficient to allow a sync scan. Or now that freeze_super() actually
checks for ->sync_fs() errors, an async scan could be followed by a
check for tagged blockgc entries that triggers an -EBUSY or some error
return to fail the freeze, which would most likely be a rare and
transient situation. Etc.
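
For illustration, that last idea might look something like this in
xfs_fs_sync_fs() (xfs_blockgc_pending() is a made-up helper that would
check whether any perag still has XFS_ICI_BLOCKGC_TAG set; not actual
code):

	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT) {
		xfs_blockgc_free_space(mp, NULL);	/* async scan */
		if (xfs_blockgc_pending(mp))		/* hypothetical helper */
			return -EBUSY;			/* fail the freeze */

		xfs_inodegc_stop(mp);
		xfs_blockgc_stop(mp);
	}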

These thoughts are mainly incremental improvements upon some form of
freeze time scan and may not be of significant additional value given
current upstream behavior, so this patch takes the simple, best effort
approach. Thoughts?

Brian

[1] https://lore.kernel.org/linux-xfs/20220113133701.629593-1-bfoster@redhat.com/
[2] https://lore.kernel.org/linux-xfs/20220115224030.GA59729@dread.disaster.area/
[3] https://lore.kernel.org/linux-xfs/Yehvc4g+WakcG1mP@bfoster/

 fs/xfs/xfs_super.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

Comments

Amir Goldstein Jan. 20, 2024, 8:50 a.m. UTC | #1
On Fri, Jan 19, 2024 at 9:35 PM Brian Foster <bfoster@redhat.com> wrote:
>
> We've had reports on distro (pre-deferred inactivation) kernels that
> inode reclaim (i.e. via drop_caches) can deadlock on the s_umount
> lock when invoked on a frozen XFS fs. This occurs because
> drop_caches acquires the lock and then blocks in xfs_inactive() on
> transaction alloc for an inode that requires an eofb trim. unfreeze
> then blocks on the same lock and the fs is deadlocked.
>
> With deferred inactivation, the deadlock problem is no longer
> present because ->destroy_inode() no longer blocks whether the fs is
> frozen or not. There is still unfortunate behavior in that lookups
> of a pending inactive inode spin loop waiting for the pending
> inactive state to clear, which won't happen until the fs is
> unfrozen. This was always possible to some degree, but is
> potentially amplified by the fact that reclaim no longer blocks on
> the first inode that requires inactivation work. Instead, we
> populate the inactivation queues indefinitely. The side effect can
> be observed easily by invoking drop_caches on a frozen fs previously
> populated with eofb and/or cowblocks inodes and then running
> anything that relies on inode lookup (i.e., ls).
>
> To mitigate this behavior, invoke a non-sync blockgc scan during the
> freeze sequence to minimize the chance that inode evictions require
> inactivation while the fs is frozen. A synchronous scan would
> provide more of a guarantee, but is potentially unsafe from
> partially frozen context. This is because a file read task may be
> blocked on a write fault while holding iolock (such as when reading
> into a mapped buffer) and a sync scan retries indefinitely on iolock
> failure. Therefore, this adds risk of potential livelock during the
> freeze sequence.
>
> Finally, since the deadlock issue was present for such a long time,
> also document the subtle ->destroy_inode() constraint to avoid
> unintentional reintroduction of the deadlock problem in the future.
>
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Is there an appropriate Fixes: commit that could be mentioned here?
or at least a range of stable kernels to apply this suggested fix?

Thanks,
Amir.

> ---
>
> Hi all,
>
> There was a good amount of discussion on the first version of this patch
> [1] a couple or so years ago now. The main issue was that using a sync
> scan is unsafe in certain cases (best described here [2]), so this
> best-effort approach was considered as a fallback option to improve
> behavior.
>
> The reason I'm reposting this is that it is one of several options for
> dealing with the aforementioned deadlock on stable/distro kernels, so it
> seems to have mutual benefit. Looking back through the original
> discussion, I think there are several ways this could be improved to
> provide the benefit of a sync scan. For example, if the scan could be
> made to run before faults are locked out (re [3]), that may be
> sufficient to allow a sync scan. Or now that freeze_super() actually
> checks for ->sync_fs() errors, an async scan could be followed by a
> check for tagged blockgc entries that triggers an -EBUSY or some error
> return to fail the freeze, which would most likely be a rare and
> transient situation. Etc.
>
> These thoughts are mainly incremental improvements upon some form of
> freeze time scan and may not be of significant additional value given
> current upstream behavior, so this patch takes the simple, best effort
> approach. Thoughts?
>
> Brian
>
> [1] https://lore.kernel.org/linux-xfs/20220113133701.629593-1-bfoster@redhat.com/
> [2] https://lore.kernel.org/linux-xfs/20220115224030.GA59729@dread.disaster.area/
> [3] https://lore.kernel.org/linux-xfs/Yehvc4g+WakcG1mP@bfoster/
>
>  fs/xfs/xfs_super.c | 24 ++++++++++++++++--------
>  1 file changed, 16 insertions(+), 8 deletions(-)
>
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index d0009430a627..43e72e266666 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -657,8 +657,13 @@ xfs_fs_alloc_inode(
>  }
>
>  /*
> - * Now that the generic code is guaranteed not to be accessing
> - * the linux inode, we can inactivate and reclaim the inode.
> + * Now that the generic code is guaranteed not to be accessing the inode, we can
> + * inactivate and reclaim it.
> + *
> + * NOTE: ->destroy_inode() can be called (with ->s_umount held) while the
> + * filesystem is frozen. Therefore it is generally unsafe to attempt transaction
> + * allocation in this context. A transaction alloc that blocks on frozen state
> + * from a context with ->s_umount held will deadlock with unfreeze.
>   */
>  STATIC void
>  xfs_fs_destroy_inode(
> @@ -811,15 +816,18 @@ xfs_fs_sync_fs(
>          * down inodegc because once SB_FREEZE_FS is set it's too late to
>          * prevent inactivation races with freeze. The fs doesn't get called
>          * again by the freezing process until after SB_FREEZE_FS has been set,
> -        * so it's now or never.  Same logic applies to speculative allocation
> -        * garbage collection.
> +        * so it's now or never.
>          *
> -        * We don't care if this is a normal syncfs call that does this or
> -        * freeze that does this - we can run this multiple times without issue
> -        * and we won't race with a restart because a restart can only occur
> -        * when the state is either SB_FREEZE_FS or SB_FREEZE_COMPLETE.
> +        * The same logic applies to block garbage collection. Run a best-effort
> +        * blockgc scan to reduce the working set of inodes that the shrinker
> +        * would send to inactivation queue purgatory while frozen. We can't run
> +        * a sync scan with page faults blocked because that could potentially
> +        * livelock against a read task blocked on a page fault (i.e. if reading
> +        * into a mapped buffer) while holding iolock.
>          */
>         if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT) {
> +               xfs_blockgc_free_space(mp, NULL);
> +
>                 xfs_inodegc_stop(mp);
>                 xfs_blockgc_stop(mp);
>         }
> --
> 2.42.0
>
>
Dave Chinner Jan. 22, 2024, 3:23 a.m. UTC | #2
On Fri, Jan 19, 2024 at 02:36:45PM -0500, Brian Foster wrote:
> We've had reports on distro (pre-deferred inactivation) kernels that
> inode reclaim (i.e. via drop_caches) can deadlock on the s_umount
> lock when invoked on a frozen XFS fs. This occurs because
> drop_caches acquires the lock and then blocks in xfs_inactive() on
> transaction alloc for an inode that requires an eofb trim. unfreeze
> then blocks on the same lock and the fs is deadlocked.

Yup, but why do we need to address that in upstream kernels?

> With deferred inactivation, the deadlock problem is no longer
> present because ->destroy_inode() no longer blocks whether the fs is
> frozen or not. There is still unfortunate behavior in that lookups
> of a pending inactive inode spin loop waiting for the pending
> inactive state to clear, which won't happen until the fs is
> unfrozen.

Largely we took option 1 from the previous discussion:

| ISTM the currently most viable options we've discussed are:
| 
| 1. Leave things as is, accept potential for lookup stalls while frozen
| and wait and see if this ever really becomes a problem for real users.

https://lore.kernel.org/linux-xfs/YeVxCXE6hXa1S%2Fic@bfoster/

And really it hasn't caused any serious problems with the upstream
and distro kernels that have background inodegc.

Regardless, the spinning lookup problem seems pretty easy to avoid.
We can turn it into a blocking lookup if the filesystem is frozen
simply by cycling sb_start_write/sb_end_write if there's a pending
inactivation on that inode, and now the lookup blocks until the
filesystem is thawed.
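
Something like this untested sketch (helper name made up), called from the
lookup retry path when it finds a pending inactivation:

	/*
	 * Block an inode lookup on a frozen fs: sb_start_write() sleeps
	 * until the filesystem is thawed, then drop the write reference and
	 * let the caller retry the lookup.
	 */
	static void
	xfs_iget_wait_for_thaw(
		struct xfs_mount	*mp)
	{
		sb_start_write(mp->m_super);
		sb_end_write(mp->m_super);
	}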

Alternatively, we could actually implement reinstantiation of the
VFS inode portion of the inode and reactivate the inode if inodegc
is pending. As darrick mentioned we didn't do this because of the
difficulty in removing arbitrary items from the middle of llist
structures.

However, it occurs to me that we don't even need to remove the inode
from the inodegc list to recycle it. We can be lazy - just need to
set a flag on the inode to cancel the inodegc and have inodegc clear
the flag and skip over cancelled inodes instead of inactivating
them.  Then if a gc-cancelled inode gets reclaimed by the VFS again,
the inodegc queueing code can simply remove the cancelled flag and
it's already queued for processing....

I think this avoids all the problems with needing to do inodegc
cleanup while the filesystem is frozen, so I leaning towards this as
the best way to solve this problem in the upstream kernel.

> This was always possible to some degree, but is
> potentially amplified by the fact that reclaim no longer blocks on
> the first inode that requires inactivation work. Instead, we
> populate the inactivation queues indefinitely. The side effect can
> be observed easily by invoking drop_caches on a frozen fs previously
> populated with eofb and/or cowblocks inodes and then running
> anything that relies on inode lookup (i.e., ls).

As we discussed last time, that is largely avoidable by only queuing
inodes that absolutely need cleanup work. i.e. call
xfs_can_free_eofblocks() and make it return false if the filesystem
has an elevated freeze state because eof block clearing at reclaim
is not essential for correct operation - the inode simply behaves as
if it has XFS_DIFLAG_PREALLOC set on it. This will happen with any
inode that has had fallocate() called on it, so it's not like it's a
rare occurrence, either.
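
Roughly (untested, helper name made up), the check reduces to:

	/*
	 * Skip eofblocks trimming from reclaim while the fs has an elevated
	 * freeze state; the inode then just behaves as if XFS_DIFLAG_PREALLOC
	 * were set until it is next looked up.
	 */
	static inline bool
	xfs_reclaim_eofb_frozen_out(
		struct xfs_mount	*mp)
	{
		return mp->m_super->s_writers.frozen > SB_UNFROZEN;
	}

and xfs_can_free_eofblocks() would return false whenever it is true.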

The cowblocks case is a much rarer situation, so that case could
potentially just queue those inodes until the freeze goes away.

But if we can reinstantiate inodegc queued inodes as per my
suggestion above then we can just leave the inodegc queuing
unchanged and just not care how long they block for because if they
are needed again whilst queued we can reuse them immediately....

> Finally, since the deadlock issue was present for such a long time,
> also document the subtle ->destroy_inode() constraint to avoid
> unintentional reintroduction of the deadlock problem in the future.

That's still useful, though.

-Dave.
Brian Foster Jan. 25, 2024, 12:25 p.m. UTC | #3
On Sat, Jan 20, 2024 at 10:50:02AM +0200, Amir Goldstein wrote:
> On Fri, Jan 19, 2024 at 9:35 PM Brian Foster <bfoster@redhat.com> wrote:
> >
> > We've had reports on distro (pre-deferred inactivation) kernels that
> > inode reclaim (i.e. via drop_caches) can deadlock on the s_umount
> > lock when invoked on a frozen XFS fs. This occurs because
> > drop_caches acquires the lock and then blocks in xfs_inactive() on
> > transaction alloc for an inode that requires an eofb trim. unfreeze
> > then blocks on the same lock and the fs is deadlocked.
> >
> > With deferred inactivation, the deadlock problem is no longer
> > present because ->destroy_inode() no longer blocks whether the fs is
> > frozen or not. There is still unfortunate behavior in that lookups
> > of a pending inactive inode spin loop waiting for the pending
> > inactive state to clear, which won't happen until the fs is
> > unfrozen. This was always possible to some degree, but is
> > potentially amplified by the fact that reclaim no longer blocks on
> > the first inode that requires inactivation work. Instead, we
> > populate the inactivation queues indefinitely. The side effect can
> > be observed easily by invoking drop_caches on a frozen fs previously
> > populated with eofb and/or cowblocks inodes and then running
> > anything that relies on inode lookup (i.e., ls).
> >
> > To mitigate this behavior, invoke a non-sync blockgc scan during the
> > freeze sequence to minimize the chance that inode evictions require
> > inactivation while the fs is frozen. A synchronous scan would
> > provide more of a guarantee, but is potentially unsafe from
> > partially frozen context. This is because a file read task may be
> > blocked on a write fault while holding iolock (such as when reading
> > into a mapped buffer) and a sync scan retries indefinitely on iolock
> > failure. Therefore, this adds risk of potential livelock during the
> > freeze sequence.
> >
> > Finally, since the deadlock issue was present for such a long time,
> > also document the subtle ->destroy_inode() constraint to avoid
> > unintentional reintroduction of the deadlock problem in the future.
> >
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> 
> Is there an appropriate Fixes: commit that could be mentioned here?
> or at least a range of stable kernels to apply this suggested fix?
> 

Hmm.. well I didn't really consider this a bug upstream. The above is
more of a historical reference to an issue that has since gone away; I'm
trying to use the bug report against a stable kernel to be forward looking
about improving potentially awkward behavior of the latest upstream
kernel under the same sort of circumstances (i.e. reclaim while frozen).

I suppose something like this would be potentially useful for stable
kernels that don't include background inactivation. I haven't audited
which stable kernels might fall in that category (if any), but alas it
probably doesn't matter because this patch likely wasn't going anywhere
anyways.

Brian

> Thanks,
> Amir.
> 
> > ---
> >
> > Hi all,
> >
> > There was a good amount of discussion on the first version of this patch
> > [1] a couple or so years ago now. The main issue was that using a sync
> > scan is unsafe in certain cases (best described here [2]), so this
> > best-effort approach was considered as a fallback option to improve
> > behavior.
> >
> > The reason I'm reposting this is that it is one of several options for
> > dealing with the aforementioned deadlock on stable/distro kernels, so it
> > seems to have mutual benefit. Looking back through the original
> > discussion, I think there are several ways this could be improved to
> > provide the benefit of a sync scan. For example, if the scan could be
> > made to run before faults are locked out (re [3]), that may be
> > sufficient to allow a sync scan. Or now that freeze_super() actually
> > checks for ->sync_fs() errors, an async scan could be followed by a
> > check for tagged blockgc entries that triggers an -EBUSY or some error
> > return to fail the freeze, which would most likely be a rare and
> > transient situation. Etc.
> >
> > These thoughts are mainly incremental improvements upon some form of
> > freeze time scan and may not be of significant additional value given
> > current upstream behavior, so this patch takes the simple, best effort
> > approach. Thoughts?
> >
> > Brian
> >
> > [1] https://lore.kernel.org/linux-xfs/20220113133701.629593-1-bfoster@redhat.com/
> > [2] https://lore.kernel.org/linux-xfs/20220115224030.GA59729@dread.disaster.area/
> > [3] https://lore.kernel.org/linux-xfs/Yehvc4g+WakcG1mP@bfoster/
> >
> >  fs/xfs/xfs_super.c | 24 ++++++++++++++++--------
> >  1 file changed, 16 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index d0009430a627..43e72e266666 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -657,8 +657,13 @@ xfs_fs_alloc_inode(
> >  }
> >
> >  /*
> > - * Now that the generic code is guaranteed not to be accessing
> > - * the linux inode, we can inactivate and reclaim the inode.
> > + * Now that the generic code is guaranteed not to be accessing the inode, we can
> > + * inactivate and reclaim it.
> > + *
> > + * NOTE: ->destroy_inode() can be called (with ->s_umount held) while the
> > + * filesystem is frozen. Therefore it is generally unsafe to attempt transaction
> > + * allocation in this context. A transaction alloc that blocks on frozen state
> > + * from a context with ->s_umount held will deadlock with unfreeze.
> >   */
> >  STATIC void
> >  xfs_fs_destroy_inode(
> > @@ -811,15 +816,18 @@ xfs_fs_sync_fs(
> >          * down inodegc because once SB_FREEZE_FS is set it's too late to
> >          * prevent inactivation races with freeze. The fs doesn't get called
> >          * again by the freezing process until after SB_FREEZE_FS has been set,
> > -        * so it's now or never.  Same logic applies to speculative allocation
> > -        * garbage collection.
> > +        * so it's now or never.
> >          *
> > -        * We don't care if this is a normal syncfs call that does this or
> > -        * freeze that does this - we can run this multiple times without issue
> > -        * and we won't race with a restart because a restart can only occur
> > -        * when the state is either SB_FREEZE_FS or SB_FREEZE_COMPLETE.
> > +        * The same logic applies to block garbage collection. Run a best-effort
> > +        * blockgc scan to reduce the working set of inodes that the shrinker
> > +        * would send to inactivation queue purgatory while frozen. We can't run
> > +        * a sync scan with page faults blocked because that could potentially
> > +        * livelock against a read task blocked on a page fault (i.e. if reading
> > +        * into a mapped buffer) while holding iolock.
> >          */
> >         if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT) {
> > +               xfs_blockgc_free_space(mp, NULL);
> > +
> >                 xfs_inodegc_stop(mp);
> >                 xfs_blockgc_stop(mp);
> >         }
> > --
> > 2.42.0
> >
> >
>
Brian Foster Jan. 25, 2024, 12:46 p.m. UTC | #4
On Mon, Jan 22, 2024 at 02:23:44PM +1100, Dave Chinner wrote:
> On Fri, Jan 19, 2024 at 02:36:45PM -0500, Brian Foster wrote:
> > We've had reports on distro (pre-deferred inactivation) kernels that
> > inode reclaim (i.e. via drop_caches) can deadlock on the s_umount
> > lock when invoked on a frozen XFS fs. This occurs because
> > drop_caches acquires the lock and then blocks in xfs_inactive() on
> > transaction alloc for an inode that requires an eofb trim. unfreeze
> > then blocks on the same lock and the fs is deadlocked.
> 
> Yup, but why do we need to address that in upstream kernels?
> 
> > With deferred inactivation, the deadlock problem is no longer
> > present because ->destroy_inode() no longer blocks whether the fs is
> > frozen or not. There is still unfortunate behavior in that lookups
> > of a pending inactive inode spin loop waiting for the pending
> > inactive state to clear, which won't happen until the fs is
> > unfrozen.
> 
> Largely we took option 1 from the previous discussion:
> 
> | ISTM the currently most viable options we've discussed are:
> | 
> | 1. Leave things as is, accept potential for lookup stalls while frozen
> | and wait and see if this ever really becomes a problem for real users.
> 
> https://lore.kernel.org/linux-xfs/YeVxCXE6hXa1S%2Fic@bfoster/
> 
> And really it hasn't caused any serious problems with the upstream
> and distro kernels that have background inodegc.
> 

For quite a long time, neither did introduction of the reclaim s_umount
deadlock.

I can only speak for $employer distros of course, but my understanding
is that the kernels that do have deferred inactivation are still in the
early adoption phase of the typical lifecycle.

> Regardless, the spinning lookup problem seems pretty easy to avoid.
> We can turn it into a blocking lookup if the filesystem is frozen
> simply by cycling sb_start_write/sb_end_write if there's a pending
> inactivation on that inode, and now the lookup blocks until the
> filesystem is thawed.
> 
> Alternatively, we could actually implement reinstantiation of the
> VFS inode portion of the inode and reactivate the inode if inodegc
> is pending. As darrick mentioned we didn't do this because of the
> difficulty in removing arbitrary items from the middle of llist
> structures.
> 
> However, it occurs to me that we don't even need to remove the inode
> from the inodegc list to recycle it. We can be lazy - just need to
> set a flag on the inode to cancel the inodegc and have inodegc clear
> the flag and skip over cancelled inodes instead of inactivating
> them.  Then if a gc-cancelled inode gets reclaimed by the VFS again,
> the inodegc queueing code can simply remove the cancelled flag and
> it's already queued for processing....
> 

Interesting idea, though from looking through some code I'm skeptical
this is as simple as setting and clearing a flag.

> I think this avoids all the problems with needing to do inodegc
> cleanup while the filesystem is frozen, so I leaning towards this as
> the best way to solve this problem in the upstream kernel.
> 

"... avoids all the problems ..."

AFAIA the only concern was avoiding the unlikely "read into mapped
buffer" race situation, and that is fairly easy to avoid with further
improvements to the freeze interface. For example, the appended diff is
a quickly hacked up prototype (i.e. for conceptual purposes only) of
something I was thinking as a logical evolution from this patch.

> > This was always possible to some degree, but is
> > potentially amplified by the fact that reclaim no longer blocks on
> > the first inode that requires inactivation work. Instead, we
> > populate the inactivation queues indefinitely. The side effect can
> > be observed easily by invoking drop_caches on a frozen fs previously
> > populated with eofb and/or cowblocks inodes and then running
> > anything that relies on inode lookup (i.e., ls).
> 
> As we discussed last time, that is largely avoidable by only queuing
> inodes that absolutely need cleanup work. i.e. call
> xfs_can_free_eofblocks() and make it return false if the filesystem
> has an elevated freeze state because eof block clearing at reclaim
> is not essential for correct operation - the inode simply behaves as
> if it has XFS_DIFLAG_PREALLOC set on it. This will happen with any
> inode that has had fallocate() called on it, so it's not like it's a
> rare occurrence, either.
> 

Fortunately a simple scan mostly handles this case.

> The cowblocks case is a much rarer situation, so that case could
> potentially just queue those inodes until the freeze goes away.
> 

And this as well.

> But if we can reinstantiate inodegc queued inodes as per my
> suggestion above then we can just leave the inodegc queuing
> unchanged and just not care how long they block for because if they
> are needed again whilst queued we can reuse them immediately....
> 

"... leave the inodegc queueing unchanged and just not care how long
they block for ..."

This reads to me as.. "leave inodes on the queue for a non-deterministic
amount of time during which the memory asked for by reclaim can't be
released," just
conveniently worded as if this behavior were somehow advantageous. ;)

Regardless, I'll just drop this patch. There's not enough objective
analysis here to make it clear to me how any of this is intended to try
and improve upon the patch as posted. I mainly sent this for posterity
since I had dug up the old discussion and realized I never followed up
from when it was suggested.

TLDR: option #1 it is...

> > Finally, since the deadlock issue was present for such a long time,
> > also document the subtle ->destroy_inode() constraint to avoid
> > unintentional reintroduction of the deadlock problem in the future.
> 
> That's still useful, though.
> 

Heh, well $maintainer can feel free to strip the rest out if you just
want a comment update.

Brian

--- 8< ---

Prototype to hook the filesystem into the various stages of freezing the
superblock:

diff --git a/fs/super.c b/fs/super.c
index 076392396e72..9eca666cb55b 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1961,6 +1961,7 @@ static int wait_for_partially_frozen(struct super_block *sb)
 int freeze_super(struct super_block *sb, enum freeze_holder who)
 {
 	int ret;
+	bool per_stage = sb->s_iflags & SB_I_FREEZE_HACK;
 
 	atomic_inc(&sb->s_active);
 	if (!super_lock_excl(sb))
@@ -2015,6 +2016,11 @@ int freeze_super(struct super_block *sb, enum freeze_holder who)
 	if (!super_lock_excl(sb))
 		WARN(1, "Dying superblock while freezing!");
 
+	if (per_stage) {
+		ret = sb->s_op->freeze_fs(sb);
+		BUG_ON(ret);
+	}
+
 	/* Now we go and block page faults... */
 	sb->s_writers.frozen = SB_FREEZE_PAGEFAULT;
 	sb_wait_write(sb, SB_FREEZE_PAGEFAULT);
@@ -2029,6 +2035,11 @@ int freeze_super(struct super_block *sb, enum freeze_holder who)
 		return ret;
 	}
 
+	if (per_stage) {
+		ret = sb->s_op->freeze_fs(sb);
+		BUG_ON(ret);
+	}
+
 	/* Now wait for internal filesystem counter */
 	sb->s_writers.frozen = SB_FREEZE_FS;
 	sb_wait_write(sb, SB_FREEZE_FS);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index d0009430a627..d9dec07e5184 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -805,25 +805,6 @@ xfs_fs_sync_fs(
 		flush_delayed_work(&mp->m_log->l_work);
 	}
 
-	/*
-	 * If we are called with page faults frozen out, it means we are about
-	 * to freeze the transaction subsystem. Take the opportunity to shut
-	 * down inodegc because once SB_FREEZE_FS is set it's too late to
-	 * prevent inactivation races with freeze. The fs doesn't get called
-	 * again by the freezing process until after SB_FREEZE_FS has been set,
-	 * so it's now or never.  Same logic applies to speculative allocation
-	 * garbage collection.
-	 *
-	 * We don't care if this is a normal syncfs call that does this or
-	 * freeze that does this - we can run this multiple times without issue
-	 * and we won't race with a restart because a restart can only occur
-	 * when the state is either SB_FREEZE_FS or SB_FREEZE_COMPLETE.
-	 */
-	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT) {
-		xfs_inodegc_stop(mp);
-		xfs_blockgc_stop(mp);
-	}
-
 	return 0;
 }
 
@@ -937,6 +918,23 @@ xfs_fs_freeze(
 	struct xfs_mount	*mp = XFS_M(sb);
 	unsigned int		flags;
 	int			ret;
+	unsigned short		stage = sb->s_writers.frozen;
+
+	trace_printk("%d: stage %u\n", __LINE__, stage);
+
+	if (stage == SB_FREEZE_WRITE) {
+		struct xfs_icwalk       icw = {0};
+
+		icw.icw_flags = XFS_ICWALK_FLAG_SYNC;
+		xfs_blockgc_free_space(mp, &icw);
+
+		xfs_inodegc_stop(mp);
+		xfs_blockgc_stop(mp);
+		return 0;
+	}
+
+	if (stage != SB_FREEZE_FS)
+		return 0;
 
 	/*
 	 * The filesystem is now frozen far enough that memory reclaim
@@ -1688,7 +1686,7 @@ xfs_fs_fill_super(
 		sb->s_time_max = XFS_LEGACY_TIME_MAX;
 	}
 	trace_xfs_inode_timestamp_range(mp, sb->s_time_min, sb->s_time_max);
-	sb->s_iflags |= SB_I_CGROUPWB;
+	sb->s_iflags |= SB_I_CGROUPWB | SB_I_FREEZE_HACK;
 
 	set_posix_acl_flag(sb);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 98b7a7a8c42e..5349563a860e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1170,6 +1170,7 @@ extern int send_sigurg(struct fown_struct *fown);
 #define SB_I_TS_EXPIRY_WARNED 0x00000400 /* warned about timestamp range expiry */
 #define SB_I_RETIRED	0x00000800	/* superblock shouldn't be reused */
 #define SB_I_NOUMASK	0x00001000	/* VFS does not apply umask */
+#define SB_I_FREEZE_HACK	0x00002000
 
 /* Possible states of 'frozen' field */
 enum {
Dave Chinner Jan. 25, 2024, 11:46 p.m. UTC | #5
On Thu, Jan 25, 2024 at 07:46:55AM -0500, Brian Foster wrote:
> On Mon, Jan 22, 2024 at 02:23:44PM +1100, Dave Chinner wrote:
> > On Fri, Jan 19, 2024 at 02:36:45PM -0500, Brian Foster wrote:
> > > We've had reports on distro (pre-deferred inactivation) kernels that
> > > inode reclaim (i.e. via drop_caches) can deadlock on the s_umount
> > > lock when invoked on a frozen XFS fs. This occurs because
> > > drop_caches acquires the lock and then blocks in xfs_inactive() on
> > > transaction alloc for an inode that requires an eofb trim. unfreeze
> > > then blocks on the same lock and the fs is deadlocked.
> > 
> > Yup, but why do we need to address that in upstream kernels?
> > 
> > > With deferred inactivation, the deadlock problem is no longer
> > > present because ->destroy_inode() no longer blocks whether the fs is
> > > frozen or not. There is still unfortunate behavior in that lookups
> > > of a pending inactive inode spin loop waiting for the pending
> > > inactive state to clear, which won't happen until the fs is
> > > unfrozen.
> > 
> > Largely we took option 1 from the previous discussion:
> > 
> > | ISTM the currently most viable options we've discussed are:
> > | 
> > | 1. Leave things as is, accept potential for lookup stalls while frozen
> > | and wait and see if this ever really becomes a problem for real users.
> > 
> > https://lore.kernel.org/linux-xfs/YeVxCXE6hXa1S%2Fic@bfoster/
> > 
> > And really it hasn't caused any serious problems with the upstream
> > and distro kernels that have background inodegc.
> > 
> 
> For quite a long time, neither did introduction of the reclaim s_umount
> deadlock.

Yup, and that's *exactly* the problem we should be fixing here
because that's the root cause of the deadlock you are trying to
mitigate with these freeze-side blockgc flushes.

The deadlock in XFS inactivation is only the messenger - it's
a symptom of the problem, and trying to prevent inactivation in that
scenario is only addressing one specific symptom. It doesn't
address any other potential avenue to the same deadlock in XFS or
in any other filesystem that can be frozen.

> I can only speak for $employer distros of course, but my understanding
> is that the kernels that do have deferred inactivation are still in the
> early adoption phase of the typical lifecycle.

Fixing the (shrinker) s_umount vs thaw deadlock is relevant to
current upstream kernels because it removes a landmine that any
filesystem could step on. It is also a fix that could be backported
to all downstream kernels, and in doing so will also solve the
issue on the distro you care about....

I started on the VFS changes just before christmas, but I haven't
got back to it yet because it wasn't particularly high priority. The
patch below introduces the freeze serialisation lock, but doesn't
yet reduce the s_umount scope of thaw. Maybe you can pick it up from
here?

-Dave.
Brian Foster Jan. 29, 2024, 3:02 p.m. UTC | #6
On Fri, Jan 26, 2024 at 10:46:12AM +1100, Dave Chinner wrote:
> On Thu, Jan 25, 2024 at 07:46:55AM -0500, Brian Foster wrote:
> > On Mon, Jan 22, 2024 at 02:23:44PM +1100, Dave Chinner wrote:
> > > On Fri, Jan 19, 2024 at 02:36:45PM -0500, Brian Foster wrote:
> > > > We've had reports on distro (pre-deferred inactivation) kernels that
> > > > inode reclaim (i.e. via drop_caches) can deadlock on the s_umount
> > > > lock when invoked on a frozen XFS fs. This occurs because
> > > > drop_caches acquires the lock and then blocks in xfs_inactive() on
> > > > transaction alloc for an inode that requires an eofb trim. unfreeze
> > > > then blocks on the same lock and the fs is deadlocked.
> > > 
> > > Yup, but why do we need to address that in upstream kernels?
> > > 
> > > > With deferred inactivation, the deadlock problem is no longer
> > > > present because ->destroy_inode() no longer blocks whether the fs is
> > > > frozen or not. There is still unfortunate behavior in that lookups
> > > > of a pending inactive inode spin loop waiting for the pending
> > > > inactive state to clear, which won't happen until the fs is
> > > > unfrozen.
> > > 
> > > Largely we took option 1 from the previous discussion:
> > > 
> > > | ISTM the currently most viable options we've discussed are:
> > > | 
> > > | 1. Leave things as is, accept potential for lookup stalls while frozen
> > > | and wait and see if this ever really becomes a problem for real users.
> > > 
> > > https://lore.kernel.org/linux-xfs/YeVxCXE6hXa1S%2Fic@bfoster/
> > > 
> > > And really it hasn't caused any serious problems with the upstream
> > > and distro kernels that have background inodegc.
> > > 
> > 
> > For quite a long time, neither did introduction of the reclaim s_umount
> > deadlock.
> 
> Yup, and that's *exactly* the problem we should be fixing here
> because that's the root cause of the deadlock you are trying to
> mitigate with these freeze-side blockgc flushes.
> 
> The deadlock in XFS inactivation is only the messenger - it's
> a symptom of the problem, and trying to prevent inactivation in that
> scenario is only addressing one specific symptom. It doesn't
> address any other potential avenue to the same deadlock in XFS or
> in any other filesystem that can be frozen.
> 

We address symptoms all the time. You've suggested several alternatives
in this thread that also only address symptoms. I see no reason why
symptoms and the cure must be addressed at the same time.

> > I can only speak for $employer distros of course, but my understanding
> > is that the kernels that do have deferred inactivation are still in the
> > early adoption phase of the typical lifecycle.
> 
> Fixing the (shrinker) s_umount vs thaw deadlock is relevant to
> current upstream kernels because it removes a landmine that any
> filesystem could step on. It is also a fix that could be backported
> to all downstream kernels, and in doing so will also solve the
> issue on the distro you care about....
> 

I don't think a novel upstream cross-subsystem lock split exercise is
the most practical option for a stable kernel.

> I started on the VFS changes just before christmas, but I haven't
> got back to it yet because it wasn't particularly high priority. The
> patch below introduces the freeze serialisation lock, but doesn't
> yet reduce the s_umount scope of thaw. Maybe you can pick it up from
> here?
> 

I don't disagree with this idea as a general direction, but there's too
much additional risk, complexity and uncertainty here for me to think
this is the best next step. My preference is to improve on problematic
behaviors with something like the async scan patch, unlink the user
reported issue(s) from the broader design flaw(s), and then address the
latter as a follow on effort.

Brian

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> fsfreeze: separate freeze/thaw from s_umount
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> Holding the s_umount lock exclusive across the entire freeze/thaw
> operations is problematic as we do substantial operations in those
> operations, and we can have non-freeze/thaw operations block whilst
> holding s_umount. This can lead to deadlocks on thaw when something
> blocks on a freeze holding the s_umount lock.
> 
> Whilst this might seem like a theoretical problem, consider that the
> superblock shrinker runs under s_umount protection. This means the
> inode eviction and freeing path runs under s_umount and can run
> whilst a filesystem is frozen. If a filesystem blocks in a
> shrinker-run evict path because the filesystem is frozen, then the
> thaw will deadlock attempting to obtain the s_umount lock and the
> system is toast.
> 
> This *used* to happen with XFS. The changes in 5.15 to move to
> background inode inactivation moved XFS inode inactivation out of
> the ->destroy_inode() path, and so moved the garbage collection
> transactions from the evict path to a background kworker thread.
> These transactions from the shrinker context could deadlock the
> filesystem, and it was trivially reproducable by running "freeze,
> drop_caches; thaw" on an active system.
> 
> Whilst XFS tripped over this, it is possible that any filesystem
> that runs garbage collection operations when inodes are reclaimed
> can block holding the s_umount lock waiting for a freeze condition
> to be lifted.
> 
> We can't avoid holding the s_umount lock in the superblock shrinker
> - it is needed to provide existence guarantees so the superblock and
> internal filesystem structures cannot be torn down whilst the
> shrinker is running. Hence we need to address the way freeze/thaw
> use the s_umount lock.
> 
> Realistically, the only thing that the s_umount lock is needed for
> is to ensure that the freeze/thaw has a valid active superblock
> reference. It also serves to serialise freeze/thaw against other
> freeze/thaw operations, and this is where the problem lies. That is,
> thaw needs to take the s_umount before it starts thawing the
> filesystem to serialise against other freeze/thaw operations. It
> then holds the s_umount lock until the fs is thawed and then calls
> deactivate_locked_super() to drop the active sb reference the freeze
> gained.
> 
> Realistically, the thaw only needs to hold the s_umount lock for the
> call to deactivate_locked_super() - as long as we can serialise
> thaw against other concurrent freeze/thaw operations we don't need
> s_umount for the thawing process at all.
> 
> Moving the thaw process out from under the s_umount lock and then
> only taking the s_umount lock to call deactivate_locked_super() then
> avoids all the deadlock problems with blocking holding the s_umount
> lock waiting for thaw to complete - the thaw completes, wakes
> waiters which then run to completion, drop the s_umount lock and the
> thaw then gains the s_umount lock and drops the active sb reference.
> 
> To do this, introduce a new freeze lock to the superblock. This will
> be used to serialise freeze/thaw operations, and avoid the need to
> hold the s_umount lock across these operations. The s_umount lock is
> still needed to gain/drop active sb references, but once we have
> that reference in freeze we don't need the s_umount lock any more.
> 
> However, we probably still want to serialise freeze against remount,
> so we keep the s_umount lock held across freeze. We might be able to
> reduce the scope of the s_umount lock in future, but that's not
> necessary right now to alleviate the deadlock condition.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/super.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 61c19e3f06d8..96f3edf7c66b 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -334,6 +334,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
>  	INIT_LIST_HEAD(&s->s_mounts);
>  	s->s_user_ns = get_user_ns(user_ns);
>  	init_rwsem(&s->s_umount);
> +	sema_init(&s->s_freeze_lock, 1);
>  	lockdep_set_class(&s->s_umount, &type->s_umount_key);
>  	/*
>  	 * sget() can have s_umount recursion.
> @@ -1216,6 +1217,7 @@ static void do_thaw_all_callback(struct super_block *sb)
>  		if (IS_ENABLED(CONFIG_BLOCK))
>  			while (sb->s_bdev && !thaw_bdev(sb->s_bdev))
>  				pr_warn("Emergency Thaw on %pg\n", sb->s_bdev);
> +		down(&sb->s_freeze_lock);
>  		thaw_super_locked(sb, FREEZE_HOLDER_USERSPACE);
>  	} else {
>  		super_unlock_excl(sb);
> @@ -1966,10 +1968,12 @@ int freeze_super(struct super_block *sb, enum freeze_holder who)
>  	atomic_inc(&sb->s_active);
>  	if (!super_lock_excl(sb))
>  		WARN(1, "Dying superblock while freezing!");
> +	down(&sb->s_freeze_lock);
>  
>  retry:
>  	if (sb->s_writers.frozen == SB_FREEZE_COMPLETE) {
>  		if (sb->s_writers.freeze_holders & who) {
> +			up(&sb->s_freeze_lock);
>  			deactivate_locked_super(sb);
>  			return -EBUSY;
>  		}
> @@ -1981,6 +1985,7 @@ int freeze_super(struct super_block *sb, enum freeze_holder who)
>  		 * freeze and assign the active ref to the freeze.
>  		 */
>  		sb->s_writers.freeze_holders |= who;
> +		up(&sb->s_freeze_lock);
>  		super_unlock_excl(sb);
>  		return 0;
>  	}
> @@ -1988,6 +1993,7 @@ int freeze_super(struct super_block *sb, enum freeze_holder who)
>  	if (sb->s_writers.frozen != SB_UNFROZEN) {
>  		ret = wait_for_partially_frozen(sb);
>  		if (ret) {
> +			up(&sb->s_freeze_lock);
>  			deactivate_locked_super(sb);
>  			return ret;
>  		}
> @@ -1996,6 +2002,7 @@ int freeze_super(struct super_block *sb, enum freeze_holder who)
>  	}
>  
>  	if (!(sb->s_flags & SB_BORN)) {
> +		up(&sb->s_freeze_lock);
>  		super_unlock_excl(sb);
>  		return 0;	/* sic - it's "nothing to do" */
>  	}
> @@ -2005,16 +2012,19 @@ int freeze_super(struct super_block *sb, enum freeze_holder who)
>  		sb->s_writers.freeze_holders |= who;
>  		sb->s_writers.frozen = SB_FREEZE_COMPLETE;
>  		wake_up_var(&sb->s_writers.frozen);
> +		up(&sb->s_freeze_lock);
>  		super_unlock_excl(sb);
>  		return 0;
>  	}
>  
>  	sb->s_writers.frozen = SB_FREEZE_WRITE;
>  	/* Release s_umount to preserve sb_start_write -> s_umount ordering */
> +	up(&sb->s_freeze_lock);
>  	super_unlock_excl(sb);
>  	sb_wait_write(sb, SB_FREEZE_WRITE);
>  	if (!super_lock_excl(sb))
>  		WARN(1, "Dying superblock while freezing!");
> +	down(&sb->s_freeze_lock);
>  
>  	/* Now we go and block page faults... */
>  	sb->s_writers.frozen = SB_FREEZE_PAGEFAULT;
> @@ -2026,6 +2036,7 @@ int freeze_super(struct super_block *sb, enum freeze_holder who)
>  		sb->s_writers.frozen = SB_UNFROZEN;
>  		sb_freeze_unlock(sb, SB_FREEZE_PAGEFAULT);
>  		wake_up_var(&sb->s_writers.frozen);
> +		up(&sb->s_freeze_lock);
>  		deactivate_locked_super(sb);
>  		return ret;
>  	}
> @@ -2042,6 +2053,7 @@ int freeze_super(struct super_block *sb, enum freeze_holder who)
>  			sb->s_writers.frozen = SB_UNFROZEN;
>  			sb_freeze_unlock(sb, SB_FREEZE_FS);
>  			wake_up_var(&sb->s_writers.frozen);
> +			up(&sb->s_freeze_lock);
>  			deactivate_locked_super(sb);
>  			return ret;
>  		}
> @@ -2054,6 +2066,7 @@ int freeze_super(struct super_block *sb, enum freeze_holder who)
>  	sb->s_writers.frozen = SB_FREEZE_COMPLETE;
>  	wake_up_var(&sb->s_writers.frozen);
>  	lockdep_sb_freeze_release(sb);
> +	up(&sb->s_freeze_lock);
>  	super_unlock_excl(sb);
>  	return 0;
>  }
> @@ -2071,6 +2084,7 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
>  
>  	if (sb->s_writers.frozen == SB_FREEZE_COMPLETE) {
>  		if (!(sb->s_writers.freeze_holders & who)) {
> +			up(&sb->s_freeze_lock);
>  			super_unlock_excl(sb);
>  			return -EINVAL;
>  		}
> @@ -2082,10 +2096,12 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
>  		 */
>  		if (sb->s_writers.freeze_holders & ~who) {
>  			sb->s_writers.freeze_holders &= ~who;
> +			up(&sb->s_freeze_lock);
>  			deactivate_locked_super(sb);
>  			return 0;
>  		}
>  	} else {
> +		up(&sb->s_freeze_lock);
>  		super_unlock_excl(sb);
>  		return -EINVAL;
>  	}
> @@ -2104,6 +2120,7 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
>  		if (error) {
>  			printk(KERN_ERR "VFS:Filesystem thaw failed\n");
>  			lockdep_sb_freeze_release(sb);
> +			up(&sb->s_freeze_lock);
>  			super_unlock_excl(sb);
>  			return error;
>  		}
> @@ -2114,6 +2131,7 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
>  	wake_up_var(&sb->s_writers.frozen);
>  	sb_freeze_unlock(sb, SB_FREEZE_FS);
>  out:
> +	up(&sb->s_freeze_lock);
>  	deactivate_locked_super(sb);
>  	return 0;
>  }
> @@ -2134,6 +2152,7 @@ int thaw_super(struct super_block *sb, enum freeze_holder who)
>  {
>  	if (!super_lock_excl(sb))
>  		WARN(1, "Dying superblock while thawing!");
> +	down(&sb->s_freeze_lock);
>  	return thaw_super_locked(sb, who);
>  }
>  EXPORT_SYMBOL(thaw_super);
>
Dave Chinner Feb. 1, 2024, 1:16 a.m. UTC | #7
On Mon, Jan 29, 2024 at 10:02:16AM -0500, Brian Foster wrote:
> On Fri, Jan 26, 2024 at 10:46:12AM +1100, Dave Chinner wrote:
> > On Thu, Jan 25, 2024 at 07:46:55AM -0500, Brian Foster wrote:
> > > On Mon, Jan 22, 2024 at 02:23:44PM +1100, Dave Chinner wrote:
> > > > On Fri, Jan 19, 2024 at 02:36:45PM -0500, Brian Foster wrote:
> > > > > We've had reports on distro (pre-deferred inactivation) kernels that
> > > > > inode reclaim (i.e. via drop_caches) can deadlock on the s_umount
> > > > > lock when invoked on a frozen XFS fs. This occurs because
> > > > > drop_caches acquires the lock and then blocks in xfs_inactive() on
> > > > > transaction alloc for an inode that requires an eofb trim. unfreeze
> > > > > then blocks on the same lock and the fs is deadlocked.
> > > > 
> > > > Yup, but why do we need to address that in upstream kernels?
> > > > 
> > > > > With deferred inactivation, the deadlock problem is no longer
> > > > > present because ->destroy_inode() no longer blocks whether the fs is
> > > > > frozen or not. There is still unfortunate behavior in that lookups
> > > > > of a pending inactive inode spin loop waiting for the pending
> > > > > inactive state to clear, which won't happen until the fs is
> > > > > unfrozen.
> > > > 
> > > > Largely we took option 1 from the previous discussion:
> > > > 
> > > > | ISTM the currently most viable options we've discussed are:
> > > > | 
> > > > | 1. Leave things as is, accept potential for lookup stalls while frozen
> > > > | and wait and see if this ever really becomes a problem for real users.
> > > > 
> > > > https://lore.kernel.org/linux-xfs/YeVxCXE6hXa1S%2Fic@bfoster/
> > > > 
> > > > And really it hasn't caused any serious problems with the upstream
> > > > and distro kernels that have background inodegc.
> > > > 
> > > 
> > > For quite a long time, neither did introduction of the reclaim s_umount
> > > deadlock.
> > 
> > Yup, and that's *exactly* the problem we should be fixing here
> > because that's the root cause of the deadlock you are trying to
> > mitigate with these freeze-side blockgc flushes.
> > 
> > The deadlock in XFS inactivation is only the messenger - it's
> > a symptom of the problem, and trying to prevent inactivation in that
> > scenario is only addressing one specific symptom. It doesn't
> > address any other potential avenue to the same deadlock in XFS or
> > in any other filesystem that can be frozen.
> 
> We address symptoms all the time. You've suggested several alternatives
> in this thread that also only address symptoms. I see no reason why
> symptoms and the cure must be addressed at the same time.
> 
> > > I can only speak for $employer distros of course, but my understanding
> > > is that the kernels that do have deferred inactivation are still in the
> > > early adoption phase of the typical lifecycle.
> > 
> > Fixing the (shrinker) s_umount vs thaw deadlock is relevant to
> > current upstream kernels because it removes a landmine that any
> > filesystem could step on. It is also a fix that could be backported
> > to all downstream kernels, and in doing so will also solve the
> > issue on the distro you care about....
> > 
> 
> I don't think a novel upstream cross-subsystem lock split exercise is
> the most practical option for a stable kernel.

No, but that's not the point.

If you need a custom fix for backporting to older kernels, then the
first patch in the series is the custom fix, then it gets followed
by the changes that fix the remaining upstream issues properly.
Then the last patch in the series removes the custom hack for
backports (if it still exists).

We know how to do this - we've done it many times in the past - and
it's a win-win because everyone gets what they want. i.e. There's a
backportable fix for stable kernels that doesn't burden upstream,
and the upstream issues are addressed in the best possible way and
we don't leave technical debt behind that upstream developers will
still have to address at some point in the future.

> > I started on the VFS changes just before christmas, but I haven't
> > got back to it yet because it wasn't particularly high priority. The
> > patch below introduces the freeze serialisation lock, but doesn't
> > yet reduce the s_umount scope of thaw. Maybe you can pick it up from
> > here?
> > 
> 
> I don't disagree with this idea as a general direction, but there's too
> much additional risk, complexity and uncertainty here for me to think
> this is the best next step. My preference is to improve on problematic
> behaviors with something like the async scan patch, unlink the user
> reported issue(s) from the broader design flaw(s), and then address the
> latter as a follow on effort.

I tire of discussions about how we *might* do something.

"Deeds not words", to quote my high school's motto.

Here's the fixes for the iget vs inactive vs freeze problems in the
upstream kernel:

https://lore.kernel.org/linux-xfs/20240201005217.1011010-1-david@fromorbit.com/T/#t

With that sorted, are there any other issues we know about that
running a blockgc scan during freeze might work around?

-Dave.
Brian Foster Feb. 2, 2024, 7:41 p.m. UTC | #8
On Thu, Feb 01, 2024 at 12:16:03PM +1100, Dave Chinner wrote:
> On Mon, Jan 29, 2024 at 10:02:16AM -0500, Brian Foster wrote:
> > On Fri, Jan 26, 2024 at 10:46:12AM +1100, Dave Chinner wrote:
> > > On Thu, Jan 25, 2024 at 07:46:55AM -0500, Brian Foster wrote:
> > > > On Mon, Jan 22, 2024 at 02:23:44PM +1100, Dave Chinner wrote:
> > > > > On Fri, Jan 19, 2024 at 02:36:45PM -0500, Brian Foster wrote:
...
> Here's the fixes for the iget vs inactive vs freeze problems in the
> upstream kernel:
> 
> https://lore.kernel.org/linux-xfs/20240201005217.1011010-1-david@fromorbit.com/T/#t
> 
> With that sorted, are there any other issues we know about that
> running a blockgc scan during freeze might work around?
> 

The primary motivation for the scan patch was the downstream/stable
deadlock issue. The reason I posted it upstream is because when I
considered the overall behavior change, I thought it uniformly
beneficial to both contexts based on the (minor) benefits of the side
effects of the scan. You don't need me to enumerate them, and none of
them are uniquely important or worth overanalyzing.

The only real question that matters here is do you agree with the
general reasoning for a blockgc scan during freeze, or shall I drop the
patch?

Brian

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Darrick J. Wong Feb. 2, 2024, 11:33 p.m. UTC | #9
On Fri, Feb 02, 2024 at 02:41:56PM -0500, Brian Foster wrote:
> On Thu, Feb 01, 2024 at 12:16:03PM +1100, Dave Chinner wrote:
> > On Mon, Jan 29, 2024 at 10:02:16AM -0500, Brian Foster wrote:
> > > On Fri, Jan 26, 2024 at 10:46:12AM +1100, Dave Chinner wrote:
> > > > On Thu, Jan 25, 2024 at 07:46:55AM -0500, Brian Foster wrote:
> > > > > On Mon, Jan 22, 2024 at 02:23:44PM +1100, Dave Chinner wrote:
> > > > > > On Fri, Jan 19, 2024 at 02:36:45PM -0500, Brian Foster wrote:
> ...
> > Here's the fixes for the iget vs inactive vs freeze problems in the
> > upstream kernel:
> > 
> > https://lore.kernel.org/linux-xfs/20240201005217.1011010-1-david@fromorbit.com/T/#t
> > 
> > With that sorted, are there any other issues we know about that
> > running a blockgc scan during freeze might work around?
> > 
> 
> The primary motivation for the scan patch was the downstream/stable
> deadlock issue. The reason I posted it upstream is because when I
> considered the overall behavior change, I thought it uniformly
> beneficial to both contexts based on the (minor) benefits of the side
> effects of the scan. You don't need me to enumerate them, and none of
> them are uniquely important or worth overanalyzing.
> 
> The only real question that matters here is do you agree with the
> general reasoning for a blockgc scan during freeze, or shall I drop the
> patch?

I don't see any particular downside to flushing {block,inode}gc work
during a freeze, other than that the loss of speculative preallocations
sounds painful.

Does Dave's patchset to recycle NEEDS_INACTIVE inodes eliminate the
stall problem?

--D

> Brian
> 
> > -Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
> 
>
Brian Foster Feb. 4, 2024, 4:03 p.m. UTC | #10
On Fri, Feb 02, 2024 at 03:33:43PM -0800, Darrick J. Wong wrote:
> On Fri, Feb 02, 2024 at 02:41:56PM -0500, Brian Foster wrote:
> > On Thu, Feb 01, 2024 at 12:16:03PM +1100, Dave Chinner wrote:
> > > On Mon, Jan 29, 2024 at 10:02:16AM -0500, Brian Foster wrote:
> > > > On Fri, Jan 26, 2024 at 10:46:12AM +1100, Dave Chinner wrote:
> > > > > On Thu, Jan 25, 2024 at 07:46:55AM -0500, Brian Foster wrote:
> > > > > > On Mon, Jan 22, 2024 at 02:23:44PM +1100, Dave Chinner wrote:
> > > > > > > On Fri, Jan 19, 2024 at 02:36:45PM -0500, Brian Foster wrote:
> > ...
> > > Here's the fixes for the iget vs inactive vs freeze problems in the
> > > upstream kernel:
> > > 
> > > https://lore.kernel.org/linux-xfs/20240201005217.1011010-1-david@fromorbit.com/T/#t
> > > 
> > > With that sorted, are there any other issues we know about that
> > > running a blockgc scan during freeze might work around?
> > > 
> > 
> > The primary motivation for the scan patch was the downstream/stable
> > deadlock issue. The reason I posted it upstream is because when I
> > considered the overall behavior change, I thought it uniformly
> > beneficial to both contexts based on the (minor) benefits of the side
> > effects of the scan. You don't need me to enumerate them, and none of
> > them are uniquely important or worth overanalyzing.
> > 
> > The only real question that matters here is do you agree with the
> > general reasoning for a blockgc scan during freeze, or shall I drop the
> > patch?
> 

Hi Darrick,

> I don't see any particular downside to flushing {block,inode}gc work
> during a freeze, other than the loss of speculative preallocations
> sounds painful.
> 

Yeah, that's definitely a tradeoff. The more I thought about that, the
more ISTM that the more sensitive a workload is to the penalty of an
extra blockgc scan, the less likely it is to see freeze cycles all that
often.

I suspect the same applies to the bit of extra work added to the freeze
as well, but maybe there's some more painful scenario..?

> Does Dave's patchset to recycle NEEDS_INACTIVE inodes eliminate the
> stall problem?
> 

I assume it does. I think some of the confusion here is that I probably
would have gone in a slightly different direction on that issue, but
that's a separate discussion.

As it relates to this patch, in hindsight I probably should have
rewritten the commit log from the previous version. If I were to do that
now, it might read more like this (factoring out sync vs. non-sync
nuance and whatnot):

"
xfs: run blockgc on freeze to keep inodes off the inactivation queues

blockgc processing is disabled when the filesystem is frozen. This means
<words words words about blockgc> ...

Rather than hold pending blockgc inodes in inactivation queues when
frozen, run a blockgc scan during the freeze sequence to process this
subset of inodes up front. This allows reclaim to potentially free these
inodes while frozen (by keeping them off inactive lists) and produces a
more predictable/consistent on-disk freeze state. The latter is
potentially beneficial for snapshots, as this means no dangling post-eof
preallocs or cowblock recovery.

Potential tradeoffs for this are <yadda yadda, more words from above>
...
"

... but again, the primary motivation for this was still the whole
deadlock thing. I think it's perfectly reasonable to look at this change
and say it's not worth it. Thanks for the feedback.
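
For reference, the change being discussed essentially boils down to a
single best-effort scan call somewhere in the freeze path. A minimal
sketch, illustrative only and not the posted diff; it assumes the
upstream xfs_blockgc_free_space() from fs/xfs/xfs_icache.c, and exact
placement and error handling are glossed over:

/* Sketch only: assumes fs/xfs build context. */
static int
xfs_freeze_blockgc(
	struct xfs_mount	*mp)
{
	/*
	 * A NULL icwalk means no filters: walk every inode tagged for
	 * eofblocks/cowblocks trimming. Without a sync flag the walk
	 * uses trylock semantics, so it skips, rather than spins on,
	 * inodes whose iolock is held by a blocked reader.
	 */
	return xfs_blockgc_free_space(mp, NULL);
}

Whether such a call belongs in ->freeze_fs() proper or in the freeze-time
->sync_fs() pass is part of what is being debated above.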

Brian

> --D
> 
> > Brian
> > 
> > > -Dave.
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com
> > > 
> > 
> > 
>
Darrick J. Wong Feb. 5, 2024, 10:07 p.m. UTC | #11
On Sun, Feb 04, 2024 at 11:03:07AM -0500, Brian Foster wrote:
> On Fri, Feb 02, 2024 at 03:33:43PM -0800, Darrick J. Wong wrote:
> > On Fri, Feb 02, 2024 at 02:41:56PM -0500, Brian Foster wrote:
> > > On Thu, Feb 01, 2024 at 12:16:03PM +1100, Dave Chinner wrote:
> > > > On Mon, Jan 29, 2024 at 10:02:16AM -0500, Brian Foster wrote:
> > > > > On Fri, Jan 26, 2024 at 10:46:12AM +1100, Dave Chinner wrote:
> > > > > > On Thu, Jan 25, 2024 at 07:46:55AM -0500, Brian Foster wrote:
> > > > > > > On Mon, Jan 22, 2024 at 02:23:44PM +1100, Dave Chinner wrote:
> > > > > > > > On Fri, Jan 19, 2024 at 02:36:45PM -0500, Brian Foster wrote:
> > > ...
> > > > Here's the fixes for the iget vs inactive vs freeze problems in the
> > > > upstream kernel:
> > > > 
> > > > https://lore.kernel.org/linux-xfs/20240201005217.1011010-1-david@fromorbit.com/T/#t
> > > > 
> > > > With that sorted, are there any other issues we know about that
> > > > running a blockgc scan during freeze might work around?
> > > > 
> > > 
> > > The primary motivation for the scan patch was the downstream/stable
> > > deadlock issue. The reason I posted it upstream is because when I
> > > considered the overall behavior change, I thought it uniformly
> > > beneficial to both contexts based on the (minor) benefits of the side
> > > effects of the scan. You don't need me to enumerate them, and none of
> > > them are uniquely important or worth overanalyzing.
> > > 
> > > The only real question that matters here is do you agree with the
> > > general reasoning for a blockgc scan during freeze, or shall I drop the
> > > patch?
> > 
> 
> Hi Darrick,
> 
> > I don't see any particular downside to flushing {block,inode}gc work
> > during a freeze, other than the loss of speculative preallocations
> > sounds painful.
> > 
> 
> Yeah, that's definitely a tradeoff. The more I thought about that, the
> more ISTM that any workload that might be sensitive enough to the
> penalty of an extra blockgc scan, the less likely it's probably going to
> see freeze cycles all that often.
> 
> I suspect the same applies to the bit of extra work added to the freeze
> as well , but maybe there's some more painful scenario..?

<shrug> I suppose if you had a few gigabytes of speculative
preallocations across a bunch of log files (or log structured tree
files, or whatever) then you could lose those preallocations and make
fragmentation worse.  Since blockgc can run on open files, maybe we
should leave that out of the freeze preparation syncfs?

OTOH most of the inodes on those lists are not open at all, so perhaps
we /should/ kick inodegc while preparing for freeze?  Such a patch could
reuse the justification below after s/blockgc/inodegc/.  Too bad we
didn't think far enough into the FIFREEZE design to allow userspace to
indicate if they want us to minimize freeze time or post-freeze
recovery time.

--D

> > Does Dave's patchset to recycle NEEDS_INACTIVE inodes eliminate the
> > stall problem?
> > 
> 
> I assume it does. I think some of the confusion here is that I probably
> would have gone in a slightly different direction on that issue, but
> that's a separate discussion.
> 
> As it relates to this patch, in hindsight I probably should have
> rewritten the commit log from the previous version. If I were to do that
> now, it might read more like this (factoring out sync vs. non-sync
> nuance and whatnot):
> 
> "
> xfs: run blockgc on freeze to keep inodes off the inactivation queues
> 
> blockgc processing is disabled when the filesystem is frozen. This means
> <words words words about blockgc> ...
> 
> Rather than hold pending blockgc inodes in inactivation queues when
> frozen, run a blockgc scan during the freeze sequence to process this
> subset of inodes up front. This allows reclaim to potentially free these
> inodes while frozen (by keeping them off inactive lists) and produces a
> more predictable/consistent on-disk freeze state. The latter is
> potentially beneficial for snapshots, as this means no dangling post-eof
> preallocs or cowblock recovery.
> 
> Potential tradeoffs for this are <yadda yadda, more words from above>
> ...
> "
> 
> ... but again, the primary motivation for this was still the whole
> deadlock thing. I think it's perfectly reasonable to look at this change
> and say it's not worth it. Thanks for the feedback.
> 
> Brian
> 
> > --D
> > 
> > > Brian
> > > 
> > > > -Dave.
> > > > -- 
> > > > Dave Chinner
> > > > david@fromorbit.com
> > > > 
> > > 
> > > 
> > 
> 
>
Brian Foster Feb. 6, 2024, 1:28 p.m. UTC | #12
On Mon, Feb 05, 2024 at 02:07:27PM -0800, Darrick J. Wong wrote:
> On Sun, Feb 04, 2024 at 11:03:07AM -0500, Brian Foster wrote:
> > On Fri, Feb 02, 2024 at 03:33:43PM -0800, Darrick J. Wong wrote:
> > > On Fri, Feb 02, 2024 at 02:41:56PM -0500, Brian Foster wrote:
> > > > On Thu, Feb 01, 2024 at 12:16:03PM +1100, Dave Chinner wrote:
> > > > > On Mon, Jan 29, 2024 at 10:02:16AM -0500, Brian Foster wrote:
> > > > > > On Fri, Jan 26, 2024 at 10:46:12AM +1100, Dave Chinner wrote:
> > > > > > > On Thu, Jan 25, 2024 at 07:46:55AM -0500, Brian Foster wrote:
> > > > > > > > On Mon, Jan 22, 2024 at 02:23:44PM +1100, Dave Chinner wrote:
> > > > > > > > > On Fri, Jan 19, 2024 at 02:36:45PM -0500, Brian Foster wrote:
> > > > ...
> > > > > Here's the fixes for the iget vs inactive vs freeze problems in the
> > > > > upstream kernel:
> > > > > 
> > > > > https://lore.kernel.org/linux-xfs/20240201005217.1011010-1-david@fromorbit.com/T/#t
> > > > > 
> > > > > With that sorted, are there any other issues we know about that
> > > > > running a blockgc scan during freeze might work around?
> > > > > 
> > > > 
> > > > The primary motivation for the scan patch was the downstream/stable
> > > > deadlock issue. The reason I posted it upstream is because when I
> > > > considered the overall behavior change, I thought it uniformly
> > > > beneficial to both contexts based on the (minor) benefits of the side
> > > > effects of the scan. You don't need me to enumerate them, and none of
> > > > them are uniquely important or worth overanalyzing.
> > > > 
> > > > The only real question that matters here is do you agree with the
> > > > general reasoning for a blockgc scan during freeze, or shall I drop the
> > > > patch?
> > > 
> > 
> > Hi Darrick,
> > 
> > > I don't see any particular downside to flushing {block,inode}gc work
> > > during a freeze, other than the loss of speculative preallocations
> > > sounds painful.
> > > 
> > 
> > Yeah, that's definitely a tradeoff. The more I thought about that, the
> > more ISTM that any workload that might be sensitive enough to the
> > penalty of an extra blockgc scan, the less likely it's probably going to
> > see freeze cycles all that often.
> > 
> > I suspect the same applies to the bit of extra work added to the freeze
> > as well , but maybe there's some more painful scenario..?
> 
> <shrug> I suppose if you had a few gigabytes of speculative
> preallocations across a bunch of log files (or log structured tree
> files, or whatever) then you could lose those preallocations and make
> fragmentation worse.  Since blockgc can run on open files, maybe we
> should leave that out of the freeze preparation syncfs?
> 

By "leave that out," do you mean leave out the blockgc scan on freeze,
or use a special mode that explicitly skips over opened/writeable files?

FWIW, this sounds more like a generic improvement to the background scan
to me. Background blockgc currently filters out on things like whether
the file is dirty in pagecache. If you have a log file or something, I
would think the regular background scan may end up processing such files
more frequently than a freeze induced one will..? And for anything that
isn't under active or continuous modification, freeze is already going
to flush everything out for the first post-unfreeze background scan to
take care of.

So I dunno, I think I agree and disagree. :) I think it would be
perfectly reasonable to add an open/writeable file filter check to the
regular background scan to make it less aggressive. This patch does
invoke the background scan, but only because of the wonky read into a
mapped buffer use case. I still think freeze should (eventually) rather
invoke the more aggressive sync scan and process all pending work before
quiesce and not alter behavior based on heuristics.
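
To make that concrete, a hypothetical sketch of the kind of filter being
suggested (the helper name and its for_enospc parameter are made up, not
existing XFS code; i_writecount is the generic VFS count of write opens):

static bool
xfs_blockgc_want_inode(
	struct xfs_inode	*ip,
	bool			for_enospc)
{
	/* Always process everything when we're scanning for space. */
	if (for_enospc)
		return true;
	/*
	 * Leave speculative preallocation alone while somebody still
	 * has the file open for write; a positive i_writecount means
	 * there is at least one writer.
	 */
	return atomic_read(&VFS_I(ip)->i_writecount) <= 0;
}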

> OTOH most of the inodes on those lists are not open at all, so perhaps
> we /should/ kick inodegc while preparing for freeze?  Such a patch could
> reuse the justification below after s/blockgc/inodegc/.  Too bad we
> didn't think far enough into the FIFREEZE design to allow userspace to
> indicate if they want us to minimize freeze time or post-freeze
> recovery time.
> 

Yeah, I think this potentially ties in interestingly with the
security/post freeze drop caches thing Christian brought up on fsdevel
recently. It would be more ideal if freeze actually had some controls
that different use cases could use to suggest how aggressive (or not) to
be with such things. Maybe that somewhat relates to the per-stage
->freeze_fs() prototype thing I posted earlier in the thread [1] to help
support running a sync scan.

Given the current implementation, I think ultimately it just depends on
your perspective of what freeze is supposed to do. To me, it should
reliably put the filesystem into a predictable state on-disk (based on
the common snapshot use case). It is a big hammer that should be
scheduled with care wrt to any performance sensitive workloads, and
should be minimally disruptive to the system when held for a
non-deterministic/extended amount of time. Departures from that are
either optimizations or extra feature/modifiers that we currently don't
have a great way to control. Just my .02.

Brian

[1] Appended to this reply:
  https://lore.kernel.org/linux-xfs/ZbJYP63PgykS1CwU@bfoster/

> --D
> 
> > > Does Dave's patchset to recycle NEEDS_INACTIVE inodes eliminate the
> > > stall problem?
> > > 
> > 
> > I assume it does. I think some of the confusion here is that I probably
> > would have gone in a slightly different direction on that issue, but
> > that's a separate discussion.
> > 
> > As it relates to this patch, in hindsight I probably should have
> > rewritten the commit log from the previous version. If I were to do that
> > now, it might read more like this (factoring out sync vs. non-sync
> > nuance and whatnot):
> > 
> > "
> > xfs: run blockgc on freeze to keep inodes off the inactivation queues
> > 
> > blockgc processing is disabled when the filesystem is frozen. This means
> > <words words words about blockgc> ...
> > 
> > Rather than hold pending blockgc inodes in inactivation queues when
> > frozen, run a blockgc scan during the freeze sequence to process this
> > subset of inodes up front. This allows reclaim to potentially free these
> > inodes while frozen (by keeping them off inactive lists) and produces a
> > more predictable/consistent on-disk freeze state. The latter is
> > potentially beneficial for snapshots, as this means no dangling post-eof
> > preallocs or cowblock recovery.
> > 
> > Potential tradeoffs for this are <yadda yadda, more words from above>
> > ...
> > "
> > 
> > ... but again, the primary motivation for this was still the whole
> > deadlock thing. I think it's perfectly reasonable to look at this change
> > and say it's not worth it. Thanks for the feedback.
> > 
> > Brian
> > 
> > > --D
> > > 
> > > > Brian
> > > > 
> > > > > -Dave.
> > > > > -- 
> > > > > Dave Chinner
> > > > > david@fromorbit.com
> > > > > 
> > > > 
> > > > 
> > > 
> > 
> > 
>
Darrick J. Wong Feb. 7, 2024, 5:36 p.m. UTC | #13
On Tue, Feb 06, 2024 at 08:28:09AM -0500, Brian Foster wrote:
> On Mon, Feb 05, 2024 at 02:07:27PM -0800, Darrick J. Wong wrote:
> > On Sun, Feb 04, 2024 at 11:03:07AM -0500, Brian Foster wrote:
> > > On Fri, Feb 02, 2024 at 03:33:43PM -0800, Darrick J. Wong wrote:
> > > > On Fri, Feb 02, 2024 at 02:41:56PM -0500, Brian Foster wrote:
> > > > > On Thu, Feb 01, 2024 at 12:16:03PM +1100, Dave Chinner wrote:
> > > > > > On Mon, Jan 29, 2024 at 10:02:16AM -0500, Brian Foster wrote:
> > > > > > > On Fri, Jan 26, 2024 at 10:46:12AM +1100, Dave Chinner wrote:
> > > > > > > > On Thu, Jan 25, 2024 at 07:46:55AM -0500, Brian Foster wrote:
> > > > > > > > > On Mon, Jan 22, 2024 at 02:23:44PM +1100, Dave Chinner wrote:
> > > > > > > > > > On Fri, Jan 19, 2024 at 02:36:45PM -0500, Brian Foster wrote:
> > > > > ...
> > > > > > Here's the fixes for the iget vs inactive vs freeze problems in the
> > > > > > upstream kernel:
> > > > > > 
> > > > > > https://lore.kernel.org/linux-xfs/20240201005217.1011010-1-david@fromorbit.com/T/#t
> > > > > > 
> > > > > > With that sorted, are there any other issues we know about that
> > > > > > running a blockgc scan during freeze might work around?
> > > > > > 
> > > > > 
> > > > > The primary motivation for the scan patch was the downstream/stable
> > > > > deadlock issue. The reason I posted it upstream is because when I
> > > > > considered the overall behavior change, I thought it uniformly
> > > > > beneficial to both contexts based on the (minor) benefits of the side
> > > > > effects of the scan. You don't need me to enumerate them, and none of
> > > > > them are uniquely important or worth overanalyzing.
> > > > > 
> > > > > The only real question that matters here is do you agree with the
> > > > > general reasoning for a blockgc scan during freeze, or shall I drop the
> > > > > patch?
> > > > 
> > > 
> > > Hi Darrick,
> > > 
> > > > I don't see any particular downside to flushing {block,inode}gc work
> > > > during a freeze, other than the loss of speculative preallocations
> > > > sounds painful.
> > > > 
> > > 
> > > Yeah, that's definitely a tradeoff. The more I thought about that, the
> > > more ISTM that any workload that might be sensitive enough to the
> > > penalty of an extra blockgc scan, the less likely it's probably going to
> > > see freeze cycles all that often.
> > > 
> > > I suspect the same applies to the bit of extra work added to the freeze
> > > as well , but maybe there's some more painful scenario..?
> > 
> > <shrug> I suppose if you had a few gigabytes of speculative
> > preallocations across a bunch of log files (or log structured tree
> > files, or whatever) then you could lose those preallocations and make
> > fragmentation worse.  Since blockgc can run on open files, maybe we
> > should leave that out of the freeze preparation syncfs?
> > 
> 
> By "leave that out," do you mean leave out the blockgc scan on freeze,
> or use a special mode that explicitly skips over opened/writeable files?

I meant s/xfs_blockgc_free_space/xfs_inodegc_flush/ in the patch you sent.

But I'd wondered over the years if blockgc ought to ignore files that
are still opened for write unless we're scouring for free space due to
an ENOSPC.  Maybe the current heuristic of skipping files with dirty
pagecache or IOLOCK contention is good enough.

> FWIW, this sounds more like a generic improvement to the background scan
> to me. Background blockgc currently filters out on things like whether
> the file is dirty in pagecache. If you have a log file or something, I
> would think the regular background scan may end up processing such files
> more frequently than a freeze induced one will..? And for anything that
> isn't under active or continuous modification, freeze is already going
> to flush everything out for the first post-unfreeze background scan to
> take care of.

Mhm.

> So I dunno, I think I agree and disagree. :) I think it would be
> perfectly reasonable to add an open/writeable file filter check to the
> regular background scan to make it less aggressive. This patch does
> invoke the background scan, but only because of the wonky read into a
> mapped buffer use case.

Does this livelock happen on a non-frozen filesystem too?  I wasn't too
sure if you wrote about that in the commit message because there's a
real livelock bug w.r.t. that or if that sentence was simply explaining
the use of an async scan.

> I still think freeze should (eventually) rather
> invoke the more aggressive sync scan and process all pending work before
> quiesce and not alter behavior based on heuristics.

Admittedly, given how much recovery /can/ be required, I'm starting to
think that we could push more of that work to disk before the freeze.

> > OTOH most of the inodes on those lists are not open at all, so perhaps
> > we /should/ kick inodegc while preparing for freeze?  Such a patch could
> > reuse the justification below after s/blockgc/inodegc/.  Too bad we
> > didn't think far enough into the FIFREEZE design to allow userspace to
> > indicate if they want us to minimize freeze time or post-freeze
> > recovery time.
> > 
> 
> Yeah, I think this potentially ties in interestingly with the
> security/post freeze drop caches thing Christian brought up on fsdevel
> recently.

<nod> Back in the day Luis was trying to rearrange the suspend code so
that we'd freeze the filesystems in reverse mount order.  I guess the
trouble with that approach is that for suspend you'd also want to block
read requests.

> It would be more ideal if freeze actually had some controls
> that different use cases could use to suggest how aggressive (or not) to
> be with such things. Maybe that somewhat relates to the per-stage
> ->freeze_fs() prototype thing I posted earlier in the thread [1] to help
> support running a sync scan.

Agreed.  Frustratingly, I took a look at the FIFREEZE definition and
realized that it /does/ actually take a pointer to an int:

include/uapi/linux/fs.h:196:#define FIFREEZE _IOWR('X', 119, int)    /* Freeze */
include/uapi/linux/fs.h:197:#define FITHAW   _IOWR('X', 120, int)    /* Thaw */

But the current implementation ignores the parameter and Debian code
search shows that some people call ioctl(fd, FIFREEZE) which means that
we'd have to create a FIFREEZE2 just to add a parameter.
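
For reference, the calling convention as it stands today looks like the
sketch below; the int argument is there in the definition but the kernel
never looks at it, and the mount point path is just an example:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>		/* FIFREEZE, FITHAW */

int main(void)
{
	int arg = 0;				/* accepted but ignored */
	int fd = open("/mnt", O_RDONLY);	/* any fd on the target fs */

	if (fd < 0 || ioctl(fd, FIFREEZE, &arg) < 0) {
		perror("FIFREEZE");
		return 1;
	}
	/* ... snapshot/backup runs while the fs is frozen ... */
	if (ioctl(fd, FITHAW, &arg) < 0) {
		perror("FITHAW");
		return 1;
	}
	close(fd);
	return 0;
}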

> Given the current implementation, I think ultimately it just depends on
> your perspective of what freeze is supposed to do. To me, it should
> reliably put the filesystem into a predictable state on-disk (based on
> the common snapshot use case).

I always thought freeze was only supposed to do the bare minimum needed
to quiesce the filesystem, assuming that the common case is that we
quickly thaw and resume runtime operations.  OTOH a dirty 1GB log will
take a while to recover, and if the point was to make a backup or
something, that just makes IT unhappy.

> It is a big hammer that should be
> scheduled with care wrt to any performance sensitive workloads, and
> should be minimally disruptive to the system when held for a
> non-deterministic/extended amount of time. Departures from that are
> either optimizations or extra feature/modifiers that we currently don't
> have a great way to control. Just my .02.

<nod> Is there anyone interested in working on adding a mode parameter
to FIFREEZE?  What happens if the freeze comes in via the block layer?

--D

> 
> Brian
> 
> [1] Appended to this reply:
>   https://lore.kernel.org/linux-xfs/ZbJYP63PgykS1CwU@bfoster/
> 
> > --D
> > 
> > > > Does Dave's patchset to recycle NEEDS_INACTIVE inodes eliminate the
> > > > stall problem?
> > > > 
> > > 
> > > I assume it does. I think some of the confusion here is that I probably
> > > would have gone in a slightly different direction on that issue, but
> > > that's a separate discussion.
> > > 
> > > As it relates to this patch, in hindsight I probably should have
> > > rewritten the commit log from the previous version. If I were to do that
> > > now, it might read more like this (factoring out sync vs. non-sync
> > > nuance and whatnot):
> > > 
> > > "
> > > xfs: run blockgc on freeze to keep inodes off the inactivation queues
> > > 
> > > blockgc processing is disabled when the filesystem is frozen. This means
> > > <words words words about blockgc> ...
> > > 
> > > Rather than hold pending blockgc inodes in inactivation queues when
> > > frozen, run a blockgc scan during the freeze sequence to process this
> > > subset of inodes up front. This allows reclaim to potentially free these
> > > inodes while frozen (by keeping them off inactive lists) and produces a
> > > more predictable/consistent on-disk freeze state. The latter is
> > > potentially beneficial for snapshots, as this means no dangling post-eof
> > > preallocs or cowblock recovery.
> > > 
> > > Potential tradeoffs for this are <yadda yadda, more words from above>
> > > ...
> > > "
> > > 
> > > ... but again, the primary motivation for this was still the whole
> > > deadlock thing. I think it's perfectly reasonable to look at this change
> > > and say it's not worth it. Thanks for the feedback.
> > > 
> > > Brian
> > > 
> > > > --D
> > > > 
> > > > > Brian
> > > > > 
> > > > > > -Dave.
> > > > > > -- 
> > > > > > Dave Chinner
> > > > > > david@fromorbit.com
> > > > > > 
> > > > > 
> > > > > 
> > > > 
> > > 
> > > 
> > 
> 
>
Brian Foster Feb. 8, 2024, 12:54 p.m. UTC | #14
On Wed, Feb 07, 2024 at 09:36:20AM -0800, Darrick J. Wong wrote:
> On Tue, Feb 06, 2024 at 08:28:09AM -0500, Brian Foster wrote:
> > On Mon, Feb 05, 2024 at 02:07:27PM -0800, Darrick J. Wong wrote:
> > > On Sun, Feb 04, 2024 at 11:03:07AM -0500, Brian Foster wrote:
> > > > On Fri, Feb 02, 2024 at 03:33:43PM -0800, Darrick J. Wong wrote:
> > > > > On Fri, Feb 02, 2024 at 02:41:56PM -0500, Brian Foster wrote:
> > > > > > On Thu, Feb 01, 2024 at 12:16:03PM +1100, Dave Chinner wrote:
> > > > > > > On Mon, Jan 29, 2024 at 10:02:16AM -0500, Brian Foster wrote:
> > > > > > > > On Fri, Jan 26, 2024 at 10:46:12AM +1100, Dave Chinner wrote:
> > > > > > > > > On Thu, Jan 25, 2024 at 07:46:55AM -0500, Brian Foster wrote:
> > > > > > > > > > On Mon, Jan 22, 2024 at 02:23:44PM +1100, Dave Chinner wrote:
> > > > > > > > > > > On Fri, Jan 19, 2024 at 02:36:45PM -0500, Brian Foster wrote:
> > > > > > ...
> > > > > > > Here's the fixes for the iget vs inactive vs freeze problems in the
> > > > > > > upstream kernel:
> > > > > > > 
> > > > > > > https://lore.kernel.org/linux-xfs/20240201005217.1011010-1-david@fromorbit.com/T/#t
> > > > > > > 
> > > > > > > With that sorted, are there any other issues we know about that
> > > > > > > running a blockgc scan during freeze might work around?
> > > > > > > 
> > > > > > 
> > > > > > The primary motivation for the scan patch was the downstream/stable
> > > > > > deadlock issue. The reason I posted it upstream is because when I
> > > > > > considered the overall behavior change, I thought it uniformly
> > > > > > beneficial to both contexts based on the (minor) benefits of the side
> > > > > > effects of the scan. You don't need me to enumerate them, and none of
> > > > > > them are uniquely important or worth overanalyzing.
> > > > > > 
> > > > > > The only real question that matters here is do you agree with the
> > > > > > general reasoning for a blockgc scan during freeze, or shall I drop the
> > > > > > patch?
> > > > > 
> > > > 
> > > > Hi Darrick,
> > > > 
> > > > > I don't see any particular downside to flushing {block,inode}gc work
> > > > > during a freeze, other than the loss of speculative preallocations
> > > > > sounds painful.
> > > > > 
> > > > 
> > > > Yeah, that's definitely a tradeoff. The more I thought about that, the
> > > > more ISTM that any workload that might be sensitive enough to the
> > > > penalty of an extra blockgc scan, the less likely it's probably going to
> > > > see freeze cycles all that often.
> > > > 
> > > > I suspect the same applies to the bit of extra work added to the freeze
> > > > as well , but maybe there's some more painful scenario..?
> > > 
> > > <shrug> I suppose if you had a few gigabytes of speculative
> > > preallocations across a bunch of log files (or log structured tree
> > > files, or whatever) then you could lose those preallocations and make
> > > fragmentation worse.  Since blockgc can run on open files, maybe we
> > > should leave that out of the freeze preparation syncfs?
> > > 
> > 
> > By "leave that out," do you mean leave out the blockgc scan on freeze,
> > or use a special mode that explicitly skips over opened/writeable files?
> 
> I meant s/xfs_blockgc_free_space/xfs_inodegc_flush/ in the patch you sent.
> 

Ah, but isn't that already part of the xfs_inodegc_stop() path?

> But I'd wondered over the years if blockgc ought to ignore files that
> are still opened for write unless we're scouring for free space due to
> an ENOSPC.  Maybe the current heuristic of skipping files with dirty
> pagecache or IOLOCK contention is good enough.
> 

The existing heuristic may be fine for the current model of processing,
but I think this is interesting in another context. I have a couple
prototype variants around that work by keeping blockgc inodes out of
eviction until blockgc work is actually processed. To do something like
that probably warrants blockgc work to be more shrinker driven to
accommodate memory pressure, so ISTM that having some opened/writeable
logic to more intelligently select which blockgc inodes to process for
the purpose of eviction/reclaim could be rather useful.

> > FWIW, this sounds more like a generic improvement to the background scan
> > to me. Background blockgc currently filters out on things like whether
> > the file is dirty in pagecache. If you have a log file or something, I
> > would think the regular background scan may end up processing such files
> > more frequently than a freeze induced one will..? And for anything that
> > isn't under active or continuous modification, freeze is already going
> > to flush everything out for the first post-unfreeze background scan to
> > take care of.
> 
> Mhm.
> 
> > So I dunno, I think I agree and disagree. :) I think it would be
> > perfectly reasonable to add an open/writeable file filter check to the
> > regular background scan to make it less aggressive. This patch does
> > invoke the background scan, but only because of the wonky read into a
> > mapped buffer use case.
> 
> Does this livelock happen on a non-frozen filesystem too?  I wasn't too
> sure if you wrote about that in the commit message because there's a
> real livelock bug w.r.t. that or if that sentence was simply explaining
> the use of an async scan.
> 

The livelock was purely a SYNC blockgc scan vs. SB_FREEZE_PAGEFAULT
thing. The original patch I sent way back when did the sync scan, but
appeared to be susceptible to livelock if there was a blocked task
reading (i.e. holding iolock) and causing a page fault by copying into a
mapped buffer, because the scan will continuously try for the iolock
that won't ever be released. So the non-sync variant was a followup
suggestion as a best effort scan.
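
To make the scenario concrete, a rough userspace sketch of that pattern
(hypothetical paths; it assumes both files already exist on the same
about-to-be-frozen XFS mount and are at least 64k):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int src = open("/mnt/src", O_RDONLY);
	int dst = open("/mnt/dst", O_RDWR);
	char *buf;

	if (src < 0 || dst < 0) {
		perror("open");
		return 1;
	}
	/* Writable shared mapping; no pages are faulted in yet. */
	buf = mmap(NULL, 65536, PROT_READ | PROT_WRITE, MAP_SHARED, dst, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/*
	 * The buffered read takes the source file's iolock and then
	 * faults on buf while copying data out. If the fs freezes
	 * before that fault, the fault waits at SB_FREEZE_PAGEFAULT
	 * with the iolock still held, which is exactly what a sync
	 * blockgc scan would then retry on forever.
	 */
	if (read(src, buf, 65536) < 0)
		perror("read");
	munmap(buf, 65536);
	return 0;
}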

I think that if we could have a mode where the vfs called ->freeze_fs()
once per freeze stage instead of only once at the end, then we could
actually do a sync scan under SB_FREEZE_WRITE protection (possibly
followed by another non-sync scan under PAGEFAULT to catch any races).
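
Purely as a sketch of that idea (the VFS has no per-stage ->freeze_fs()
today, so the hook name and its level argument are hypothetical, and the
icwalk sync flag is assumed from the current fs/xfs/xfs_icache.c code):

static int
xfs_fs_freeze_stage(
	struct super_block	*sb,
	int			level)
{
	struct xfs_mount	*mp = XFS_M(sb);
	struct xfs_icwalk	icw = {
		.icw_flags	= XFS_ICWALK_FLAG_SYNC,
	};

	switch (level) {
	case SB_FREEZE_WRITE:
		/*
		 * New writes are locked out but page faults still work,
		 * so a sync scan can wait on iolocks held by readers
		 * faulting on mapped buffers without livelocking.
		 */
		return xfs_blockgc_free_space(mp, &icw);
	case SB_FREEZE_PAGEFAULT:
		/* Best-effort pass to catch anything that raced in. */
		return xfs_blockgc_free_space(mp, NULL);
	case SB_FREEZE_FS:
		/* Fall through to the existing quiesce path. */
		return xfs_fs_freeze(sb);
	}
	return 0;
}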

> > I still think freeze should (eventually) rather
> > invoke the more aggressive sync scan and process all pending work before
> > quiesce and not alter behavior based on heuristics.
> 
> Admittedly, given how much recovery /can/ be required, I'm starting to
> think that we could push more of that work to disk before the freeze.
> 

I agree, but at least recovery is predictable. If you have a situation
where you have a bunch of inodes with post-eof blocks and then take a
snapshot, the fact that the snap might now have GBs of space off the end
of random inodes somewhere in the fs is pretty wonky behavior to me. I
suspect the only real way to reclaim that space is to cycle every inode
in the snapshot fs through the cache such that the extent list is read
and inode reclaim can identify whether there is post-eof space to trim,
but I've not actually experimented to see how bad that really is.

> > > OTOH most of the inodes on those lists are not open at all, so perhaps
> > > we /should/ kick inodegc while preparing for freeze?  Such a patch could
> > > reuse the justification below after s/blockgc/inodegc/.  Too bad we
> > > didn't think far enough into the FIFREEZE design to allow userspace to
> > > indicate if they want us to minimize freeze time or post-freeze
> > > recovery time.
> > > 
> > 
> > Yeah, I think this potentially ties in interestingly with the
> > security/post freeze drop caches thing Christian brought up on fsdevel
> > recently.
> 
> <nod> Back in the day Luis was trying to rearrange the suspend code so
> that we'd freeze the filesystems in reverse mount order.  I guess the
> trouble with that approach is that for suspend you'd also want to block
> read requests.
> 

Yeah, I vaguely recall some discussion on that and kind of got the sense
that suspend was... more involved of a problem. :P

> > It would be more ideal if freeze actually had some controls
> > that different use cases could use to suggest how aggressive (or not) to
> > be with such things. Maybe that somewhat relates to the per-stage
> > ->freeze_fs() prototype thing I posted earlier in the thread [1] to help
> > support running a sync scan.
> 
> Agreed.  Frustratingly, I took a look at the FIFREEZE definition and
> realized that it /does/ actually take a pointer to an int:
> 
> include/uapi/linux/fs.h:196:#define FIFREEZE _IOWR('X', 119, int)    /* Freeze */
> include/uapi/linux/fs.h:197:#define FITHAW   _IOWR('X', 120, int)    /* Thaw */
> 
> But the current implementation ignores the parameter and Debian code
> search shows that some people call ioctl(fd, FIFREEZE) which means that
> we'd have to create a FIFREEZE2 just to add a parameter.
> 
> > Given the current implementation, I think ultimately it just depends on
> > your perspective of what freeze is supposed to do. To me, it should
> > reliably put the filesystem into a predictable state on-disk (based on
> > the common snapshot use case).
> 
> I always thought freeze was only supposed to do the bare minimum needed
> to quiesce the filesystem, assuming that the common case is that we
> quickly thaw and resume runtime operations.  OTOH a dirty 1GB log will
> take a while to recover, and if the point was to make a backup or
> something, that just makes IT unhappy.
> 

It seems the reality is that it's neither of these things we think it
should be, and rather just the result of a series of short term tweaks
and mods over a long period of time. :) I think at one point the entire
inode working set was reclaimed on freeze, but that was removed due to
being crazy. But now we obviously leave around blockgc inodes as a side
effect. I think the log is intentionally left dirty (but forced and
covered) to ensure the snapshot processes unlinked inodes as well (and I
think there's been a lot of back and forth on that over the years), but
could be misremembering...

> > It is a big hammer that should be
> > scheduled with care wrt to any performance sensitive workloads, and
> > should be minimally disruptive to the system when held for a
> > non-deterministic/extended amount of time. Departures from that are
> > either optimizations or extra feature/modifiers that we currently don't
> > have a great way to control. Just my .02.
> 
> <nod> Is there anyone interested in working on adding a mode parameter
> to FIFREEZE?  What happens if the freeze comes in via the block layer?
> 

We'd probably want to audit and maybe try to understand the various use
cases before getting too far with userspace API (at least defining flags
anyways). At least I'm not sure I really have a clear view of anything
outside of the simple snapshot case.

IOW, suppose in theory freeze by default was the biggest hammer possible
and made the filesystem generally flushed and clean on disk, but then
grew a flag input for behavior modifiers. What behavior would we modify
and why? Would somebody want a NOFLUSH thing that is purely a runtime
quiesce? Or something in between where data and log are flushed, but no
blockgc scanning or log forcing and so forth? Something else?

Brian

> --D
> 
> > 
> > Brian
> > 
> > [1] Appended to this reply:
> >   https://lore.kernel.org/linux-xfs/ZbJYP63PgykS1CwU@bfoster/
> > 
> > > --D
> > > 
> > > > > Does Dave's patchset to recycle NEEDS_INACTIVE inodes eliminate the
> > > > > stall problem?
> > > > > 
> > > > 
> > > > I assume it does. I think some of the confusion here is that I probably
> > > > would have gone in a slightly different direction on that issue, but
> > > > that's a separate discussion.
> > > > 
> > > > As it relates to this patch, in hindsight I probably should have
> > > > rewritten the commit log from the previous version. If I were to do that
> > > > now, it might read more like this (factoring out sync vs. non-sync
> > > > nuance and whatnot):
> > > > 
> > > > "
> > > > xfs: run blockgc on freeze to keep inodes off the inactivation queues
> > > > 
> > > > blockgc processing is disabled when the filesystem is frozen. This means
> > > > <words words words about blockgc> ...
> > > > 
> > > > Rather than hold pending blockgc inodes in inactivation queues when
> > > > frozen, run a blockgc scan during the freeze sequence to process this
> > > > subset of inodes up front. This allows reclaim to potentially free these
> > > > inodes while frozen (by keeping them off inactive lists) and produces a
> > > > more predictable/consistent on-disk freeze state. The latter is
> > > > potentially beneficial for snapshots, as this means no dangling post-eof
> > > > preallocs or cowblock recovery.
> > > > 
> > > > Potential tradeoffs for this are <yadda yadda, more words from above>
> > > > ...
> > > > "
> > > > 
> > > > ... but again, the primary motivation for this was still the whole
> > > > deadlock thing. I think it's perfectly reasonable to look at this change
> > > > and say it's not worth it. Thanks for the feedback.
> > > > 
> > > > Brian
> > > > 
> > > > > --D
> > > > > 
> > > > > > Brian
> > > > > > 
> > > > > > > -Dave.
> > > > > > > -- 
> > > > > > > Dave Chinner
> > > > > > > david@fromorbit.com
> > > > > > > 
> > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > > 
> > > 
> > 
> > 
>
Dave Chinner Feb. 9, 2024, 12:23 a.m. UTC | #15
On Thu, Feb 08, 2024 at 07:54:00AM -0500, Brian Foster wrote:
> On Wed, Feb 07, 2024 at 09:36:20AM -0800, Darrick J. Wong wrote:
> > On Tue, Feb 06, 2024 at 08:28:09AM -0500, Brian Foster wrote:
> > > On Mon, Feb 05, 2024 at 02:07:27PM -0800, Darrick J. Wong wrote:
> > > > On Sun, Feb 04, 2024 at 11:03:07AM -0500, Brian Foster wrote:
> > But I'd wondered over the years if blockgc ought to ignore files that
> > are still opened for write unless we're scouring for free space due to
> > an ENOSPC.  Maybe the current heuristic of skipping files with dirty
> > pagecache or IOLOCK contention is good enough.
> > 
> 
> The existing heuristic may be fine for the current model of processing,
> but I think this is interesting in another context. I have a couple
> prototype variants around that work by keeping blockgc inodes out of
> eviction until blockgc work is actually processed. To do something like
> that probably warrants blockgc work to be more shrinker driven to
> accommodate memory pressure, so ISTM that having some opened/writeable
> logic to more intelligently select which blockgc inodes to process for
> the purpose of eviction/reclaim could be rather useful.
> 
> > > FWIW, this sounds more like a generic improvement to the background scan
> > > to me. Background blockgc currently filters out on things like whether
> > > the file is dirty in pagecache. If you have a log file or something, I
> > > would think the regular background scan may end up processing such files
> > > more frequently than a freeze induced one will..? And for anything that
> > > isn't under active or continuous modification, freeze is already going
> > > to flush everything out for the first post-unfreeze background scan to
> > > take care of.
> > 
> > Mhm.
> > 
> > > So I dunno, I think I agree and disagree. :) I think it would be
> > > perfectly reasonable to add an open/writeable file filter check to the
> > > regular background scan to make it less aggressive. This patch does
> > > invoke the background scan, but only because of the wonky read into a
> > > mapped buffer use case.
> > 
> > Does this livelock happen on a non-frozen filesystem too?  I wasn't too
> > sure if you wrote about that in the commit message because there's a
> > real livelock bug w.r.t. that or if that sentence was simply explaining
> > the use of an async scan.
> > 
> 
> The livelock was purely a SYNC blockgc scan vs. SB_FREEZE_PAGEFAULT
> thing. The original patch I sent way back when did the sync scan, but
> appeared to be susceptible to livelock if there was a blocked task
> reading (i.e. holding iolock) and causing a page fault by copying into a
> mapped buffer, because the scan will continuously try for the iolock
> that won't ever be released. So the non-sync variant was a followup
> suggestion as a best effort scan.
> 
> I think that if we could have a mode where the vfs called ->freeze_fs()
> once per freeze stage instead of only once at the end, then we could
> actually do a sync scan under SB_FREEZE_WRITE protection (possibly
> followed by another non-sync scan under PAGEFAULT to catch any races).

To what purpose, though? It's being talked about as a way to run a
blockgc pass during freeze on XFS, but we don't need blockgc
passes in freeze for XFS for deadlock avoidance or correct freeze
behaviour.

Hence I just don't see what problems this complexity is going to
fix. What am I missing?

> > > I still think freeze should (eventually) rather
> > > invoke the more aggressive sync scan and process all pending work before
> > > quiesce and not alter behavior based on heuristics.
> > 
> > Admittedly, given how much recovery /can/ be required, I'm starting to
> > think that we could push more of that work to disk before the freeze.
> 
> I agree, but at least recovery is predictable. If you have a situation
> where you have a bunch of inodes with post-eof blocks and then take a
> snapshot, the fact that the snap might now have GBs of space off the end
> of random inodes somewhere in the fs is pretty wonky behavior to me. I

We've had that behaviour for 20+ years.  There's no evidence that it
is problematic and, realistically, it's not different to the
filesystem state after a system crash. i.e. crash your system, and
every inode in memory that has preallocated blocks beyond EOF now
has them persistently on disk, and they don't get cleaned up until
the next time they get cycled through the inode cache.

Further, we intentionally disable blockgc on any inode that has been
opened O_APPEND or had fallocate() called to preallocate blocks. So
regardless of anything else, blockgc scans are not a guarantee that
we clean up post-eof blocks during a freeze - they will exist on
disk after a freeze regardless of what we do during the freeze
process.

Also, we use the same quiesce code for remount,ro as we use for
freeze. Hence remount,ro leaves all the inodes in memory with
pending blockgc unchanged. The post-eof blocks are persisted on
disk, and they won't get removed until the filesystem is mounted rw
again and the inode is cycled through cache.

And then there's system crashes. They leave post-eof blocks
persistent on disk, too, and there is no possibility of blockgc
being done on them. Recovery does not run GC on these post-eof
blocks, nor should it.

So given that remount,ro and system crashes result in exactly the
same on-disk state as a freeze and we can't do anything about
crashes resulting in this state, why should we try to avoid
having post-eof blocks on disk during a freeze? They are going to be
on disk before a freeze, and then recreated immediately after a
freeze, so why does a freeze specifically need to remove them?

> suspect the only real way to reclaim that space is to cycle every inode
> in the snapshot fs through the cache such that the extent list is read
> and inode reclaim can identify whether there is post-eof space to trim,
> but I've not actually experimented to see how bad that really is.

We know how expensive that is: quotacheck has this exact same
requirement.  i.e. quotacheck has to cycle every inode in the
filesystem through the inode cache, and that will run post-eof block
removal as the inodes cycle out of cache.

And let's not forget that the reason we journal dquots is so that we
don't have to run a quotacheck on every mount because the overhead
is prohibitive on filesystems with millions of inodes.

If a full filesystem inode scan is necessary for blockgc, then do it
from userspace. A simple bulkstat pass will cycle every inode in the
filesystem through the cache, and so problem solved without having
to change a single line of XFS code anywhere.
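
As a rough sketch of that bulkstat pass (hedged: the v1 bulkstat ioctl
and structures below are recalled from xfsprogs' <xfs/xfs_fs.h> and
should be checked against your headers; the mount point is an example):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <xfs/xfs.h>	/* XFS_IOC_FSBULKSTAT, struct xfs_fsop_bulkreq */

int main(void)
{
	struct xfs_bstat	buf[1024];
	struct xfs_fsop_bulkreq	req = { 0 };
	__u64			lastino = 0;
	__s32			count = 0;
	int			fd = open("/mnt", O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	req.lastip = &lastino;
	req.icount = 1024;
	req.ubuffer = buf;
	req.ocount = &count;

	/* Each call instantiates another batch of inodes in the cache. */
	while (ioctl(fd, XFS_IOC_FSBULKSTAT, &req) == 0 && count > 0)
		;	/* nothing to do; cycling the inodes is the point */

	close(fd);
	return 0;
}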

> > > Given the current implementation, I think ultimately it just depends on
> > > your perspective of what freeze is supposed to do. To me, it should
> > > reliably put the filesystem into a predictable state on-disk (based on
> > > the common snapshot use case).
> > 
> > I always thought freeze was only supposed to do the bare minimum needed
> > to quiesce the filesystem, assuming that the common case is that we
> > quickly thaw and resume runtime operations.  OTOH a dirty 1GB log will
> > take a while to recover, and if the point was to make a backup or
> > something, that just makes IT unhappy.
> 
> It seems the reality is that it's neither of these things we think it
> should be, and rather just the result of a series of short term tweaks
> and mods over a long period of time. :)

IMO, freeze *requirements* are unchanged from what they were 25
years ago. Freeze is simply required to bring the filesystem down to
a consistent state on disk that is entirely readable without
requiring the filesytem to perform any more write operations to
perform those read and hold it unchanged in that state until the
filesystem is thawed.

A further desirable runtime characteristic is that it is
fast and has minimal runtime impact on application behaviour.

Yes, the implementation has changed over the years as we've found
bugs, changed the way other subsystems work, etc. But the actual
functional requirements of what freeze is supposed to provide have
not changed at all.

> I think at one point the entire
> inode working set was reclaimed on freeze, but that was removed due to
> being crazy.

Yes, up until 2020 it did do inode reclaim, but that was an
implementation detail. And it wasn't removed because it was "crazy",
it was removed because it was no longer necessary for correct
behaviour.

Back in 2005 we fixed some freeze issues by making it behave like
remount,ro. They both do largely the same thing in terms of bringing
the fs down to a consistent clean state on disk, so sharing the
implementation made a lot of sense. At the time, the only way to
prevent reclaimable inodes from running transactions after a
specific point in time was to process them all. i.e. inactivate all
the inodes, write them back and reclaim them. That was the problem
we needed to fix in freeze, and remount,ro already solved it. It was
the obvious, simple fix.

Did it cause problems? No, it didn't, because this "inode reclaim"
doesn't affect the working set of inodes in memory. i.e. the working
set is pinned in memory by cached dentries and so, by definition,
they are not on the inactive inode lists that can be purged by
reclaim during freeze/remount,ro.

To put this in modern context, the old code was essentially:

	xfs_inodegc_flush()
	xfs_ail_push_all_sync()
	xfs_reclaim_inodes()

Note that there is no xfs_blockgc_flush_all() call in there - it
didn't touch inodes in the working set, just processed inodes
already queued for inactivation work. And in kernels since about
3.10, the xfs_reclaim_inodes() call simply expedited the reclaim
that was going to happen in the background within 5 seconds, so it's
not like this actually changed anything material in terms of inode
cache behaviour.

Indeed, the commit that removed the inode reclaim in 2020 says this:

    For xfs_quiesce_attr() we can just remove the inode reclaim calls as
    they are a historic relic that was required to flush dirty inodes
    that contained unlogged changes. We now log all changes to the
    inodes, so the sync AIL push from xfs_log_quiesce() called by
    xfs_quiesce_attr() will do all the required inode writeback for
    freeze.

IOWs, it was removed because we realised it was redundant and no
longer necessary for correct behaviour of the freeze operation.
That's just normal development process, there was nothing "crazy"
about the old code, it just wasn't necessary anymore.

> But now we obviously leave around blockgc inodes as a side
> effect.

As per above: we have -always done that-. We have never run blockgc
on the current working set of inodes during a freeze. We don't want
to - that will perturb the runtime behaviour of the applications
beyond the freeze operation, and potentially not in a good way.

This is how we've implemented freeze for 20-odd years, and there's
no evidence that it is actually a behaviour that needs changing.
Having a bug in freeze related to an inode that needs blockgc does
not mean "freeze should leave no inodes in memory that need
blockgc". All it means is that we need to handle inodes that need
blockgc during a freeze in a better way.....

> I think the log is intentionally left dirty (but forced and
> covered) to ensure the snapshot processes unlinked inodes as well (and I
> think there's been a lot of back and forth on that over the years), but
> could be misremembering...

That's a different problem altogether....

> > > It is a big hammer that should be
> > > scheduled with care wrt to any performance sensitive workloads, and
> > > should be minimally disruptive to the system when held for a
> > > non-deterministic/extended amount of time. Departures from that are
> > > either optimizations or extra feature/modifiers that we currently don't
> > > have a great way to control. Just my .02.
> > 
> > <nod> Is there anyone interested in working on adding a mode parameter
> > to FIFREEZE?  What happens if the freeze comes in via the block layer?
> > 
> 
> We'd probably want to audit and maybe try to understand the various use
> cases before getting too far with userspace API (at least defining flags
> anyways). At least I'm not sure I really have a clear view of anything
> outside of the simple snapshot case.

That's what I keep asking: what user problems are these proposed
changes actually trying to solve? Start with use cases, not grand
designs!

> IOW, suppose in theory freeze by default was the biggest hammer possible
> and made the filesystem generally flushed and clean on disk, but then
> grew a flag input for behavior modifiers. What behavior would we modify
> and why?

It was originally defined by the XFS_IOC_FREEZE API on Irix but we
never used it for anything on Linux. We didn't even validate it was
zero, so it really is just a result of a poor API implementation
done 25 years ago. I can see how this sort of little detail is
easily missed in the bigger "port an entire filesystem to a
different OS" picture....

> Would somebody want a NOFLUSH thing that is purely a runtime
> quiesce? Or something in between where data and log are flushed, but no
> blockgc scanning or log forcing and so forth? Something else?

No idea what the original intent was.  sync() is the filesystem
integrity flush, freeze is the "on-disk consistent" integrity flush,
and there's not much between those two things that is actually
useful to users, admins and/or applications.

The only thing I've seen proposed for the FIFREEZE argument is a
"timeout" variable to indicate the maximum time the filesystem
should stay frozen for before it automatically thaws itself (i.e. an
anti foot-gun mechanism). That went nowhere because it's just not
possible for admins and applications to use a timeout in a reliable
way...

-Dave.
Brian Foster Feb. 12, 2024, 7:43 p.m. UTC | #16
On Fri, Feb 09, 2024 at 11:23:48AM +1100, Dave Chinner wrote:
> On Thu, Feb 08, 2024 at 07:54:00AM -0500, Brian Foster wrote:
> > On Wed, Feb 07, 2024 at 09:36:20AM -0800, Darrick J. Wong wrote:
> > > On Tue, Feb 06, 2024 at 08:28:09AM -0500, Brian Foster wrote:
> > > > On Mon, Feb 05, 2024 at 02:07:27PM -0800, Darrick J. Wong wrote:
> > > > > On Sun, Feb 04, 2024 at 11:03:07AM -0500, Brian Foster wrote:
> > > But I'd wondered over the years if blockgc ought to ignore files that
> > > are still opened for write unless we're scouring for free space due to
> > > an ENOSPC.  Maybe the current heuristic of skipping files with dirty
> > > pagecache or IOLOCK contention is good enough.
> > > 
> > 
> > The existing heuristic may be fine for the current model of processing,
> > but I think this is interesting in another context. I have a couple
> > prototype variants around that work by keeping blockgc inodes out of
> > eviction until blockgc work is actually processed. To do something like
> > that probably warrants blockgc work to be more shrinker driven to
> > accommodate memory pressure, so ISTM that having some opened/writeable
> > logic to more intelligently select which blockgc inodes to process for
> > the purpose of eviction/reclaim could be rather useful.
> > 
> > > > FWIW, this sounds more like a generic improvement to the background scan
> > > > to me. Background blockgc currently filters out on things like whether
> > > > the file is dirty in pagecache. If you have a log file or something, I
> > > > would think the regular background scan may end up processing such files
> > > > more frequently than a freeze induced one will..? And for anything that
> > > > isn't under active or continuous modification, freeze is already going
> > > > to flush everything out for the first post-unfreeze background scan to
> > > > take care of.
> > > 
> > > Mhm.
> > > 
> > > > So I dunno, I think I agree and disagree. :) I think it would be
> > > > perfectly reasonable to add an open/writeable file filter check to the
> > > > regular background scan to make it less aggressive. This patch does
> > > > invoke the background scan, but only because of the wonky read into a
> > > > mapped buffer use case.
> > > 
> > > Does this livelock happen on a non-frozen filesystem too?  I wasn't too
> > > sure if you wrote about that in the commit message because there's a
> > > real livelock bug w.r.t. that or if that sentence was simply explaining
> > > the use of an async scan.
> > > 
> > 
> > The livelock was purely a SYNC blockgc scan vs. SB_FREEZE_PAGEFAULT
> > thing. The original patch I sent way back when did the sync scan, but
> > appeared to be susceptible to livelock if there was a blocked task
> > reading (i.e. holding iolock) and causing a page fault by copying into a
> > mapped buffer, because the scan will continuously try for the iolock
> > that won't ever be released. So the non-sync variant was a followup
> > suggestion as a best effort scan.
> > 
> > I think that if we could have a mode where the vfs called ->freeze_fs()
> > once per freeze stage instead of only once at the end, then we could
> > actually do a sync scan under SB_FREEZE_WRITE protection (possibly
> > followed by another non-sync scan under PAGEFAULT to catch any races).
> 
> To what purpose, though? It's being talked about as a way to run a
> blockgc pass during freeze on XFS, but we don't need blockgc
> passes in freeze for XFS for deadlock avoidance or correct freeze
> behaviour.
> 
> Hence I just don't see what problems this complexity is going to
> fix. What am I missing?
>

I mentioned in a reply or two back to Darrick that the deadlock/inode
lookup behavior topic was a correlation I lazily/wrongly carried forward
from the old discussion, and it caused unnecessary confusion.

> > > > I still think freeze should (eventually) rather
> > > > invoke the more aggressive sync scan and process all pending work before
> > > > quiesce and not alter behavior based on heuristics.
> > > 
> > > Admittedly, given how much recovery /can/ be required, I'm starting to
> > > think that we could push more of that work to disk before the freeze.
> > 
> > I agree, but at least recovery is predictable. If you have a situation
> > where you have a bunch of inodes with post-eof blocks and then take a
> > snapshot, the fact that the snap might now have GBs of space off the end
> > of random inodes somewhere in the fs is pretty wonky behavior to me. I
> 
> We've had that behaviour for 20+ years.  There's no evidence that it
> is problematic and, realistically, it's not different to the
> filesystem state after a system crash. i.e. crash your system, and
> every inode in memory that has preallocated blocks beyond EOF now
> has them persistently on disk, and they don't get cleaned up until
> the next time they get cycled through the inode cache.
> 
> Further, we intentionally disable blockgc on any inode that has been
> opened O_APPEND or had fallocate() called to preallocate blocks. So
> regardless of anything else, blockgc scans are not a guarantee that
> we clean up post-eof blocks during a freeze - they will exist on
> disk after a freeze regardless of what we do during the freeze
> process.
> 
> Also, we use the same quiesce code for remount,ro as we use for
> freeze. Hence remount,ro leaves all the inodes in memory with
> pending blockgc unchanged. The post-eof blocks are persisted on
> disk, and they won't get removed until the filesystem is mounted rw
> again and the inode is cycled through cache.
> 
> And then there's system crashes. They leave post-eof blocks
> persistent on disk, too, and there is no possibility of blockgc
> being done on them. Recovery does not run GC on these post-eof
> blocks, nor should it.
> 
> So given that remount,ro and system crashes result in exactly the
> same on-disk state as a freeze and we can't do anything about
> crashes resulting in this state, why should we try to avoid
> having post-eof blocks on disk during a freeze? They are going to be
> on disk before a freeze, and then recreated immediately after a
> freeze, so why does a freeze specifically need to remove them?
> 

Because remount-ro isn't actually the same (xfs_remount_ro() runs a sync
blockgc scan), and there is value in the state of a frozen fs on disk
being more predictable than it is after a system crash.

We don't (though not necessarily can't) address the speculative prealloc
problem after system crashes, but the COW blocks case actually does get
handled because it doesn't have much of a choice. We also can and do
provide better-than-crash behavior for freeze by virtue of syncing the fs
and forcing and covering the log.

> > suspect the only real way to reclaim that space is to cycle every inode
> > in the snapshot fs through the cache such that the extent list is read
> > and inode reclaim can identify whether there is post-eof space to trim,
> > but I've not actually experimented to see how bad that really is.
> 
> We know how expensive that is: quotacheck has this exact same
> requirement.  i.e. quotacheck has to cycle every inode in the
> filesystem through the inode cache, and that will run post-eof block
> removal as the inodes cycle out of cache.
> 
> And let's not forget that the reason we journal dquots is so that we
> don't have to run a quotacheck on every mount because the overhead
> is prohibitive on filesystems with millions of inodes.
> 
> If a full filesystem inode scan is necessary for blockgc, then do it
> from userspace. A simple bulkstat pass will cycle every inode in the
> filesystem through the cache, and so problem solved without having
> to change a single line of XFS code anywhere.
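
A minimal sketch of that userspace bulkstat pass, using the older
XFS_IOC_FSBULKSTAT interface and assuming the xfsprogs development headers
(<xfs/xfs.h>) are installed; newer kernels also expose a v5 XFS_IOC_BULKSTAT
variant:

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>	/* XFS_IOC_FSBULKSTAT, struct xfs_fsop_bulkreq */

int
main(int argc, char **argv)
{
	static struct xfs_bstat	buf[1024];
	__u64			lastino = 0;
	__s32			ocount = 0;
	uint64_t		total = 0;
	int			fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	struct xfs_fsop_bulkreq	req = {
		.lastip = &lastino,
		.icount = 1024,
		.ubuffer = buf,
		.ocount = &ocount,
	};

	/* each call stats the next batch, pulling those inodes into cache */
	while (ioctl(fd, XFS_IOC_FSBULKSTAT, &req) == 0 && ocount > 0)
		total += ocount;

	printf("bulkstatted %llu inodes\n", (unsigned long long)total);
	close(fd);
	return 0;
}

Recent xfs_io also grew a bulkstat command that can drive much the same walk,
if memory serves.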
> 
> > > > Given the current implementation, I think ultimately it just depends on
> > > > your perspective of what freeze is supposed to do. To me, it should
> > > > reliably put the filesystem into a predictable state on-disk (based on
> > > > the common snapshot use case).
> > > 
> > > I always thought freeze was only supposed to do the bare minimum needed
> > > to quiesce the filesystem, assuming that the common case is that we
> > > quickly thaw and resume runtime operations.  OTOH a dirty 1GB log will
> > > take a while to recover, and if the point was to make a backup or
> > > something, that just makes IT unhappy.
> > 
> > It seems the reality is that it's neither of these things we think it
> > should be, and rather just the result of a series of short term tweaks
> > and mods over a long period of time. :)
> 
> IMO, freeze *requirements* are unchanged from what they were 25
> years ago. Freeze is simply required to bring the filesystem down to
> a consistent state on disk that is entirely readable without
> requiring the filesystem to perform any more write operations to
> service those reads, and to hold it unchanged in that state until the
> filesystem is thawed.
> 
> A further desirable runtime characteristic is that it is
> fast and has minimal runtime impact on application behaviour.
> 
> Yes, the implementation has changed over the years as we've found
> bugs, changed the way other subsystems work, etc. But the actual
> functional requirements of what freeze is supposed to provide have
> not changed at all.
> 
> > I think at one point the entire
> > inode working set was reclaimed on freeze, but that was removed due to
> > being crazy.
> 
> Yes, up until 2020 it did do inode reclaim, but that was an
> implementation detail. And it wasn't removed because it was "crazy",
> it was removed because it was no longer necessary for correct
> behaviour.
> 
> Back in 2005 we fixed some freeze issues by making it behave like
> remount,ro. They both do largely the same thing in terms of bringing
> the fs down to a consistent clean state on disk, so sharing the
> implementation made a lot of sense. At the time, the only way to
> prevent reclaimable inodes from running transactions after a
> specific point in time was to process them all. i.e. inactivate all
> the inodes, write them back and reclaim them. That was the problem
> we needed to fix in freeze, and remount,ro already solved it. It was
> the obvious, simple fix.
> 
> Did it cause problems? No, it didn't, because this "inode reclaim"
> doesn't affect the working set of inodes in memory. i.e. the working
> set is pinned in memory by cached dentries and so, by definition,
> they are not on the inactive inode lists that can be purged by
> reclaim during freeze/remount,ro.
> 

Hmm.. yeah, I see what xfs_reclaim_inodes() does. I could have sworn we
had a full invalidation in there somewhere at some point in the past. I
don't see that on a quick look back, so I could either be misremembering
or misinterpreting the reclaim call.

> To put this in modern context, the old code was essentially:
> 
> 	xfs_inodegc_flush()
> 	xfs_ail_push_all_sync()
> 	xfs_reclaim_inodes()
> 
> Note that there is no xfs_blockgc_flush_all() call in there - it
> didn't touch inodes in the working set, just processed inodes
> already queued for inactivation work. And in kernels since about
> 3.10, the xfs_reclaim_inodes() call simply expedited the reclaim
> that was going to happen in the background within 5 seconds, so it's
> not like this actually changed anything material in terms of inode
> cache behaviour.
> 
> Indeed, the commit that removed the inode reclaim in 2020 says this:
> 
>     For xfs_quiesce_attr() we can just remove the inode reclaim calls as
>     they are a historic relic that was required to flush dirty inodes
>     that contained unlogged changes. We now log all changes to the
>     inodes, so the sync AIL push from xfs_log_quiesce() called by
>     xfs_quiesce_attr() will do all the required inode writeback for
>     freeze.
> 
> IOWs, it was removed because we realised it was redundant and no
> longer necessary for correct behaviour of the freeze operation.
> That's just normal development process, there was nothing "crazy"
> about the old code, it just wasn't necessary anymore.
> 
> > But now we obviously leave around blockgc inodes as a side
> > effect.
> 
> As per above: we have -always done that-. We have never run blockgc
> on the current working set of inodes during a freeze. We don't want
> to - that will perturb the runtime behaviour of the applications
> beyond the freeze operation, and potentially not in a good way.
> 

This is the primary tradeoff we've been discussing. How would this
sufficiently perturb applications to the point of being problematic? ISTM
that the only case where this is even observable is for files being
actively extended across the freeze, and I'm not seeing how an added
blockgc cycle per freeze operation would be beyond negligible from a
performance or fragmentation perspective for common use cases.

What is less clear to me is whether the same applies to the COW/reflink
use case. Darrick seemed to be on the fence. He would have to comment
where he is on the runtime vs. snapshot -> recovery tradeoff. This is
certainly not worth doing in current form if actively harmful in either
situation.

Brian

> This is how we've implemented freeze for 20-odd years, and there's
> no evidence that it is actually a behaviour that needs changing.
> Having a bug in freeze related to an inode that needs blockgc does
> not mean "freeze should leave no inodes in memory that need
> blockgc". All it means is that we need to handle inodes that need
> blockgc during a freeze in a better way.....
> 
> > I think the log is intentionally left dirty (but forced and
> > covered) to ensure the snapshot processes unlinked inodes as well (and I
> > think there's been a lot of back and forth on that over the years), but
> > could be misremembering...
> 
> That's a different problem altogether....
> 
> > > > It is a big hammer that should be
> > > > scheduled with care wrt to any performance sensitive workloads, and
> > > > should be minimally disruptive to the system when held for a
> > > > non-deterministic/extended amount of time. Departures from that are
> > > > either optimizations or extra feature/modifiers that we currently don't
> > > > have a great way to control. Just my .02.
> > > 
> > > <nod> Is there anyone interested in working on adding a mode parameter
> > > to FIFREEZE?  What happens if the freeze comes in via the block layer?
> > > 
> > 
> > We'd probably want to audit and maybe try to understand the various use
> > cases before getting too far with userspace API (at least defining flags
> > anyways). At least I'm not sure I really have a clear view of anything
> > outside of the simple snapshot case.
> 
> That's what I keep asking: what user problems are these proposed
> changes actually trying to solve? Start with use cases, not grand
> designs!
> 
> > IOW, suppose in theory freeze by default was the biggest hammer possible
> > and made the filesystem generally flushed and clean on disk, but then
> > grew a flag input for behavior modifiers. What behavior would we modify
> > and why?
> 
> It was originally defined by the XFS_IOC_FREEZE API on Irix but we
> never used it for anything on Linux. We didn't even validate it was
> zero, so it really is just a result of a poor API implementation
> done 25 years ago. I can see how this sort of little detail is
> easily missed in the bigger "port an entire filesystem to a
> different OS" picture....
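
For context, a sketch of how FIFREEZE/FITHAW (which share ioctl numbers with
the old XFS_IOC_FREEZE/XFS_IOC_THAW pair) are driven today; the int argument is
the one in question, accepted but never examined by current kernels. Freezing a
live filesystem is disruptive, so this is illustration only:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* FIFREEZE, FITHAW */

int
main(int argc, char **argv)
{
	int	arg = 0;	/* accepted but never examined by the kernel */
	int	fd, ret = 0;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* freeze and immediately thaw; requires CAP_SYS_ADMIN */
	if (ioctl(fd, FIFREEZE, &arg) < 0) {
		perror("FIFREEZE");
		ret = 1;
	} else if (ioctl(fd, FITHAW, &arg) < 0) {
		perror("FITHAW");
		ret = 1;
	}

	close(fd);
	return ret;
}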
> 
> > Would somebody want a NOFLUSH thing that is purely a runtime
> > quiesce? Or something inbetween where data and log are flushed, but no
> > blockgc scanning or log forcing and so forth? Something else?
> 
> No idea what the original intent was.  sync() is the filesystem
> integrity flush, freeze is the "on-disk consistent" integrity flush,
> and there's not much between those two things that is actually
> useful to users, admins and/or applications.
> 
> The only thing I've seen proposed for the FIFREEZE argument is a
> "timeout" variable to indicate the maximum time the filesystem
> should stay frozen for before it automatically thaws itself (i.e. an
> anti foot-gun mechanism). That went nowhere because it's just not
> possible for admins and applications to use a timeout in a reliable
> way...
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Brian Foster Feb. 13, 2024, 5:56 p.m. UTC | #17
On Mon, Feb 12, 2024 at 02:43:09PM -0500, Brian Foster wrote:
> On Fri, Feb 09, 2024 at 11:23:48AM +1100, Dave Chinner wrote:
> > On Thu, Feb 08, 2024 at 07:54:00AM -0500, Brian Foster wrote:
> > > On Wed, Feb 07, 2024 at 09:36:20AM -0800, Darrick J. Wong wrote:
> > > > On Tue, Feb 06, 2024 at 08:28:09AM -0500, Brian Foster wrote:
> > > > > On Mon, Feb 05, 2024 at 02:07:27PM -0800, Darrick J. Wong wrote:
> > > > > > On Sun, Feb 04, 2024 at 11:03:07AM -0500, Brian Foster wrote:
> > > > But I'd wondered over the years if blockgc ought to ignore files that
> > > > are still opened for write unless we're scouring for free space due to
> > > > an ENOSPC.  Maybe the current heuristic of skipping files with dirty
> > > > pagecache or IOLOCK contention is good enough.
> > > > 
> > > 
> > > The existing heuristic may be fine for the current model of processing,
> > > but I think this is interesting in another context. I have a couple
> > > prototype variants around that work by keeping blockgc inodes out of
> > > eviction until blockgc work is actually processed. To do something like
> > > that probably warrants blockgc work to be more shrinker driven to
> > > accommodate memory pressure, so ISTM that having some opened/writeable
> > > logic to more intelligently select which blockgc inodes to process for
> > > the purpose of eviction/reclaim could be rather useful.
> > > 
> > > > > FWIW, this sounds more like a generic improvement to the background scan
> > > > > to me. Background blockgc currently filters out on things like whether
> > > > > the file is dirty in pagecache. If you have a log file or something, I
> > > > > would think the regular background scan may end up processing such files
> > > > > more frequently than a freeze induced one will..? And for anything that
> > > > > isn't under active or continuous modification, freeze is already going
> > > > > to flush everything out for the first post-unfreeze background scan to
> > > > > take care of.
> > > > 
> > > > Mhm.
> > > > 
> > > > > So I dunno, I think I agree and disagree. :) I think it would be
> > > > > perfectly reasonable to add an open/writeable file filter check to the
> > > > > regular background scan to make it less aggressive. This patch does
> > > > > invoke the background scan, but only because of the wonky read into a
> > > > > mapped buffer use case.
> > > > 
> > > > Does this livelock happen on a non-frozen filesystem too?  I wasn't too
> > > > sure if you wrote about that in the commit message because there's a
> > > > real livelock bug w.r.t. that or if that sentence was simply explaining
> > > > the use of an async scan.
> > > > 
> > > 
> > > The livelock was purely a SYNC blockgc scan vs. SB_FREEZE_PAGEFAULT
> > > thing. The original patch I sent way back when did the sync scan, but
> > > appeared to be susceptible to livelock if there was a blocked task
> > > reading (i.e. holding iolock) and causing a page fault by copying into a
> > > mapped buffer, because the scan will continuously try for the iolock
> > > that won't ever be released. So the non-sync variant was a followup
> > > suggestion as a best effort scan.
> > > 
> > > I think that if we could have a mode where the vfs called ->freeze_fs()
> > > once per freeze stage instead of only once at the end, then we could
> > > actually do a sync scan under SB_FREEZE_WRITE protection (possibly
> > > followed by another non-sync scan under PAGEFAULT to catch any races).
> > 
> > To what purpose, though? It's being talked about as a way to run a
> > blockgc pass during freeze on XFS, but we don't need a blockgc
> > passes in freeze for XFS for deadlock avoidance or correct freeze
> > behaviour.
> > 
> > Hence I just don't see what problems this complexity is going to
> > fix. What am I missing?
> >
> 
> I mentioned in a reply or two back to Darrick that the deadlock/inode
> lookup behavior topic was a correlation I lazily/wrongly carried forward
> from the old discussion, and that it caused unnecessary confusion.
> 
> > > > > I still think freeze should (eventually) rather
> > > > > invoke the more aggressive sync scan and process all pending work before
> > > > > quiesce and not alter behavior based on heuristics.
> > > > 
> > > > Admittedly, given how much recovery /can/ be required, I'm starting to
> > > > think that we could push more of that work to disk before the freeze.
> > > 
> > > I agree, but at least recovery is predictable. If you have a situation
> > > where you have a bunch of inodes with post-eof blocks and then take a
> > > snapshot, the fact that the snap might now have GBs of space off the end
> > > of random inodes somewhere in the fs is pretty wonky behavior to me. I
> > 
> > We've had that behaviour for 20+ years.  There's no evidence that it
> > is problematic and, realistically, it's not different to the
> > filesystem state after a system crash. i.e. crash your system, and
> > every inode in memory that has preallocated blocks beyond EOF now
> > has them persistently on disk, and they don't get cleaned up until
> > the next time they get cycled through the inode cache.
> > 
> > Further, we intentionally disable blockgc on any inode that has been
> > opened O_APPEND or had fallocate() called to preallocate blocks. So
> > regardless of anything else, blockgc scans are not a guarantee that
> > we clean up post-eof blocks during a freeze - they will exist on
> > disk after a freeze regardless of what we do during the freeze
> > process.
> > 
> > Also, we use the same quiesce code for remount,ro as we use for
> > freeze. Hence remount,ro leaves all the inodes in memory with
> > pending blockgc unchanged. The post-eof blocks are persisted on
> > disk, and they won't get removed until the filesystem is mounted rw
> > again and the inode is cycled through cache.
> > 
> > And then there's system crashes. They leave post-eof blocks
> > persistent on disk, too, and there is no possibility of blockgc
> > being done on them. Recovery does not run GC on these post-eof
> > blocks, nor should it.
> > 
> > So given that remount,ro and system crashes result in exactly the
> > same on-disk state as a freeze and we can't do anything about
> > crashes resulting in this state, why should we try to avoid
> > having post-eof blocks on disk during a freeze? They are going to be
> > on disk before a freeze, and then recreated immediately after a
> > freeze, so why does a freeze specifically need to remove them?
> > 
> 
> Because remount-ro isn't actually the same (xfs_remount_ro() runs a sync
> blockgc scan) and there is value in the state of a frozen fs on disk
> being more predictable than a system crash.
> 
> We don't (though not necessarily can't) address the speculative prealloc
> problem across system crashes, but the COW blocks case is addressed
> because there isn't much of a choice. We also can and do provide better
> than crash-like behavior for freeze by virtue of syncing the fs and
> forcing and covering the log.
> 
> > > suspect the only real way to reclaim that space is to cycle every inode
> > > in the snapshot fs through the cache such that the extent list is read
> > > and inode reclaim can identify whether there is post-eof space to trim,
> > > but I've not actually experimented to see how bad that really is.
> > 
> > We know how expensive that is: quotacheck has this exact same
> > requirement.  i.e. quotacheck has to cycle every inode in the
> > filesystem through the inode cache, and that will run post-eof block
> > removal as the inodes cycle out of cache.
> > 
> > And let's not forget that the reason we journal dquots is so that we
> > don't have to run a quotacheck on every mount because the overhead
> > is prohibitive on filesystems with millions of inodes.
> > 
> > If a full filesystem inode scan is necessary for blockgc, then do it
> > from userspace. A simple bulkstat pass will cycle every inode in the
> > filesystem through the cache, and so problem solved without having
> > to change a single line of XFS code anywhere.
> > 
> > > > > Given the current implementation, I think ultimately it just depends on
> > > > > your perspective of what freeze is supposed to do. To me, it should
> > > > > reliably put the filesystem into a predictable state on-disk (based on
> > > > > the common snapshot use case).
> > > > 
> > > > I always thought freeze was only supposed to do the bare minimum needed
> > > > to quiesce the filesystem, assuming that the common case is that we
> > > > quickly thaw and resume runtime operations.  OTOH a dirty 1GB log will
> > > > take a while to recover, and if the point was to make a backup or
> > > > something, that just makes IT unhappy.
> > > 
> > > It seems the reality is that it's neither of these things we think it
> > > should be, and rather just the result of a series of short term tweaks
> > > and mods over a long period of time. :)
> > 
> > IMO, freeze *requirements* are unchanged from what they were 25
> > years ago. Freeze is simply required to bring the filesystem down to
> > a consistent state on disk that is entirely readable without
> > requiring the filesystem to perform any more write operations to
> > service those reads, and to hold it unchanged in that state until the
> > filesystem is thawed.
> > 
> > A further desirable runtime characteristic is that it is
> > fast and has minimal runtime impact on application behaviour.
> > 
> > Yes, the implementation has changed over the years as we've found
> > bugs, changed the way other subsystems work, etc. But the actual
> > functional requirements of what freeze is supposed to provide have
> > not changed at all.
> > 
> > > I think at one point the entire
> > > inode working set was reclaimed on freeze, but that was removed due to
> > > being crazy.
> > 
> > Yes, up until 2020 it did do inode reclaim, but that was an
> > implementation detail. And it wasn't removed because it was "crazy",
> > it was removed because it was no longer necessary for correct
> > behaviour.
> > 
> > Back in 2005 we fixed some freeze issues by making it behave like
> > remount,ro. They both do largely the same thing in terms of bringing
> > the fs down to a consistent clean state on disk, so sharing the
> > implementation made a lot of sense. At the time, the only way to
> > prevent reclaimable inodes from running transactions after a
> > specific point in time was to process them all. i.e. inactivate all
> > the inodes, write them back and reclaim them. That was the problem
> > we needed to fix in freeze, and remount,ro already solved it. It was
> > the obvious, simple fix.
> > 
> > Did it cause problems? No, it didn't, because this "inode reclaim"
> > doesn't affect the working set of inodes in memory. i.e. the working
> > set is pinned in memory by cached dentries and so, by definition,
> > they are not on the inactive inode lists that can be purged by
> > reclaim during freeze/remount,ro.
> > 
> 
> Hmm.. yeah, I see what xfs_reclaim_inodes() does. I could have sworn we
> had a full invalidation in there somewhere at some point in the past. I
> don't see that on a quick look back, so I could either be misremembering
> or misinterpreting the reclaim call.
> 
> > To put this in modern context, the old code was essentially:
> > 
> > 	xfs_inodegc_flush()
> > 	xfs_ail_push_all_sync()
> > 	xfs_reclaim_inodes()
> > 
> > Note that there is no xfs_blockgc_flush_all() call in there - it
> > didn't touch inodes in the working set, just processed inodes
> > already queued for inactivation work. And in kernels since about
> > 3.10, the xfs_reclaim_inodes() call simply expedited the reclaim
> > that was going to happen in the background within 5 seconds, so it's
> > not like this actually changed anything material in terms of inode
> > cache behaviour.
> > 
> > Indeed, the commit that removed the inode reclaim in 2020 says this:
> > 
> >     For xfs_quiesce_attr() we can just remove the inode reclaim calls as
> >     they are a historic relic that was required to flush dirty inodes
> >     that contained unlogged changes. We now log all changes to the
> >     inodes, so the sync AIL push from xfs_log_quiesce() called by
> >     xfs_quiesce_attr() will do all the required inode writeback for
> >     freeze.
> > 
> > IOWs, it was removed because we realised it was redundant and no
> > longer necessary for correct behaviour of the freeze operation.
> > That's just normal development process, there was nothing "crazy"
> > about the old code, it just wasn't necessary anymore.
> > 
> > > But now we obviously leave around blockgc inodes as a side
> > > effect.
> > 
> > As per above: we have -always done that-. We have never run blockgc
> > on the current working set of inodes during a freeze. We don't want
> > to - that will perturb the runtime behaviour of the applications
> > beyond the freeze operation, and potentially not in a good way.
> > 
> 
> This is the primary tradeoff we've been discussing. How would this
> sufficiently perturb applications to the point of being problematic? ISTM
> that the only case where this is even observable is for files being
> actively extended across the freeze, and I'm not seeing how an added
> blockgc cycle per freeze operation would be beyond negligible from a
> performance or fragmentation perspective for common use cases.
> 
> What is less clear to me is whether the same applies to the COW/reflink
> use case. Darrick seemed to be on the fence. He would have to comment
> where he is on the runtime vs. snapshot -> recovery tradeoff. This is
> certainly not worth doing in current form if actively harmful in either
> situation.
> 

I ran a couple tests to try and substantiate this a bit more and measure
impact on both use cases vs. a pathological freeze workload...

Without going too deep into the details, a concurrent sequential
buffered write workload seemed less sensitive to pathological freezing
(i.e. a 1m freeze loop). The freeze loop reduced the average extent size
from ~5GB in a baseline test (no freezing) down to around ~3GB with a
freeze cycle every 60s. This workload involved a total of around 512GB
written across 8 files (64GB per file).

A random write aio+dio workload to a prewritten and then reflinked file
was much more sensitive to blockgc scans. This test wrote less data to
keep runtime down, but a 5m freeze loop (a total of only two freeze
cycles) brought the extent count of an 8GB file from ~8191 1MB extents
(i.e. the cowextsize hint) in the baseline test up to 119488 with
freezing enabled, for an average size of around ~70KB per extent.
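
For reference, extent counts like those above can be grabbed with xfs_bmap or,
as a small sketch, via FIEMAP with fm_extent_count set to zero so the kernel
only reports how many extents exist:

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FS_IOC_FIEMAP */
#include <linux/fiemap.h>	/* struct fiemap, FIEMAP_* */

int
main(int argc, char **argv)
{
	struct fiemap	fm;
	int		fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	memset(&fm, 0, sizeof(fm));
	fm.fm_start = 0;
	fm.fm_length = FIEMAP_MAX_OFFSET;
	fm.fm_flags = FIEMAP_FLAG_SYNC;	/* flush dirty data first */
	fm.fm_extent_count = 0;		/* just report the count */

	if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0) {
		perror("FS_IOC_FIEMAP");
		close(fd);
		return 1;
	}
	printf("%s: %u extents\n", argv[1], fm.fm_mapped_extents);
	close(fd);
	return 0;
}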

Darrick,

I think this lends merit to your idea around having blockgc filter out
files currently open for write. Given the impact of only a couple scans
on the cowblocks test, it seems that even if a file seeing this sort of
workload (i.e. suppose a cloned VM image or something) idled for more
than a couple minutes or so, there's a decent chance a background scan
would come through and have the same kind of effect.

WRT freeze, this is enough for me to say we probably shouldn't be any
more aggressive with cowblocks files. I suppose a cowblocks-only "open
for write" filter would still allow an eofblocks scan to trim preallocs
before a snapshot is taken. The same filter applied to eofblocks would
make that less effective, but then I suppose that would also remove most
of the risk associated with a freeze-time blockgc scan. Not sure it's
really worth it at that point though.

Anyways, just something to think about. ISTM that's a potentially more
valuable improvement on its own than anything freeze related..

Brian

> Brian
> 
> > This is how we've implemented freeze for 20-odd years, and there's
> > no evidence that it is actually a behaviour that needs changing.
> > Having a bug in freeze related to an inode that needs blockgc does
> > not mean "freeze should leave no inodes in memory that need
> > blockgc". All it means is that we need to handle inodes that need
> > blockgc during a freeze in a better way.....
> > 
> > > I think the log is intentionally left dirty (but forced and
> > > covered) to ensure the snapshot processes unlinked inodes as well (and I
> > > think there's been a lot of back and forth on that over the years), but
> > > could be misremembering...
> > 
> > That's a different problem altogether....
> > 
> > > > > It is a big hammer that should be
> > > > > scheduled with care wrt to any performance sensitive workloads, and
> > > > > should be minimally disruptive to the system when held for a
> > > > > non-deterministic/extended amount of time. Departures from that are
> > > > > either optimizations or extra feature/modifiers that we currently don't
> > > > > have a great way to control. Just my .02.
> > > > 
> > > > <nod> Is there anyone interested in working on adding a mode parameter
> > > > to FIFREEZE?  What happens if the freeze comes in via the block layer?
> > > > 
> > > 
> > > We'd probably want to audit and maybe try to understand the various use
> > > cases before getting too far with userspace API (at least defining flags
> > > anyways). At least I'm not sure I really have a clear view of anything
> > > outside of the simple snapshot case.
> > 
> > That's what I keep asking: what user problems are these proposed
> > changes actually trying to solve? Start with use cases, not grand
> > designs!
> > 
> > > IOW, suppose in theory freeze by default was the biggest hammer possible
> > > and made the filesystem generally flushed and clean on disk, but then
> > > grew a flag input for behavior modifiers. What behavior would we modify
> > > and why?
> > 
> > It was originally defined by the XFS_IOC_FREEZE API on Irix but we
> > never used it for anything on Linux. We didn't even validate it was
> > zero, so it really is just a result of a poor API implementation
> > done 25 years ago. I can see how this sort of little detail is
> > easily missed in the bigger "port an entire filesystem to a
> > different OS" picture....
> > 
> > > Would somebody want a NOFLUSH thing that is purely a runtime
> > > quiesce? Or something inbetween where data and log are flushed, but no
> > > blockgc scanning or log forcing and so forth? Something else?
> > 
> > No idea what the original intent was.  sync() is the filesystem
> > integrity flush, freeze is the "on-disk consistent" integrity flush,
> > and there's not much between those two things that is actually
> > useful to users, admins and/or applications.
> > 
> > The only thing I've seen proposed for the FIFREEZE argument is a
> > "timeout" variable to indicate the maximum time the filesystem
> > should stay frozen for before it automatically thaws itself (i.e. an
> > anti foot-gun mechanism). That went nowhere because it's just not
> > possible for admins and applications to use a timeout in a reliable
> > way...
> > 
> > -Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
> 
>
diff mbox series

Patch

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index d0009430a627..43e72e266666 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -657,8 +657,13 @@  xfs_fs_alloc_inode(
 }
 
 /*
- * Now that the generic code is guaranteed not to be accessing
- * the linux inode, we can inactivate and reclaim the inode.
+ * Now that the generic code is guaranteed not to be accessing the inode, we can
+ * inactivate and reclaim it.
+ *
+ * NOTE: ->destroy_inode() can be called (with ->s_umount held) while the
+ * filesystem is frozen. Therefore it is generally unsafe to attempt transaction
+ * allocation in this context. A transaction alloc that blocks on frozen state
+ * from a context with ->s_umount held will deadlock with unfreeze.
  */
 STATIC void
 xfs_fs_destroy_inode(
@@ -811,15 +816,18 @@  xfs_fs_sync_fs(
 	 * down inodegc because once SB_FREEZE_FS is set it's too late to
 	 * prevent inactivation races with freeze. The fs doesn't get called
 	 * again by the freezing process until after SB_FREEZE_FS has been set,
-	 * so it's now or never.  Same logic applies to speculative allocation
-	 * garbage collection.
+	 * so it's now or never.
 	 *
-	 * We don't care if this is a normal syncfs call that does this or
-	 * freeze that does this - we can run this multiple times without issue
-	 * and we won't race with a restart because a restart can only occur
-	 * when the state is either SB_FREEZE_FS or SB_FREEZE_COMPLETE.
+	 * The same logic applies to block garbage collection. Run a best-effort
+	 * blockgc scan to reduce the working set of inodes that the shrinker
+	 * would send to inactivation queue purgatory while frozen. We can't run
+	 * a sync scan with page faults blocked because that could potentially
+	 * livelock against a read task blocked on a page fault (i.e. if reading
+	 * into a mapped buffer) while holding iolock.
 	 */
 	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT) {
+		xfs_blockgc_free_space(mp, NULL);
+
 		xfs_inodegc_stop(mp);
 		xfs_blockgc_stop(mp);
 	}