
[0/3] xfs: miscellaneous bug fixes

Message ID 20241112221920.1105007-1-david@fromorbit.com (mailing list archive)

Message

Dave Chinner Nov. 12, 2024, 10:05 p.m. UTC
These are three bug fixes for recent issues.

The first is a repost of the original patch to prevent allocation of
sparse inode clusters at the end of an unaligned runt AG. There
was plenty of discussion over that fix here:

https://lore.kernel.org/linux-xfs/20241024025142.4082218-1-david@fromorbit.com/

And the outcome of that discussion is that we can't allow sparse
inode clusters to overlap the end of the runt AG without an on-disk
format definition change. Hence this patch, which ensures the check
is done correctly, is the only change we need to make to the kernel
to avoid this problem in the future.
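
To illustrate the shape of that check, here is a minimal userspace
sketch. The function, type names and numbers are stand-ins, not the
actual kernel code; it only captures the invariant being enforced: a
sparse inode cluster is acceptable only if the whole cluster fits
inside the AG, so a runt AG whose length is not cluster-aligned can
never end up with a cluster hanging off its end.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t xfs_agblock_t;  /* AG-relative block number (stand-in) */

/*
 * Hypothetical check: a cluster starting at @cluster_bno and spanning
 * @cluster_len blocks must sit entirely within an AG of @ag_blocks blocks.
 */
static bool
cluster_fits_in_ag(xfs_agblock_t cluster_bno, xfs_agblock_t cluster_len,
                   xfs_agblock_t ag_blocks)
{
        /* a cluster starting at or beyond the end of the AG is invalid */
        if (cluster_bno >= ag_blocks)
                return false;

        /* phrased to avoid unsigned overflow of cluster_bno + cluster_len */
        return cluster_len <= ag_blocks - cluster_bno;
}

int main(void)
{
        /* example numbers only: an 8-block cluster in a 30-block runt AG */
        printf("%d\n", cluster_fits_in_ag(28, 8, 30));  /* 0: overlaps the end */
        printf("%d\n", cluster_fits_in_ag(16, 8, 30));  /* 1: fits entirely */
        return 0;
}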

Filesystems that have this problem on disk will need to run
xfs_repair to remove the bad cluster, but no data loss is possible
from this because the kernel currently disallows inode allocation
from the bad cluster, so none of the inodes in the sparse cluster
can actually be used. Hence there is no data loss or other metadata
corruption possible from this situation; all we need to do is ensure
that it doesn't happen again once repair has done its work.

The other two patches are for issues I've recently hit when running
lots of fstests in parallel. That changes the load, and hence the
timing, of events during tests, exposing latent race conditions in
the code.
The quota fix removes racy debug code that has been there since the
quota code was first committed in 1996.

The log shutdown race fix addresses a much more recent issue, created
by trying to ensure shutdowns operate in a sane and predictable
manner. The logic flaw is that we allow multiple log shutdowns to
start and force the log before selecting a single log shutdown task.
This leads to a situation where shutdown log item callback processing
gets stuck waiting on a task holding a buffer lock that is waiting on
a log force that is waiting on shutdown log item callback processing
to complete...
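
As a rough sketch of the idea only (this is not the actual patch and
all of the names below are made up), the shape of the fix is to pick
the single shutdown winner with an atomic test-and-set before anyone
issues the log force, so losing tasks return immediately instead of
also forcing the log:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_flag shutdown_started = ATOMIC_FLAG_INIT;

static void force_log(void)     { puts("forcing log once"); }
static void shut_down_log(void) { puts("log marked shut down"); }

/* every racing shutdown caller funnels through here */
static void *log_shutdown(void *arg)
{
        (void)arg;

        /* only the first caller wins; everyone else backs off */
        if (atomic_flag_test_and_set(&shutdown_started))
                return NULL;

        force_log();            /* flush pending items from exactly one task */
        shut_down_log();        /* then run the shutdown item callbacks */
        return NULL;
}

int main(void)
{
        pthread_t t[4];

        for (int i = 0; i < 4; i++)
                pthread_create(&t[i], NULL, log_shutdown, NULL);
        for (int i = 0; i < 4; i++)
                pthread_join(t[i], NULL);
        return 0;
}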

Thoughts?

Comments

Darrick J. Wong Nov. 12, 2024, 11:59 p.m. UTC | #1
On Wed, Nov 13, 2024 at 09:05:13AM +1100, Dave Chinner wrote:
> These are three bug fixes for recent issues.
> 
> The first is a repost of the original patch to prevent allocation of
> sparse inode clusters at the end of an unaligned runt AG. There
> was plenty of discussion over that fix here:
> 
> https://lore.kernel.org/linux-xfs/20241024025142.4082218-1-david@fromorbit.com/
> 
> And the outcome of that discussion is that we can't allow sparse
> inode clusters overlapping the end of the runt AG without an on disk
> format definition change. Hence this patch to ensure the check is
> done correctly is the only change we need to make to the kernel to
> avoid this problem in the future.
> 
> Filesystems that have this problem on disk will need to run
> xfs_repair to remove the bad cluster, but no data loss is possible
> from this because the kernel currently disallows inode allocation
> from the bad cluster and so none of the inodes in the sparse cluster
> can actually be used. Hence there is no possible data loss or other
> metadata corruption possible from this situation, all we need to do
> is ensure that it doesn't happen again once repair has done it's
> work.

<shrug> How many systems are in this state?  Would those users rather we
fix the validation code in repair/scrub/wherever to allow ichunks that
overrun the end of a runt AG?

--D

> The other two patches are for issues I've recently hit when running
> lots of fstests in parallel. That changes loading and hence timing
> of events during tests, exposing latent race conditions in the code.
> The quota fix removes racy debug code that has been there since the
> quota code was first committed in 1996.
> 
> The log shutdown race fix is a much more recent issue created by
> trying to ensure shutdowns operate in a sane and predictable manner.
> The logic flaw is that we allow multiple log shutdowns to start and
> force the log before selecting on a single log shutdown task. This
> leads to a situation where shutdown log item callback processing
> gets stuck waiting on a task holding a buffer lock that is waiting
> on a log force that is waiting on shutdown log item callback
> processing to complete...
> 
> Thoughts?
> 
>
Dave Chinner Nov. 13, 2024, 1:09 a.m. UTC | #2
On Tue, Nov 12, 2024 at 03:59:46PM -0800, Darrick J. Wong wrote:
> On Wed, Nov 13, 2024 at 09:05:13AM +1100, Dave Chinner wrote:
> > These are three bug fixes for recent issues.
> > 
> > The first is a repost of the original patch to prevent allocation of
> > sparse inode clusters at the end of an unaligned runt AG. There
> > was plenty of discussion over that fix here:
> > 
> > https://lore.kernel.org/linux-xfs/20241024025142.4082218-1-david@fromorbit.com/
> > 
> > And the outcome of that discussion is that we can't allow sparse
> > inode clusters overlapping the end of the runt AG without an on disk
> > format definition change. Hence this patch to ensure the check is
> > done correctly is the only change we need to make to the kernel to
> > avoid this problem in the future.
> > 
> > Filesystems that have this problem on disk will need to run
> > xfs_repair to remove the bad cluster, but no data loss is possible
> > from this because the kernel currently disallows inode allocation
> > from the bad cluster and so none of the inodes in the sparse cluster
> > can actually be used. Hence there is no possible data loss or other
> > metadata corruption possible from this situation, all we need to do
> > is ensure that it doesn't happen again once repair has done it's
> > work.
> 
> <shrug> How many systems are in this state?

Some. Maybe many. Unfortunately the number is largely
unquantifiable. However, when it happens it dumps corruption reports
in the log, so I'd say that there aren't that many of them out there
because we aren't getting swamped with corruption reports.

> Would those users rather we
> fix the validation code in repair/scrub/wherever to allow ichunks that
> overrun the end of a runt AG?

Uh, the previous discussion ended with "considering inode chunks
overlapping the end of the runt AG as valid requires an incompat
feature flag as older kernels cannot access inodes in that
location". i.e. older kernels will flag those inodes as corrupt if
we don't add an incompat feature flag to indicate they are valid.

At that point, users are forced to upgrade their userspace tools to
do anything with the filesystem, because the kernel added the new
incompat feature flag to it on upgrade.

That's a much worse situation, because they might not realise they
need to upgrade all the userspace tools and disaster recovery
utilities to handle this new format that the kernel upgrade
introduced....

The repair/scrub/whatever code already detects and fixes the issue by
removing the bad cluster from the runt AG. We just need to stop the
kernel creating the bad clusters again.

IOWs, it is just simpler for everyone to fix the bug like this and
continue to consider the sparse inode cluster at the end of the AG
invalid.

Alternatively, if users can grow the block device, then they can
simply round up the size of the block device to a whole inode
chunk. They don't need to run repair to fix the issue; the cluster
is now valid because a whole chunk will fit at the end of the runt AG.
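
As a back-of-the-envelope sketch (the geometry numbers below are
examples only, not taken from any real filesystem), the rounding
works out like this; presumably the filesystem is then grown into the
new space (e.g. with xfs_growfs) so the runt AG picks it up:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        /* example geometry: 64 inodes per chunk, 512 byte inodes, 4KiB blocks */
        uint64_t chunk_blocks = (64 * 512) / 4096;      /* 8 fs blocks per chunk */
        uint64_t dev_blocks = 26214401;                 /* example device size in fs blocks */

        /* round the device size up to the next whole inode chunk */
        uint64_t grown = (dev_blocks + chunk_blocks - 1) / chunk_blocks * chunk_blocks;

        printf("grow the device from %llu to %llu fs blocks\n",
               (unsigned long long)dev_blocks, (unsigned long long)grown);
        return 0;
}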

-Dave.