diff mbox series

[2/2] xfs: recovery should not clear di_flushiter unconditionally

Message ID 20231110044500.718022-3-david@fromorbit.com (mailing list archive)
State New
Headers show
Series xfs: fix recovery corruption on s390 w/ nrext64 | expand

Commit Message

Dave Chinner Nov. 10, 2023, 4:33 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

Because on v3 inodes, di_flushiter doesn't exist. It overlaps with
zero padding in the inode, except when NREXT64=1 configurations are
in use and the zero padding is no longer padding but holds the 64
bit extent counter.

This manifests obviously on big endian platforms (e.g. s390) because
the log dinode is in host order and the overlap is the LSBs of the
extent count field. It is not noticed on little endian machines
because the overlap is at the MSB end of the extent count field and
we need to get more than 2^^48 extents in the inode before it
manifests. i.e. the heat death of the universe will occur before we
see the problem in little endian machines.

This is a zero-day issue for NREXT64=1 configuraitons on big endian
machines. Fix it by only clearing di_flushiter on v2 inodes during
recovery.

Fixes: 9b7d16e34bbe ("xfs: Introduce XFS_DIFLAG2_NREXT64 and associated helpers")
cc: stable@kernel.org # 5.19+
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode_item_recover.c | 32 +++++++++++++++++---------------
 1 file changed, 17 insertions(+), 15 deletions(-)

Comments

Darrick J. Wong Nov. 10, 2023, 7:32 p.m. UTC | #1
On Fri, Nov 10, 2023 at 03:33:14PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Because on v3 inodes, di_flushiter doesn't exist. It overlaps with
> zero padding in the inode, except when NREXT64=1 configurations are
> in use and the zero padding is no longer padding but holds the 64
> bit extent counter.
> 
> This manifests obviously on big endian platforms (e.g. s390) because
> the log dinode is in host order and the overlap is the LSBs of the
> extent count field. It is not noticed on little endian machines
> because the overlap is at the MSB end of the extent count field and
> we need to get more than 2^^48 extents in the inode before it
> manifests. i.e. the heat death of the universe will occur before we
> see the problem in little endian machines.
> 
> This is a zero-day issue for NREXT64=1 configuraitons on big endian
> machines. Fix it by only clearing di_flushiter on v2 inodes during
> recovery.
> 
> Fixes: 9b7d16e34bbe ("xfs: Introduce XFS_DIFLAG2_NREXT64 and associated helpers")
> cc: stable@kernel.org # 5.19+
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_inode_item_recover.c | 32 +++++++++++++++++---------------
>  1 file changed, 17 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
> index f4c31c2b60d5..dbdab4ce7c44 100644
> --- a/fs/xfs/xfs_inode_item_recover.c
> +++ b/fs/xfs/xfs_inode_item_recover.c
> @@ -371,24 +371,26 @@ xlog_recover_inode_commit_pass2(
>  	 * superblock flag to determine whether we need to look at di_flushiter
>  	 * to skip replay when the on disk inode is newer than the log one
>  	 */
> -	if (!xfs_has_v3inodes(mp) &&
> -	    ldip->di_flushiter < be16_to_cpu(dip->di_flushiter)) {
> -		/*
> -		 * Deal with the wrap case, DI_MAX_FLUSH is less
> -		 * than smaller numbers
> -		 */
> -		if (be16_to_cpu(dip->di_flushiter) == DI_MAX_FLUSH &&
> -		    ldip->di_flushiter < (DI_MAX_FLUSH >> 1)) {
> -			/* do nothing */
> -		} else {
> -			trace_xfs_log_recover_inode_skip(log, in_f);
> -			error = 0;
> -			goto out_release;
> +	if (!xfs_has_v3inodes(mp)) {
> +		if (ldip->di_flushiter < be16_to_cpu(dip->di_flushiter)) {
> +			/*
> +			 * Deal with the wrap case, DI_MAX_FLUSH is less
> +			 * than smaller numbers
> +			 */
> +			if (be16_to_cpu(dip->di_flushiter) == DI_MAX_FLUSH &&
> +			    ldip->di_flushiter < (DI_MAX_FLUSH >> 1)) {
> +				/* do nothing */
> +			} else {
> +				trace_xfs_log_recover_inode_skip(log, in_f);
> +				error = 0;
> +				goto out_release;
> +			}
>  		}
> +
> +		/* Take the opportunity to reset the flush iteration count */
> +		ldip->di_flushiter = 0;

Hmm.  Well this fixes the zeroday problem, so thank you for getting the
root of this!

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Though hch did suggest reducing the amount of indenting here by
compressing the if tests together.  I can't decide if it's worth
rearranging that old V4 code since none of it's scheduled for removal
until 2030, but it /is/ legacy code that maybe we just don't care to
touch?

<shrug>

--D

>  	}
>  
> -	/* Take the opportunity to reset the flush iteration count */
> -	ldip->di_flushiter = 0;
>  
>  	if (unlikely(S_ISREG(ldip->di_mode))) {
>  		if ((ldip->di_format != XFS_DINODE_FMT_EXTENTS) &&
> -- 
> 2.42.0
>
Dave Chinner Nov. 10, 2023, 8:36 p.m. UTC | #2
On Fri, Nov 10, 2023 at 11:32:55AM -0800, Darrick J. Wong wrote:
> On Fri, Nov 10, 2023 at 03:33:14PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Because on v3 inodes, di_flushiter doesn't exist. It overlaps with
> > zero padding in the inode, except when NREXT64=1 configurations are
> > in use and the zero padding is no longer padding but holds the 64
> > bit extent counter.
> > 
> > This manifests obviously on big endian platforms (e.g. s390) because
> > the log dinode is in host order and the overlap is the LSBs of the
> > extent count field. It is not noticed on little endian machines
> > because the overlap is at the MSB end of the extent count field and
> > we need to get more than 2^^48 extents in the inode before it
> > manifests. i.e. the heat death of the universe will occur before we
> > see the problem in little endian machines.
> > 
> > This is a zero-day issue for NREXT64=1 configuraitons on big endian
> > machines. Fix it by only clearing di_flushiter on v2 inodes during
> > recovery.
> > 
> > Fixes: 9b7d16e34bbe ("xfs: Introduce XFS_DIFLAG2_NREXT64 and associated helpers")
> > cc: stable@kernel.org # 5.19+
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_inode_item_recover.c | 32 +++++++++++++++++---------------
> >  1 file changed, 17 insertions(+), 15 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
> > index f4c31c2b60d5..dbdab4ce7c44 100644
> > --- a/fs/xfs/xfs_inode_item_recover.c
> > +++ b/fs/xfs/xfs_inode_item_recover.c
> > @@ -371,24 +371,26 @@ xlog_recover_inode_commit_pass2(
> >  	 * superblock flag to determine whether we need to look at di_flushiter
> >  	 * to skip replay when the on disk inode is newer than the log one
> >  	 */
> > -	if (!xfs_has_v3inodes(mp) &&
> > -	    ldip->di_flushiter < be16_to_cpu(dip->di_flushiter)) {
> > -		/*
> > -		 * Deal with the wrap case, DI_MAX_FLUSH is less
> > -		 * than smaller numbers
> > -		 */
> > -		if (be16_to_cpu(dip->di_flushiter) == DI_MAX_FLUSH &&
> > -		    ldip->di_flushiter < (DI_MAX_FLUSH >> 1)) {
> > -			/* do nothing */
> > -		} else {
> > -			trace_xfs_log_recover_inode_skip(log, in_f);
> > -			error = 0;
> > -			goto out_release;
> > +	if (!xfs_has_v3inodes(mp)) {
> > +		if (ldip->di_flushiter < be16_to_cpu(dip->di_flushiter)) {
> > +			/*
> > +			 * Deal with the wrap case, DI_MAX_FLUSH is less
> > +			 * than smaller numbers
> > +			 */
> > +			if (be16_to_cpu(dip->di_flushiter) == DI_MAX_FLUSH &&
> > +			    ldip->di_flushiter < (DI_MAX_FLUSH >> 1)) {
> > +				/* do nothing */
> > +			} else {
> > +				trace_xfs_log_recover_inode_skip(log, in_f);
> > +				error = 0;
> > +				goto out_release;
> > +			}
> >  		}
> > +
> > +		/* Take the opportunity to reset the flush iteration count */
> > +		ldip->di_flushiter = 0;
> 
> Hmm.  Well this fixes the zeroday problem, so thank you for getting the
> root of this!
> 
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> 
> Though hch did suggest reducing the amount of indenting here by
> compressing the if tests together.  I can't decide if it's worth
> rearranging that old V4 code since none of it's scheduled for removal
> until 2030, but it /is/ legacy code that maybe we just don't care to
> touch?

I'd prefer not to touch it. It's now all isolated in a
"!xfs_has_v3inodes()" branch, so I think we can largely ignore the
grot for now. If we have to do a larger rework of this code in
future, then we can look at reworking it.

But right now, I really don't feel like risking breaking something
else by doing unnecessary cleanups on code we haven't needed to
touch in over a decade.

-Dave.
diff mbox series

Patch

diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
index f4c31c2b60d5..dbdab4ce7c44 100644
--- a/fs/xfs/xfs_inode_item_recover.c
+++ b/fs/xfs/xfs_inode_item_recover.c
@@ -371,24 +371,26 @@  xlog_recover_inode_commit_pass2(
 	 * superblock flag to determine whether we need to look at di_flushiter
 	 * to skip replay when the on disk inode is newer than the log one
 	 */
-	if (!xfs_has_v3inodes(mp) &&
-	    ldip->di_flushiter < be16_to_cpu(dip->di_flushiter)) {
-		/*
-		 * Deal with the wrap case, DI_MAX_FLUSH is less
-		 * than smaller numbers
-		 */
-		if (be16_to_cpu(dip->di_flushiter) == DI_MAX_FLUSH &&
-		    ldip->di_flushiter < (DI_MAX_FLUSH >> 1)) {
-			/* do nothing */
-		} else {
-			trace_xfs_log_recover_inode_skip(log, in_f);
-			error = 0;
-			goto out_release;
+	if (!xfs_has_v3inodes(mp)) {
+		if (ldip->di_flushiter < be16_to_cpu(dip->di_flushiter)) {
+			/*
+			 * Deal with the wrap case, DI_MAX_FLUSH is less
+			 * than smaller numbers
+			 */
+			if (be16_to_cpu(dip->di_flushiter) == DI_MAX_FLUSH &&
+			    ldip->di_flushiter < (DI_MAX_FLUSH >> 1)) {
+				/* do nothing */
+			} else {
+				trace_xfs_log_recover_inode_skip(log, in_f);
+				error = 0;
+				goto out_release;
+			}
 		}
+
+		/* Take the opportunity to reset the flush iteration count */
+		ldip->di_flushiter = 0;
 	}
 
-	/* Take the opportunity to reset the flush iteration count */
-	ldip->di_flushiter = 0;
 
 	if (unlikely(S_ISREG(ldip->di_mode))) {
 		if ((ldip->di_format != XFS_DINODE_FMT_EXTENTS) &&