[09/10] xfs: Enforce attr3 buffer recovery order

Message ID 20210726060716.3295008-10-david@fromorbit.com
State Superseded
Series xfs: fix log cache flush regressions and bugs

Commit Message

Dave Chinner July 26, 2021, 6:07 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

From the department of "WTAF? How did we miss that!?"...

When we are recovering a buffer, the first thing we do is check the
buffer magic number and extract the LSN from the buffer. If the LSN
on disk is older than the LSN of the transaction being replayed, we
replay the modification to the buffer. If the metadata on disk is
newer than the transaction in the log, we skip it. This is
fundamental v5 filesystem metadata recovery behaviour.
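
For reference, the skip decision during buffer replay looks roughly
like this (a simplified sketch of the pass 2 logic, not the exact
kernel code):

	/*
	 * current_lsn is the LSN of the checkpoint being recovered;
	 * xlog_recover_get_buf_lsn() returns the LSN stamped into the
	 * on-disk buffer, or -1 if it doesn't recognise the buffer.
	 */
	lsn = xlog_recover_get_buf_lsn(mp, bp);
	if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) >= 0) {
		/* on-disk metadata is as new or newer - skip replay */
		goto out_release;
	}
	/* otherwise replay the logged changes into the buffer */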

generic/482 failed with an attribute writeback failure during log
recovery. The write verifier caught the corruption before it got
written to disk, and the attr buffer dump looked like:

XFS (dm-3): Metadata corruption detected at xfs_attr3_leaf_verify+0x275/0x2e0, xfs_attr3_leaf block 0x19be8
XFS (dm-3): Unmount and run xfs_repair
XFS (dm-3): First 128 bytes of corrupted metadata buffer:
00000000: 00 00 00 00 00 00 00 00 3b ee 00 00 4d 2a 01 e1  ........;...M*..
00000010: 00 00 00 00 00 01 9b e8 00 00 00 01 00 00 05 38  ...............8
                                  ^^^^^^^^^^^^^^^^^^^^^^^
00000020: df 39 5e 51 58 ac 44 b6 8d c5 e7 10 44 09 bc 17  .9^QX.D.....D...
00000030: 00 00 00 00 00 02 00 83 00 03 00 cc 0f 24 01 00  .............$..
00000040: 00 68 0e bc 0f c8 00 10 00 00 00 00 00 00 00 00  .h..............
00000050: 00 00 3c 31 0f 24 01 00 00 00 3c 32 0f 88 01 00  ..<1.$....<2....
00000060: 00 00 3c 33 0f d8 01 00 00 00 00 00 00 00 00 00  ..<3............
00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
.....

The highlighted bytes are the LSN that was replayed into the
buffer: 0x100000538. This is cycle 1, block 0x538. Prior to replay,
that block on disk looks like this:

$ sudo xfs_db -c "fsb 0x417d" -c "type attr3" -c p /dev/mapper/thin-vol
hdr.info.hdr.forw = 0
hdr.info.hdr.back = 0
hdr.info.hdr.magic = 0x3bee
hdr.info.crc = 0xb5af0bc6 (correct)
hdr.info.bno = 105448
hdr.info.lsn = 0x100000900
               ^^^^^^^^^^^
hdr.info.uuid = df395e51-58ac-44b6-8dc5-e7104409bc17
hdr.info.owner = 131203
hdr.count = 2
hdr.usedbytes = 120
hdr.firstused = 3796
hdr.holes = 1
hdr.freemap[0-2] = [base,size]

Note the LSN stamped into the buffer on disk: 1/0x900. The version
on disk is much newer than the log transaction that was being
replayed. That's a bug, and should -never- happen.
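
For reference, the LSN lives in the common v5 da block header, so
both the hexdump and the xfs_db output above decode against roughly
this layout (byte offsets from the start of the block):

	struct xfs_da3_blkinfo {
		struct xfs_da_blkinfo	hdr;	/* 0x00: forw, back, magic, pad */
		__be32			crc;	/* 0x0c */
		__be64			blkno;	/* 0x10: 0x19be8 above */
		__be64			lsn;	/* 0x18: 0x100000538 vs 0x100000900 */
		uuid_t			uuid;	/* 0x20 */
		__be64			owner;	/* 0x30: 131203 above */
	};

	/* an LSN packs the log cycle and block number together */
	cycle = CYCLE_LSN(lsn);		/* upper 32 bits: 1 in both cases */
	block = BLOCK_LSN(lsn);		/* lower 32 bits: 0x538 vs 0x900 */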

So I immediately went to look at xlog_recover_get_buf_lsn() to check
that we handled the LSN correctly. I was wondering if there was a
similar "two commits with the same start LSN skips the second
replay" problem with buffers. I didn't get that far, because I found
a much more basic, rudimentary bug: xlog_recover_get_buf_lsn()
doesn't recognise buffers with XFS_ATTR3_LEAF_MAGIC set in them!!!

IOWs, attr3 leaf buffers fall through the magic number checks
unrecognised, so they trigger the "recover immediately" behaviour
instead of undergoing an LSN check. Hence we unconditionally replay
attr3 leaf buffers, and that causes silent on-disk corruption of
inode attribute forks and potentially other things....
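
That is, in xlog_recover_get_buf_lsn() the attr3 leaf magic ends up
in the default case and we fall out with no LSN, roughly:

	switch (magicda) {
	case XFS_DIR3_LEAF1_MAGIC:
	case XFS_DIR3_LEAFN_MAGIC:
	case XFS_DA3_NODE_MAGIC:
		lsn = be64_to_cpu(((struct xfs_da3_blkinfo *)blk)->lsn);
		uuid = &((struct xfs_da3_blkinfo *)blk)->uuid;
		break;
	default:
		/* XFS_ATTR3_LEAF_MAGIC lands here and lsn stays -1 */
		break;
	}
	...
	/* -1 tells the caller "unknown buffer, always replay it" */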

Git history shows this is *another* zero day bug, this time
introduced in commit 50d5c8d8e938 ("xfs: check LSN ordering for v5
superblocks during recovery") which failed to handle the attr3 leaf
buffers in recovery. And we've failed to handle them ever since...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf_item_recover.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Christoph Hellwig July 26, 2021, 7:35 a.m. UTC | #1
Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong July 26, 2021, 5:57 p.m. UTC | #2
On Mon, Jul 26, 2021 at 04:07:15PM +1000, Dave Chinner wrote:
> [...]
> 
> IOWs, attr3 leaf buffers fall through the magic number checks
> unrecognised, so they trigger the "recover immediately" behaviour
> instead of undergoing an LSN check. Hence we unconditionally replay
> attr3 leaf buffers, and that causes silent on-disk corruption of
> inode attribute forks and potentially other things....
> 
> Git history shows this is *another* zero day bug, this time
> introduced in commit 50d5c8d8e938 ("xfs: check LSN ordering for v5
> superblocks during recovery") which failed to handle the attr3 leaf
> buffers in recovery. And we've failed to handle them ever since...

I wonder, what happens if we happen to have a rt bitmap block where a
sparse allocation pattern at the start of the rt device just happens to
match one of these magic numbers + fs UUID?  Does that imply that log
recovery can be tricked into forgetting to replay rtbitmap blocks?

> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks good,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

Dave Chinner July 26, 2021, 9:52 p.m. UTC | #3
On Mon, Jul 26, 2021 at 10:57:01AM -0700, Darrick J. Wong wrote:
> On Mon, Jul 26, 2021 at 04:07:15PM +1000, Dave Chinner wrote:
> > [...]
> 
> I wonder, what happens if we happen to have a rt bitmap block where a
> sparse allocation pattern at the start of the rt device just happens to
> match one of these magic numbers + fs UUID?  Does that imply that log
> recovery can be tricked into forgetting to replay rtbitmap blocks?

Possibly. RT bitmap/summary buffers are marked by type in the
xfs_buf_log_format type field so log recovery can recognise these
and do the right thing with them. So it really comes down to whether
log recovery handles XFS_BLFT_RTBITMAP_BUF types differently to any
other buffers. Which, without looking at the code, I doubt it does,
so there's probably fixes needed there, too...
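
i.e. the tagging and the recovery-side lookup are roughly:

	/* at transaction time the buffer is tagged with its type: */
	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_RTBITMAP_BUF);

	/* and recovery can pull it back out of the replayed log item: */
	blft = xfs_blft_from_flags(buf_f);
	if (blft == XFS_BLFT_RTBITMAP_BUF) {
		/* ...rtbitmap buffers could be special-cased here... */
	}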

Cheers,

Dave.
Darrick J. Wong July 26, 2021, 10:34 p.m. UTC | #4
On Tue, Jul 27, 2021 at 07:52:21AM +1000, Dave Chinner wrote:
> On Mon, Jul 26, 2021 at 10:57:01AM -0700, Darrick J. Wong wrote:
> > [...]
> > 
> > I wonder, what happens if we happen to have a rt bitmap block where a
> > sparse allocation pattern at the start of the rt device just happens to
> > match one of these magic numbers + fs UUID?  Does that imply that log
> > recovery can be tricked into forgetting to replay rtbitmap blocks?
> 
> Possibly. RT bitmap/summary buffers are marked by type in the
> xfs_buf_log_format type field so log recovery can recognise these
> and do the right thing with them. So it really comes down to whether
> log recovery handles XFS_BLFT_RTBITMAP_BUF types differently to any
> other buffers. Which, without looking at the code, I doubt it does,
> so there's probably fixes needed there, too...

It handles them the same as every other buffer, which is to say that I
think we've found another recovery zeroday.

xlog_recover_buf_commit_pass2 reads the ondisk buffer, and then calls
xlog_recover_get_buf_lsn to fish the LSN out of the ondisk buffer.  That
second function doesn't corroborate the ondisk magic with the XFS_BLFT_*
flags recovered from the buffer item, so if the log item was for an rt
bitmap block and the user controls the rt layout as I describe above,
they can totally screw up log recovery.

Only after we return a garbage LSN do we call xlog_recover_do_reg_buffer
-> xlog_recover_validate_buf_type and look at the buf_f flags to attach
verifier ops, but by then it's too late to undo the damage.
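
i.e. the current ordering is roughly:

	lsn = xlog_recover_get_buf_lsn(mp, bp);	/* trusts the on-disk magic */
	if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) >= 0)
		goto out_release;		/* may skip based on a lie */

	xlog_recover_do_reg_buffer(...);
		/* -> xlog_recover_validate_buf_type() checks buf_f, too late */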

I think the answer is to combine the two functions so that we check the
BLFT and the ondisk magic.  If they match, we can set b_ops and return
the ondisk LSN and then decide if we're really going to replay the bli
contents.  If they don't match, I guess we recover the whole bli?  Or
abort?  I'll try to get to that after $meetings.
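
As a rough sketch of the idea (hypothetical, the helper below doesn't
exist yet):

	blft = xfs_blft_from_flags(buf_f);
	magic32 = be32_to_cpu(*(__be32 *)bp->b_addr);

	if (!xlog_blft_matches_magic(blft, magic32)) {
		/*
		 * Hypothetical helper: the logged buffer type and the
		 * on-disk magic disagree, so don't trust the on-disk
		 * LSN. Recover (or abort) rather than silently skip.
		 */
		return -1;
	}
	/* otherwise attach b_ops and return the on-disk LSN as before */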

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Patch

diff --git a/fs/xfs/xfs_buf_item_recover.c b/fs/xfs/xfs_buf_item_recover.c
index d44e8b4a3391..05fd816edf59 100644
--- a/fs/xfs/xfs_buf_item_recover.c
+++ b/fs/xfs/xfs_buf_item_recover.c
@@ -796,6 +796,7 @@ xlog_recover_get_buf_lsn(
 	switch (magicda) {
 	case XFS_DIR3_LEAF1_MAGIC:
 	case XFS_DIR3_LEAFN_MAGIC:
+	case XFS_ATTR3_LEAF_MAGIC:
 	case XFS_DA3_NODE_MAGIC:
 		lsn = be64_to_cpu(((struct xfs_da3_blkinfo *)blk)->lsn);
 		uuid = &((struct xfs_da3_blkinfo *)blk)->uuid;