diff mbox series

[08/11] xfs: logging the on disk inode LSN can make it go backwards

Message ID 20210727071012.3358033-9-david@fromorbit.com (mailing list archive)
State Accepted
Headers show
Series xfs: fix log cache flush regressions and bugs | expand

Commit Message

Dave Chinner July 27, 2021, 7:10 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

When we log an inode, we format the "log inode" core and set an LSN
in that inode core. We do that via xfs_inode_item_format_core(),
which calls:

	xfs_inode_to_log_dinode(ip, dic, ip->i_itemp->ili_item.li_lsn);

to format the log inode. It writes the LSN from the inode item into
the log inode, and if recovery decides the inode item needs to be
replayed, it recovers the log inode LSN field and writes it into the
on disk inode LSN field.

Now this might seem like a reasonable thing to do, but it is wrong
on multiple levels. Firstly, if the item is not yet in the AIL,
item->li_lsn is zero. i.e. the first time the inode it is logged and
formatted, the LSN we write into the log inode will be zero. If we
only log it once, recovery will run and can write this zero LSN into
the inode.

This means that the next time the inode is logged and log recovery
runs, it will *always* replay changes to the inode regardless of
whether the inode is newer on disk than the version in the log and
that violates the entire purpose of recording the LSN in the inode
at writeback time (i.e. to stop it going backwards in time on disk
during recovery).

Secondly, if we commit the CIL to the journal so the inode item
moves to the AIL, and then relog the inode, the LSN that gets
stamped into the log inode will be the LSN of the inode's current
location in the AIL, not it's age on disk. And it's not the LSN that
will be associated with the current change. That means when log
recovery replays this inode item, the LSN that ends up on disk is
the LSN for the previous changes in the log, not the current
changes being replayed. IOWs, after recovery the LSN on disk is not
in sync with the LSN of the modifications that were replayed into
the inode. This, again, violates the recovery ordering semantics
that on-disk writeback LSNs provide.

Hence the inode LSN in the log dinode is -always- invalid.

Thirdly, recovery actually has the LSN of the log transaction it is
replaying right at hand - it uses it to determine if it should
replay the inode by comparing it to the on-disk inode's LSN. But it
doesn't use that LSN to stamp the LSN into the inode which will be
written back when the transaction is fully replayed. It uses the one
in the log dinode, which we know is always going to be incorrect.

Looking back at the change history, the inode logging was broken by
commit 93f958f9c41f ("xfs: cull unnecessary icdinode fields") way
back in 2016 by a stupid idiot who thought he knew how this code
worked. i.e. me. That commit replaced an in memory di_lsn field that
was updated only at inode writeback time from the inode item.li_lsn
value - and hence always contained the same LSN that appeared in the
on-disk inode - with a read of the inode item LSN at inode format
time. CLearly these are not the same thing.

Before 93f958f9c41f, the log recovery behaviour was irrelevant,
because the LSN in the log inode always matched the on-disk LSN at
the time the inode was logged, hence recovery of the transaction
would never make the on-disk LSN in the inode go backwards or get
out of sync.

A symptom of the problem is this, caught from a failure of
generic/482. Before log recovery, the inode has been allocated but
never used:

xfs_db> inode 393388
xfs_db> p
core.magic = 0x494e
core.mode = 0
....
v3.crc = 0x99126961 (correct)
v3.change_count = 0
v3.lsn = 0
v3.flags2 = 0
v3.cowextsize = 0
v3.crtime.sec = Thu Jan  1 10:00:00 1970
v3.crtime.nsec = 0

After log recovery:

xfs_db> p
core.magic = 0x494e
core.mode = 020444
....
v3.crc = 0x23e68f23 (correct)
v3.change_count = 2
v3.lsn = 0
v3.flags2 = 0
v3.cowextsize = 0
v3.crtime.sec = Thu Jul 22 17:03:03 2021
v3.crtime.nsec = 751000000
...

You can see that the LSN of the on-disk inode is 0, even though it
clearly has been written to disk. I point out this inode, because
the generic/482 failure occurred because several adjacent inodes in
this specific inode cluster were not replayed correctly and still
appeared to be zero on disk when all the other metadata (inobt,
finobt, directories, etc) indicated they should be allocated and
written back.

The fix for this is two-fold. The first is that we need to either
revert the LSN changes in 93f958f9c41f or stop logging the inode LSN
altogether. If we do the former, log recovery does not need to
change but we add 8 bytes of memory per inode to store what is
largely a write-only inode field. If we do the latter, log recovery
needs to stamp the on-disk inode in the same manner that inode
writeback does.

I prefer the latter, because we shouldn't really be trying to log
and replay changes to the on disk LSN as the on-disk value is the
canonical source of the on-disk version of the inode. It also
matches the way we recover buffer items - we create a buf_log_item
that carries the current recovery transaction LSN that gets stamped
into the buffer by the write verifier when it gets written back
when the transaction is fully recovered.

However, this might break log recovery on older kernels even more,
so I'm going to simply ignore the logged value in recovery and stamp
the on-disk inode with the LSN of the transaction being recovered
that will trigger writeback on transaction recovery completion. This
will ensure that the on-disk inode LSN always reflects the LSN of
the last change that was written to disk, regardless of whether it
comes from log recovery or runtime writeback.

Fixes: 93f958f9c41f ("xfs: cull unnecessary icdinode fields")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_log_format.h  | 11 +++++++++-
 fs/xfs/xfs_inode_item_recover.c | 39 ++++++++++++++++++++++++---------
 2 files changed, 39 insertions(+), 11 deletions(-)

Comments

Darrick J. Wong July 27, 2021, 7 p.m. UTC | #1
On Tue, Jul 27, 2021 at 05:10:09PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we log an inode, we format the "log inode" core and set an LSN
> in that inode core. We do that via xfs_inode_item_format_core(),
> which calls:
> 
> 	xfs_inode_to_log_dinode(ip, dic, ip->i_itemp->ili_item.li_lsn);
> 
> to format the log inode. It writes the LSN from the inode item into
> the log inode, and if recovery decides the inode item needs to be
> replayed, it recovers the log inode LSN field and writes it into the
> on disk inode LSN field.
> 
> Now this might seem like a reasonable thing to do, but it is wrong
> on multiple levels. Firstly, if the item is not yet in the AIL,
> item->li_lsn is zero. i.e. the first time the inode it is logged and
> formatted, the LSN we write into the log inode will be zero. If we
> only log it once, recovery will run and can write this zero LSN into
> the inode.
> 
> This means that the next time the inode is logged and log recovery
> runs, it will *always* replay changes to the inode regardless of
> whether the inode is newer on disk than the version in the log and
> that violates the entire purpose of recording the LSN in the inode
> at writeback time (i.e. to stop it going backwards in time on disk
> during recovery).
> 
> Secondly, if we commit the CIL to the journal so the inode item
> moves to the AIL, and then relog the inode, the LSN that gets
> stamped into the log inode will be the LSN of the inode's current
> location in the AIL, not it's age on disk. And it's not the LSN that
> will be associated with the current change. That means when log
> recovery replays this inode item, the LSN that ends up on disk is
> the LSN for the previous changes in the log, not the current
> changes being replayed. IOWs, after recovery the LSN on disk is not
> in sync with the LSN of the modifications that were replayed into
> the inode. This, again, violates the recovery ordering semantics
> that on-disk writeback LSNs provide.
> 
> Hence the inode LSN in the log dinode is -always- invalid.
> 
> Thirdly, recovery actually has the LSN of the log transaction it is
> replaying right at hand - it uses it to determine if it should
> replay the inode by comparing it to the on-disk inode's LSN. But it
> doesn't use that LSN to stamp the LSN into the inode which will be
> written back when the transaction is fully replayed. It uses the one
> in the log dinode, which we know is always going to be incorrect.
> 
> Looking back at the change history, the inode logging was broken by
> commit 93f958f9c41f ("xfs: cull unnecessary icdinode fields") way
> back in 2016 by a stupid idiot who thought he knew how this code
> worked. i.e. me. That commit replaced an in memory di_lsn field that
> was updated only at inode writeback time from the inode item.li_lsn
> value - and hence always contained the same LSN that appeared in the
> on-disk inode - with a read of the inode item LSN at inode format
> time. CLearly these are not the same thing.
> 
> Before 93f958f9c41f, the log recovery behaviour was irrelevant,
> because the LSN in the log inode always matched the on-disk LSN at
> the time the inode was logged, hence recovery of the transaction
> would never make the on-disk LSN in the inode go backwards or get
> out of sync.
> 
> A symptom of the problem is this, caught from a failure of
> generic/482. Before log recovery, the inode has been allocated but
> never used:
> 
> xfs_db> inode 393388
> xfs_db> p
> core.magic = 0x494e
> core.mode = 0
> ....
> v3.crc = 0x99126961 (correct)
> v3.change_count = 0
> v3.lsn = 0
> v3.flags2 = 0
> v3.cowextsize = 0
> v3.crtime.sec = Thu Jan  1 10:00:00 1970
> v3.crtime.nsec = 0
> 
> After log recovery:
> 
> xfs_db> p
> core.magic = 0x494e
> core.mode = 020444
> ....
> v3.crc = 0x23e68f23 (correct)
> v3.change_count = 2
> v3.lsn = 0
> v3.flags2 = 0
> v3.cowextsize = 0
> v3.crtime.sec = Thu Jul 22 17:03:03 2021
> v3.crtime.nsec = 751000000
> ...
> 
> You can see that the LSN of the on-disk inode is 0, even though it
> clearly has been written to disk. I point out this inode, because
> the generic/482 failure occurred because several adjacent inodes in
> this specific inode cluster were not replayed correctly and still
> appeared to be zero on disk when all the other metadata (inobt,
> finobt, directories, etc) indicated they should be allocated and
> written back.
> 
> The fix for this is two-fold. The first is that we need to either
> revert the LSN changes in 93f958f9c41f or stop logging the inode LSN
> altogether. If we do the former, log recovery does not need to
> change but we add 8 bytes of memory per inode to store what is
> largely a write-only inode field. If we do the latter, log recovery
> needs to stamp the on-disk inode in the same manner that inode
> writeback does.
> 
> I prefer the latter, because we shouldn't really be trying to log
> and replay changes to the on disk LSN as the on-disk value is the
> canonical source of the on-disk version of the inode. It also
> matches the way we recover buffer items - we create a buf_log_item
> that carries the current recovery transaction LSN that gets stamped
> into the buffer by the write verifier when it gets written back
> when the transaction is fully recovered.
> 
> However, this might break log recovery on older kernels even more,
> so I'm going to simply ignore the logged value in recovery and stamp
> the on-disk inode with the LSN of the transaction being recovered
> that will trigger writeback on transaction recovery completion. This
> will ensure that the on-disk inode LSN always reflects the LSN of
> the last change that was written to disk, regardless of whether it
> comes from log recovery or runtime writeback.
> 
> Fixes: 93f958f9c41f ("xfs: cull unnecessary icdinode fields")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_log_format.h  | 11 +++++++++-
>  fs/xfs/xfs_inode_item_recover.c | 39 ++++++++++++++++++++++++---------
>  2 files changed, 39 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> index d548ea4b6aab..2c5bcbc19264 100644
> --- a/fs/xfs/libxfs/xfs_log_format.h
> +++ b/fs/xfs/libxfs/xfs_log_format.h
> @@ -411,7 +411,16 @@ struct xfs_log_dinode {
>  	/* start of the extended dinode, writable fields */
>  	uint32_t	di_crc;		/* CRC of the inode */
>  	uint64_t	di_changecount;	/* number of attribute changes */
> -	xfs_lsn_t	di_lsn;		/* flush sequence */
> +
> +	/*
> +	 * The LSN we write to this field during formatting is not a reflection
> +	 * of the current on-disk LSN. It should never be used for recovery
> +	 * sequencing, nor should it be recovered into the on-disk inode at all.
> +	 * See xlog_recover_inode_commit_pass2() and xfs_log_dinode_to_disk()
> +	 * for details.
> +	 */
> +	xfs_lsn_t	di_lsn;

I wonder if we should change the name of this to di_lsn_DONOTUSE?  The
net effect of this patch is that xfs_log_dinode.di_lsn is written in the
same questionable manner that it always has been for the sake of not
breaking recovery on old kernels; but from now on we're never going to
read it ever again, not even in log recovery.

The AIL will flush the inode cluster (and stamp the /correct/ LSN into
the cluster buffer), and recovery will now compare the ondisk LSN
against the LSN of the recovered transaction to decide if it should skip
the update.

If I got that right,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D


> +
>  	uint64_t	di_flags2;	/* more random flags */
>  	uint32_t	di_cowextsize;	/* basic cow extent size for file */
>  	uint8_t		di_pad2[12];	/* more padding for future expansion */
> diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
> index 7b79518b6c20..e0072a6cd2d3 100644
> --- a/fs/xfs/xfs_inode_item_recover.c
> +++ b/fs/xfs/xfs_inode_item_recover.c
> @@ -145,7 +145,8 @@ xfs_log_dinode_to_disk_ts(
>  STATIC void
>  xfs_log_dinode_to_disk(
>  	struct xfs_log_dinode	*from,
> -	struct xfs_dinode	*to)
> +	struct xfs_dinode	*to,
> +	xfs_lsn_t		lsn)
>  {
>  	to->di_magic = cpu_to_be16(from->di_magic);
>  	to->di_mode = cpu_to_be16(from->di_mode);
> @@ -182,7 +183,7 @@ xfs_log_dinode_to_disk(
>  		to->di_flags2 = cpu_to_be64(from->di_flags2);
>  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
>  		to->di_ino = cpu_to_be64(from->di_ino);
> -		to->di_lsn = cpu_to_be64(from->di_lsn);
> +		to->di_lsn = cpu_to_be64(lsn);
>  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
>  		uuid_copy(&to->di_uuid, &from->di_uuid);
>  		to->di_flushiter = 0;
> @@ -261,16 +262,25 @@ xlog_recover_inode_commit_pass2(
>  	}
>  
>  	/*
> -	 * If the inode has an LSN in it, recover the inode only if it's less
> -	 * than the lsn of the transaction we are replaying. Note: we still
> -	 * need to replay an owner change even though the inode is more recent
> -	 * than the transaction as there is no guarantee that all the btree
> -	 * blocks are more recent than this transaction, too.
> +	 * If the inode has an LSN in it, recover the inode only if the on-disk
> +	 * inode's LSN is older than the lsn of the transaction we are
> +	 * replaying. We can have multiple checkpoints with the same start LSN,
> +	 * so the current LSN being equal to the on-disk LSN doesn't necessarily
> +	 * mean that the on-disk inode is more recent than the change being
> +	 * replayed.
> +	 *
> +	 * We must check the current_lsn against the on-disk inode
> +	 * here because the we can't trust the log dinode to contain a valid LSN
> +	 * (see comment below before replaying the log dinode for details).
> +	 *
> +	 * Note: we still need to replay an owner change even though the inode
> +	 * is more recent than the transaction as there is no guarantee that all
> +	 * the btree blocks are more recent than this transaction, too.
>  	 */
>  	if (dip->di_version >= 3) {
>  		xfs_lsn_t	lsn = be64_to_cpu(dip->di_lsn);
>  
> -		if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) >= 0) {
> +		if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) > 0) {
>  			trace_xfs_log_recover_inode_skip(log, in_f);
>  			error = 0;
>  			goto out_owner_change;
> @@ -368,8 +378,17 @@ xlog_recover_inode_commit_pass2(
>  		goto out_release;
>  	}
>  
> -	/* recover the log dinode inode into the on disk inode */
> -	xfs_log_dinode_to_disk(ldip, dip);
> +	/*
> +	 * Recover the log dinode inode into the on disk inode.
> +	 *
> +	 * The LSN in the log dinode is garbage - it can be zero or reflect
> +	 * stale in-memory runtime state that isn't coherent with the changes
> +	 * logged in this transaction or the changes written to the on-disk
> +	 * inode.  Hence we write the current lSN into the inode because that
> +	 * matches what xfs_iflush() would write inode the inode when flushing
> +	 * the changes in this transaction.
> +	 */
> +	xfs_log_dinode_to_disk(ldip, dip, current_lsn);
>  
>  	fields = in_f->ilf_fields;
>  	if (fields & XFS_ILOG_DEV)
> -- 
> 2.31.1
>
Chandan Babu R June 13, 2023, 6:20 a.m. UTC | #2
On Tue, Jul 27, 2021 at 05:10:09PM +1000, Dave Chinner wrote:

Hi Dave,

I am trying to backport this patch to Linux-v5.4 LTS kernel. I am unable to
understand as to how log recovery could overwrite a disk inode (holding newer
contents) with a Log inode (containing stale contents). Please refer to my
query below.

> From: Dave Chinner <dchinner@redhat.com>

> When we log an inode, we format the "log inode" core and set an LSN
> in that inode core. We do that via xfs_inode_item_format_core(),
> which calls:
>
> 	xfs_inode_to_log_dinode(ip, dic, ip->i_itemp->ili_item.li_lsn);

> to format the log inode. It writes the LSN from the inode item into
> the log inode, and if recovery decides the inode item needs to be
> replayed, it recovers the log inode LSN field and writes it into the
> on disk inode LSN field.

> Now this might seem like a reasonable thing to do, but it is wrong
> on multiple levels. Firstly, if the item is not yet in the AIL,
> item->li_lsn is zero. i.e. the first time the inode it is logged and
> formatted, the LSN we write into the log inode will be zero. If we
> only log it once, recovery will run and can write this zero LSN into
> the inode.

Assume that the following is the state before the inode is written back,
Disk inode LSN = x
Log inode LSN = 0
Transaction LSN = x + 1

At this point, if the filesystem shuts down abruptly, log recovery could
change the disk inode's LSN to zero.

> This means that the next time the inode is logged and log recovery
> runs, it will *always* replay changes to the inode regardless of
> whether the inode is newer on disk than the version in the log and
> that violates the entire purpose of recording the LSN in the inode
> at writeback time (i.e. to stop it going backwards in time on disk
> during recovery).

After the log recovery indicated by me above, if the filesystem modifies the
inode then the following is the state of metadata in memory and on disk,

Disk inode LSN = 0 (Due to changes made by log recovery during mount)
Log inode LSN = 0 (xfs_log_item->li_lsn is zero until the log item is moved to
                   the AIL in the current mount cycle)
Transaction LSN = x + 2

Now, if the filesystem shuts down abruptly once again, log recovery replays
the contents of the Log dinode since Disk inode's LSN is less than
transaction's LSN. In this example, the Log inode's contents were newer than
the disk inode.

Your description suggests that there could be a scenario where a Log inode
holding stale content can still overwrite the contents of the Disk inode
holding newer content. I am unable to come with an example of how that could
happen. Could please explain this to me.

> Secondly, if we commit the CIL to the journal so the inode item
> moves to the AIL, and then relog the inode, the LSN that gets
> stamped into the log inode will be the LSN of the inode's current
> location in the AIL, not it's age on disk. And it's not the LSN that
> will be associated with the current change. That means when log
> recovery replays this inode item, the LSN that ends up on disk is
> the LSN for the previous changes in the log, not the current
> changes being replayed. IOWs, after recovery the LSN on disk is not
> in sync with the LSN of the modifications that were replayed into
> the inode. This, again, violates the recovery ordering semantics
> that on-disk writeback LSNs provide.

> Hence the inode LSN in the log dinode is -always- invalid.

> Thirdly, recovery actually has the LSN of the log transaction it is
> replaying right at hand - it uses it to determine if it should
> replay the inode by comparing it to the on-disk inode's LSN. But it
> doesn't use that LSN to stamp the LSN into the inode which will be
> written back when the transaction is fully replayed. It uses the one
> in the log dinode, which we know is always going to be incorrect.

> Looking back at the change history, the inode logging was broken by
> commit 93f958f9c41f ("xfs: cull unnecessary icdinode fields") way
> back in 2016 by a stupid idiot who thought he knew how this code
> worked. i.e. me. That commit replaced an in memory di_lsn field that
> was updated only at inode writeback time from the inode item.li_lsn
> value - and hence always contained the same LSN that appeared in the
> on-disk inode - with a read of the inode item LSN at inode format
> time. CLearly these are not the same thing.

> Before 93f958f9c41f, the log recovery behaviour was irrelevant,
> because the LSN in the log inode always matched the on-disk LSN at
> the time the inode was logged, hence recovery of the transaction
> would never make the on-disk LSN in the inode go backwards or get
> out of sync.

> A symptom of the problem is this, caught from a failure of
> generic/482. Before log recovery, the inode has been allocated but
> never used:

> xfs_db> inode 393388
> xfs_db> p
> core.magic = 0x494e
> core.mode = 0
> ....
> v3.crc = 0x99126961 (correct)
> v3.change_count = 0
> v3.lsn = 0
> v3.flags2 = 0
> v3.cowextsize = 0
> v3.crtime.sec = Thu Jan  1 10:00:00 1970
> v3.crtime.nsec = 0

> After log recovery:

> xfs_db> p
> core.magic = 0x494e
> core.mode = 020444
> ....
> v3.crc = 0x23e68f23 (correct)
> v3.change_count = 2
> v3.lsn = 0
> v3.flags2 = 0
> v3.cowextsize = 0
> v3.crtime.sec = Thu Jul 22 17:03:03 2021
> v3.crtime.nsec = 751000000
> ...

> You can see that the LSN of the on-disk inode is 0, even though it
> clearly has been written to disk. I point out this inode, because
> the generic/482 failure occurred because several adjacent inodes in
> this specific inode cluster were not replayed correctly and still
> appeared to be zero on disk when all the other metadata (inobt,
> finobt, directories, etc) indicated they should be allocated and
> written back.

> The fix for this is two-fold. The first is that we need to either
> revert the LSN changes in 93f958f9c41f or stop logging the inode LSN
> altogether. If we do the former, log recovery does not need to
> change but we add 8 bytes of memory per inode to store what is
> largely a write-only inode field. If we do the latter, log recovery
> needs to stamp the on-disk inode in the same manner that inode
> writeback does.

> I prefer the latter, because we shouldn't really be trying to log
> and replay changes to the on disk LSN as the on-disk value is the
> canonical source of the on-disk version of the inode. It also
> matches the way we recover buffer items - we create a buf_log_item
> that carries the current recovery transaction LSN that gets stamped
> into the buffer by the write verifier when it gets written back
> when the transaction is fully recovered.

> However, this might break log recovery on older kernels even more,
> so I'm going to simply ignore the logged value in recovery and stamp
> the on-disk inode with the LSN of the transaction being recovered
> that will trigger writeback on transaction recovery completion. This
> will ensure that the on-disk inode LSN always reflects the LSN of
> the last change that was written to disk, regardless of whether it
> comes from log recovery or runtime writeback.

> Fixes: 93f958f9c41f ("xfs: cull unnecessary icdinode fields")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_log_format.h  | 11 +++++++++-
>  fs/xfs/xfs_inode_item_recover.c | 39 ++++++++++++++++++++++++---------
>  2 files changed, 39 insertions(+), 11 deletions(-)

> diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> index d548ea4b6aab..2c5bcbc19264 100644
> --- a/fs/xfs/libxfs/xfs_log_format.h
> +++ b/fs/xfs/libxfs/xfs_log_format.h
> @@ -411,7 +411,16 @@ struct xfs_log_dinode {
>  	/* start of the extended dinode, writable fields */
>  	uint32_t	di_crc;		/* CRC of the inode */
>  	uint64_t	di_changecount;	/* number of attribute changes */
> -	xfs_lsn_t	di_lsn;		/* flush sequence */
> +
> +	/*
> +	 * The LSN we write to this field during formatting is not a reflection
> +	 * of the current on-disk LSN. It should never be used for recovery
> +	 * sequencing, nor should it be recovered into the on-disk inode at all.
> +	 * See xlog_recover_inode_commit_pass2() and xfs_log_dinode_to_disk()
> +	 * for details.
> +	 */
> +	xfs_lsn_t	di_lsn;
> +
>  	uint64_t	di_flags2;	/* more random flags */
>  	uint32_t	di_cowextsize;	/* basic cow extent size for file */
>  	uint8_t		di_pad2[12];	/* more padding for future expansion */
> diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
> index 7b79518b6c20..e0072a6cd2d3 100644
> --- a/fs/xfs/xfs_inode_item_recover.c
> +++ b/fs/xfs/xfs_inode_item_recover.c
> @@ -145,7 +145,8 @@ xfs_log_dinode_to_disk_ts(
>  STATIC void
>  xfs_log_dinode_to_disk(
>  	struct xfs_log_dinode	*from,
> -	struct xfs_dinode	*to)
> +	struct xfs_dinode	*to,
> +	xfs_lsn_t		lsn)
>  {
>  	to->di_magic = cpu_to_be16(from->di_magic);
>  	to->di_mode = cpu_to_be16(from->di_mode);
> @@ -182,7 +183,7 @@ xfs_log_dinode_to_disk(
>  		to->di_flags2 = cpu_to_be64(from->di_flags2);
>  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
>  		to->di_ino = cpu_to_be64(from->di_ino);
> -		to->di_lsn = cpu_to_be64(from->di_lsn);
> +		to->di_lsn = cpu_to_be64(lsn);
>  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
>  		uuid_copy(&to->di_uuid, &from->di_uuid);
>  		to->di_flushiter = 0;
> @@ -261,16 +262,25 @@ xlog_recover_inode_commit_pass2(
>  	}
 
>  	/*
> -	 * If the inode has an LSN in it, recover the inode only if it's less
> -	 * than the lsn of the transaction we are replaying. Note: we still
> -	 * need to replay an owner change even though the inode is more recent
> -	 * than the transaction as there is no guarantee that all the btree
> -	 * blocks are more recent than this transaction, too.
> +	 * If the inode has an LSN in it, recover the inode only if the on-disk
> +	 * inode's LSN is older than the lsn of the transaction we are
> +	 * replaying. We can have multiple checkpoints with the same start LSN,
> +	 * so the current LSN being equal to the on-disk LSN doesn't necessarily
> +	 * mean that the on-disk inode is more recent than the change being
> +	 * replayed.
> +	 *
> +	 * We must check the current_lsn against the on-disk inode
> +	 * here because the we can't trust the log dinode to contain a valid LSN
> +	 * (see comment below before replaying the log dinode for details).
> +	 *
> +	 * Note: we still need to replay an owner change even though the inode
> +	 * is more recent than the transaction as there is no guarantee that all
> +	 * the btree blocks are more recent than this transaction, too.
>  	 */
>  	if (dip->di_version >= 3) {
>  		xfs_lsn_t	lsn = be64_to_cpu(dip->di_lsn);
 
> -		if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) >= 0) {
> +		if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) > 0) {
>  			trace_xfs_log_recover_inode_skip(log, in_f);
>  			error = 0;
>  			goto out_owner_change;
> @@ -368,8 +378,17 @@ xlog_recover_inode_commit_pass2(
>  		goto out_release;
>  	}
 
> -	/* recover the log dinode inode into the on disk inode */
> -	xfs_log_dinode_to_disk(ldip, dip);
> +	/*
> +	 * Recover the log dinode inode into the on disk inode.
> +	 *
> +	 * The LSN in the log dinode is garbage - it can be zero or reflect
> +	 * stale in-memory runtime state that isn't coherent with the changes
> +	 * logged in this transaction or the changes written to the on-disk
> +	 * inode.  Hence we write the current lSN into the inode because that
> +	 * matches what xfs_iflush() would write inode the inode when flushing
> +	 * the changes in this transaction.
> +	 */
> +	xfs_log_dinode_to_disk(ldip, dip, current_lsn);
 
>  	fields = in_f->ilf_fields;
>  	if (fields & XFS_ILOG_DEV)
> -- 
> 2.31.17
Dave Chinner June 14, 2023, 9:13 a.m. UTC | #3
On Tue, Jun 13, 2023 at 11:50:03AM +0530, Chandan Babu R wrote:
> On Tue, Jul 27, 2021 at 05:10:09PM +1000, Dave Chinner wrote:
> 
> Hi Dave,
> 
> I am trying to backport this patch to Linux-v5.4 LTS kernel. I am unable to
> understand as to how log recovery could overwrite a disk inode (holding newer
> contents) with a Log inode (containing stale contents). Please refer to my
> query below.

I've paged most of this out of my brain.

> 
> > From: Dave Chinner <dchinner@redhat.com>
> 
> > When we log an inode, we format the "log inode" core and set an LSN
> > in that inode core. We do that via xfs_inode_item_format_core(),
> > which calls:
> >
> > 	xfs_inode_to_log_dinode(ip, dic, ip->i_itemp->ili_item.li_lsn);
> 
> > to format the log inode. It writes the LSN from the inode item into
> > the log inode, and if recovery decides the inode item needs to be
> > replayed, it recovers the log inode LSN field and writes it into the
> > on disk inode LSN field.
> 
> > Now this might seem like a reasonable thing to do, but it is wrong
> > on multiple levels. Firstly, if the item is not yet in the AIL,
> > item->li_lsn is zero. i.e. the first time the inode it is logged and
> > formatted, the LSN we write into the log inode will be zero. If we
> > only log it once, recovery will run and can write this zero LSN into
> > the inode.
> 
> Assume that the following is the state before the inode is written back,
> Disk inode LSN = x
> Log inode LSN = 0
> Transaction LSN = x + 1
> 
> At this point, if the filesystem shuts down abruptly, log recovery could
> change the disk inode's LSN to zero.

Yes, that happened when we first logged a clean inode, so it was
dirty in memory but hadn't been placed in the AIL.

> > This means that the next time the inode is logged and log recovery
> > runs, it will *always* replay changes to the inode regardless of
> > whether the inode is newer on disk than the version in the log and
> > that violates the entire purpose of recording the LSN in the inode
> > at writeback time (i.e. to stop it going backwards in time on disk
> > during recovery).
> 
> After the log recovery indicated by me above, if the filesystem modifies the
> inode then the following is the state of metadata in memory and on disk,
> 
> Disk inode LSN = 0 (Due to changes made by log recovery during mount)
> Log inode LSN = 0 (xfs_log_item->li_lsn is zero until the log item is moved to
>                    the AIL in the current mount cycle)
> Transaction LSN = x + 2
>
> Now, if the filesystem shuts down abruptly once again, log recovery replays
> the contents of the Log dinode since Disk inode's LSN is less than
> transaction's LSN. In this example, the Log inode's contents were newer than
> the disk inode.

I'm not sure about the log shutdown aspects you talk about - the
test that exposed the issue wasn't doing shutdowns.  The test that
exposed this problem was generic/482, which uses dm-logwrites to
replay the filesystem up to each FUA operation, at which point it
mounts it to perform recovery and then runs reapir to check it for
consistency.

IOWs, it performs "recovery at every checkpoint in the log"
independently to determine if log recovery is doing the write
things. It tripped over recovery failing to replay inode changes
due to log/on-disk inode lsn inconsistencies like the following:

Checkpoint	log inode lsn	disk inode lsn		recovery disk inode lsn
1		0			0		0 (wrong!)
2		1			0		1 (wrong!)
3		2			0		2 (wrong!)
		<inode writeback>
					3		2 -should be 3-
		<inode reclaimed>
4		0 			3		-wont get replayed-
							-wrong!-
							-corruption!-

5		4			3		4 (wrong!)
6		5			3		5 (wrong!)
.....
25		6			3		6 (wrong!)
		<inode writeback>
					25		6 -should be 25-
26		25			25		25 -wrong!-


Do you see how the log inode LSN does not match up with the actual
LSN that is written into the inode at writeback time?

Do you see how the log inode LSN could magically go zero in the
middle of the log (Ckpt 4)? That's going to cause the inode log item
to be skipped, when it should have been replayed. That's a
corruption event.

Then if we look at checkpoint 26, where the on-disk inode LSN is 25,
and the log inode LSN is 25 - that log inode should not get replayed
because the LSNs are the same, but the old code did perform replay
and so this bug largely masked the typical off-by-one checkpoint the
log inode lsn contained.

> Your description suggests that there could be a scenario where a Log inode
> holding stale content can still overwrite the contents of the Disk inode
> holding newer content. I am unable to come with an example of how that could
> happen. Could please explain this to me.

I can't really remember. There are lots of ways using the wrong LSN
for recovery can go wrong; I was really just describing stuff I saw
in the test failures.....

> > Secondly, if we commit the CIL to the journal so the inode item
> > moves to the AIL, and then relog the inode, the LSN that gets
> > stamped into the log inode will be the LSN of the inode's current
> > location in the AIL, not it's age on disk. And it's not the LSN that
> > will be associated with the current change. That means when log
> > recovery replays this inode item, the LSN that ends up on disk is
> > the LSN for the previous changes in the log, not the current
> > changes being replayed. IOWs, after recovery the LSN on disk is not
> > in sync with the LSN of the modifications that were replayed into
> > the inode. This, again, violates the recovery ordering semantics
> > that on-disk writeback LSNs provide.

... and this is clearly evident in the example I give above.

> 
> > Hence the inode LSN in the log dinode is -always- invalid.
> 
> > Thirdly, recovery actually has the LSN of the log transaction it is
> > replaying right at hand - it uses it to determine if it should
> > replay the inode by comparing it to the on-disk inode's LSN. But it
> > doesn't use that LSN to stamp the LSN into the inode which will be
> > written back when the transaction is fully replayed. It uses the one
> > in the log dinode, which we know is always going to be incorrect.

This is also demonstrated in the example I give above.

Really, I don't remember any of the finer details of this fix now;
it was a long time ago and I don't really have the time to go back
an reconstruct it from first principles right now.

-Dave.
diff mbox series

Patch

diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index d548ea4b6aab..2c5bcbc19264 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -411,7 +411,16 @@  struct xfs_log_dinode {
 	/* start of the extended dinode, writable fields */
 	uint32_t	di_crc;		/* CRC of the inode */
 	uint64_t	di_changecount;	/* number of attribute changes */
-	xfs_lsn_t	di_lsn;		/* flush sequence */
+
+	/*
+	 * The LSN we write to this field during formatting is not a reflection
+	 * of the current on-disk LSN. It should never be used for recovery
+	 * sequencing, nor should it be recovered into the on-disk inode at all.
+	 * See xlog_recover_inode_commit_pass2() and xfs_log_dinode_to_disk()
+	 * for details.
+	 */
+	xfs_lsn_t	di_lsn;
+
 	uint64_t	di_flags2;	/* more random flags */
 	uint32_t	di_cowextsize;	/* basic cow extent size for file */
 	uint8_t		di_pad2[12];	/* more padding for future expansion */
diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
index 7b79518b6c20..e0072a6cd2d3 100644
--- a/fs/xfs/xfs_inode_item_recover.c
+++ b/fs/xfs/xfs_inode_item_recover.c
@@ -145,7 +145,8 @@  xfs_log_dinode_to_disk_ts(
 STATIC void
 xfs_log_dinode_to_disk(
 	struct xfs_log_dinode	*from,
-	struct xfs_dinode	*to)
+	struct xfs_dinode	*to,
+	xfs_lsn_t		lsn)
 {
 	to->di_magic = cpu_to_be16(from->di_magic);
 	to->di_mode = cpu_to_be16(from->di_mode);
@@ -182,7 +183,7 @@  xfs_log_dinode_to_disk(
 		to->di_flags2 = cpu_to_be64(from->di_flags2);
 		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
 		to->di_ino = cpu_to_be64(from->di_ino);
-		to->di_lsn = cpu_to_be64(from->di_lsn);
+		to->di_lsn = cpu_to_be64(lsn);
 		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
 		uuid_copy(&to->di_uuid, &from->di_uuid);
 		to->di_flushiter = 0;
@@ -261,16 +262,25 @@  xlog_recover_inode_commit_pass2(
 	}
 
 	/*
-	 * If the inode has an LSN in it, recover the inode only if it's less
-	 * than the lsn of the transaction we are replaying. Note: we still
-	 * need to replay an owner change even though the inode is more recent
-	 * than the transaction as there is no guarantee that all the btree
-	 * blocks are more recent than this transaction, too.
+	 * If the inode has an LSN in it, recover the inode only if the on-disk
+	 * inode's LSN is older than the lsn of the transaction we are
+	 * replaying. We can have multiple checkpoints with the same start LSN,
+	 * so the current LSN being equal to the on-disk LSN doesn't necessarily
+	 * mean that the on-disk inode is more recent than the change being
+	 * replayed.
+	 *
+	 * We must check the current_lsn against the on-disk inode
+	 * here because the we can't trust the log dinode to contain a valid LSN
+	 * (see comment below before replaying the log dinode for details).
+	 *
+	 * Note: we still need to replay an owner change even though the inode
+	 * is more recent than the transaction as there is no guarantee that all
+	 * the btree blocks are more recent than this transaction, too.
 	 */
 	if (dip->di_version >= 3) {
 		xfs_lsn_t	lsn = be64_to_cpu(dip->di_lsn);
 
-		if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) >= 0) {
+		if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) > 0) {
 			trace_xfs_log_recover_inode_skip(log, in_f);
 			error = 0;
 			goto out_owner_change;
@@ -368,8 +378,17 @@  xlog_recover_inode_commit_pass2(
 		goto out_release;
 	}
 
-	/* recover the log dinode inode into the on disk inode */
-	xfs_log_dinode_to_disk(ldip, dip);
+	/*
+	 * Recover the log dinode inode into the on disk inode.
+	 *
+	 * The LSN in the log dinode is garbage - it can be zero or reflect
+	 * stale in-memory runtime state that isn't coherent with the changes
+	 * logged in this transaction or the changes written to the on-disk
+	 * inode.  Hence we write the current lSN into the inode because that
+	 * matches what xfs_iflush() would write inode the inode when flushing
+	 * the changes in this transaction.
+	 */
+	xfs_log_dinode_to_disk(ldip, dip, current_lsn);
 
 	fields = in_f->ilf_fields;
 	if (fields & XFS_ILOG_DEV)