[v3] xfs: introduce protection for drop nlink

Message ID	202309131744458239465@zte.com.cn (mailing list archive)
State	Superseded, archived
Headers	show Return-Path: <linux-xfs-owner@vger.kernel.org> Date: Wed, 13 Sep 2023 17:44:45 +0800 (CST) Message-ID: <202309131744458239465@zte.com.cn> Mime-Version: 1.0 From: <cheng.lin130@zte.com.cn> To: <david@fromorbit.com>, <djwong@kernel.org> Cc: <linux-xfs@vger.kernel.org>, <linux-kernel@vger.kernel.org>, <jiang.yong5@zte.com.cn>, <wang.liang82@zte.com.cn>, <liu.dong3@zte.com.cn> Subject: =?utf-8?q?=5BPATCH_v3=5D_xfs=3A_introduce_protection_for_drop_nlink?= Content-Type: text/plain; charset="UTF-8" Precedence: bulk
Series	[v3] xfs: introduce protection for drop nlink \| expand [v3] xfs: introduce protection for drop nlink

Cheng Lin Sept. 13, 2023, 9:44 a.m. UTC

From: Cheng Lin <cheng.lin130@zte.com.cn>

When abnormal drop_nlink are detected on the inode,
shutdown filesystem, to avoid corruption propagation.

Signed-off-by: Cheng Lin <cheng.lin130@zte.com.cn>
---
 fs/xfs/xfs_inode.c | 9 +++++++++
 1 file changed, 9 insertions(+)

Dave Chinner Sept. 13, 2023, 11:42 p.m. UTC | #1

On Wed, Sep 13, 2023 at 05:44:45PM +0800, cheng.lin130@zte.com.cn wrote:
> From: Cheng Lin <cheng.lin130@zte.com.cn>
> 
> When abnormal drop_nlink are detected on the inode,
> shutdown filesystem, to avoid corruption propagation.
> 
> Signed-off-by: Cheng Lin <cheng.lin130@zte.com.cn>
> ---
>  fs/xfs/xfs_inode.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 9e62cc500..40cc106ae 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -919,6 +919,15 @@ xfs_droplink(
>  	xfs_trans_t *tp,
>  	xfs_inode_t *ip)
>  {
> +
> +	if (VFS_I(ip)->i_nlink == 0) {
> +		xfs_alert(ip->i_mount,
> +			  "%s: Deleting inode %llu with no links.",
> +			  __func__, ip->i_ino);
> +		tp->t_flags |= XFS_TRANS_DIRTY;

Marking the transaction dirty is not necessary.

Otherwise this seems fine.

-Dave.

Cheng Lin Sept. 15, 2023, 8:20 a.m. UTC | #2

> On Wed, Sep 13, 2023 at 05:44:45PM +0800, cheng.lin130@zte.com.cn wrote:
> > From: Cheng Lin <cheng.lin130@zte.com.cn>
> >
> > When abnormal drop_nlink are detected on the inode,
> > shutdown filesystem, to avoid corruption propagation.
> >
> > Signed-off-by: Cheng Lin <cheng.lin130@zte.com.cn>
> > ---
> >  fs/xfs/xfs_inode.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 9e62cc500..40cc106ae 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -919,6 +919,15 @@ xfs_droplink(
> >      xfs_trans_t *tp,
> >      xfs_inode_t *ip)
> >  {
> > +
> > +    if (VFS_I(ip)->i_nlink == 0) {
> > +        xfs_alert(ip->i_mount,
> > +              "%s: Deleting inode %llu with no links.",
> > +              __func__, ip->i_ino);
> > +        tp->t_flags |= XFS_TRANS_DIRTY;
> Marking the transaction dirty is not necessary.
> Otherwise this seems fine.
I found the transaction maybe not dirty at this point. Dirty the transaction
can make sure to shutdown fs in subsequent xfs_trans_cancel().

Cheng Lin Sept. 15, 2023, 9:50 a.m. UTC | #3

> On Wed, Sep 13, 2023 at 05:44:45PM +0800, cheng.lin130@zte.com.cn wrote:
> > From: Cheng Lin <cheng.lin130@zte.com.cn>
> >
> > When abnormal drop_nlink are detected on the inode,
> > shutdown filesystem, to avoid corruption propagation.
> >
> > Signed-off-by: Cheng Lin <cheng.lin130@zte.com.cn>
> > ---
> >  fs/xfs/xfs_inode.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 9e62cc500..40cc106ae 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -919,6 +919,15 @@ xfs_droplink(
> >      xfs_trans_t *tp,
> >      xfs_inode_t *ip)
> >  {
> > +
> > +    if (VFS_I(ip)->i_nlink == 0) {
> > +        xfs_alert(ip->i_mount,
> > +              "%s: Deleting inode %llu with no links.",
> > +              __func__, ip->i_ino);
> > +        tp->t_flags |= XFS_TRANS_DIRTY;
> Marking the transaction dirty is not necessary.
> Otherwise this seems fine.
Another strategy: 
Set nlink to an invalid value(like XFS_NLINK_PINNED), and
Complete this transaction before shutdown fs. To make sure
nlink not be zero. If the nlink of a directory are zero, it may
be cleaned up.
Is that appropriate?
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com

Dave Chinner Sept. 17, 2023, 10:44 p.m. UTC | #4

On Fri, Sep 15, 2023 at 05:50:56PM +0800, cheng.lin130@zte.com.cn wrote:
> > On Wed, Sep 13, 2023 at 05:44:45PM +0800, cheng.lin130@zte.com.cn wrote:
> > > From: Cheng Lin <cheng.lin130@zte.com.cn>
> > >
> > > When abnormal drop_nlink are detected on the inode,
> > > shutdown filesystem, to avoid corruption propagation.
> > >
> > > Signed-off-by: Cheng Lin <cheng.lin130@zte.com.cn>
> > > ---
> > >  fs/xfs/xfs_inode.c | 9 +++++++++
> > >  1 file changed, 9 insertions(+)
> > >
> > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > index 9e62cc500..40cc106ae 100644
> > > --- a/fs/xfs/xfs_inode.c
> > > +++ b/fs/xfs/xfs_inode.c
> > > @@ -919,6 +919,15 @@ xfs_droplink(
> > >      xfs_trans_t *tp,
> > >      xfs_inode_t *ip)
> > >  {
> > > +
> > > +    if (VFS_I(ip)->i_nlink == 0) {
> > > +        xfs_alert(ip->i_mount,
> > > +              "%s: Deleting inode %llu with no links.",
> > > +              __func__, ip->i_ino);
> > > +        tp->t_flags |= XFS_TRANS_DIRTY;
> > Marking the transaction dirty is not necessary.
> > Otherwise this seems fine.
> Another strategy: 
> Set nlink to an invalid value(like XFS_NLINK_PINNED), and
> Complete this transaction before shutdown fs. To make sure
> nlink not be zero. If the nlink of a directory are zero, it may
> be cleaned up.
> Is that appropriate?

No, all I'm asking you to do is drop dirtying of the transaction
from this patch because it is a) unnecessary and b) a layering
violation.

It is unnecessary because the transaction will almost always be
dirty before we get to xfs_droplink(). In the cases where it isn't
dirty (e.g. xfs_remove() on a directory) we explicitly check that
nlink == 2 before proceeding to call xfs_droplink(). Hence we can't
actually get to xfs_droplink() with a clean transaction, and so
marking it dirty here on underrun is unnecessary as returning an
error from xfs_droplink() will result in shutting down the
filesystem when the transaction is cancelled.

-Dave.

Cheng Lin Sept. 18, 2023, 3:44 a.m. UTC | #5

> On Fri, Sep 15, 2023 at 05:50:56PM +0800, cheng.lin130@zte.com.cn wrote:
> > > On Wed, Sep 13, 2023 at 05:44:45PM +0800, cheng.lin130@zte.com.cn wrote:
> > > > From: Cheng Lin <cheng.lin130@zte.com.cn>
> > > >
> > > > When abnormal drop_nlink are detected on the inode,
> > > > shutdown filesystem, to avoid corruption propagation.
> > > >
> > > > Signed-off-by: Cheng Lin <cheng.lin130@zte.com.cn>
> > > > ---
> > > >  fs/xfs/xfs_inode.c | 9 +++++++++
> > > >  1 file changed, 9 insertions(+)
> > > >
> > > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > > index 9e62cc500..40cc106ae 100644
> > > > --- a/fs/xfs/xfs_inode.c
> > > > +++ b/fs/xfs/xfs_inode.c
> > > > @@ -919,6 +919,15 @@ xfs_droplink(
> > > >      xfs_trans_t *tp,
> > > >      xfs_inode_t *ip)
> > > >  {
> > > > +
> > > > +    if (VFS_I(ip)->i_nlink == 0) {
> > > > +        xfs_alert(ip->i_mount,
> > > > +              "%s: Deleting inode %llu with no links.",
> > > > +              __func__, ip->i_ino);
> > > > +        tp->t_flags |= XFS_TRANS_DIRTY;
> > > Marking the transaction dirty is not necessary.
> > > Otherwise this seems fine.
> > Another strategy:
> > Set nlink to an invalid value(like XFS_NLINK_PINNED), and
> > Complete this transaction before shutdown fs. To make sure
> > nlink not be zero. If the nlink of a directory are zero, it may
> > be cleaned up.
> > Is that appropriate?
> No, all I'm asking you to do is drop dirtying of the transaction
> from this patch because it is a) unnecessary and b) a layering
> violation.
> It is unnecessary because the transaction will almost always be
> dirty before we get to xfs_droplink(). In the cases where it isn't
> dirty (e.g. xfs_remove() on a directory) we explicitly check that
> nlink == 2 before proceeding to call xfs_droplink(). Hence we can't
> actually get to xfs_droplink() with a clean transaction, and so
If the corrupted inode is a parent directory, when remove its
subdirectory, the parent's nlink will be decreased to 0. But the
transaction of subdirectory removing is not dirty (There are not
check about the parent directory). In this situation, the transaction
will be failed and the filesystem will be alive.

> marking it dirty here on underrun is unnecessary as returning an
> error from xfs_droplink() will result in shutting down the
> filesystem when the transaction is cancelled.
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com

Dave Chinner Sept. 18, 2023, 5:48 a.m. UTC | #6

On Mon, Sep 18, 2023 at 11:44:53AM +0800, cheng.lin130@zte.com.cn wrote:
> > On Fri, Sep 15, 2023 at 05:50:56PM +0800, cheng.lin130@zte.com.cn wrote:
> > > > On Wed, Sep 13, 2023 at 05:44:45PM +0800, cheng.lin130@zte.com.cn wrote:
> > > > > From: Cheng Lin <cheng.lin130@zte.com.cn>
> > > > >
> > > > > When abnormal drop_nlink are detected on the inode,
> > > > > shutdown filesystem, to avoid corruption propagation.
> > > > >
> > > > > Signed-off-by: Cheng Lin <cheng.lin130@zte.com.cn>
> > > > > ---
> > > > >  fs/xfs/xfs_inode.c | 9 +++++++++
> > > > >  1 file changed, 9 insertions(+)
> > > > >
> > > > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > > > index 9e62cc500..40cc106ae 100644
> > > > > --- a/fs/xfs/xfs_inode.c
> > > > > +++ b/fs/xfs/xfs_inode.c
> > > > > @@ -919,6 +919,15 @@ xfs_droplink(
> > > > >      xfs_trans_t *tp,
> > > > >      xfs_inode_t *ip)
> > > > >  {
> > > > > +
> > > > > +    if (VFS_I(ip)->i_nlink == 0) {
> > > > > +        xfs_alert(ip->i_mount,
> > > > > +              "%s: Deleting inode %llu with no links.",
> > > > > +              __func__, ip->i_ino);
> > > > > +        tp->t_flags |= XFS_TRANS_DIRTY;
> > > > Marking the transaction dirty is not necessary.
> > > > Otherwise this seems fine.
> > > Another strategy:
> > > Set nlink to an invalid value(like XFS_NLINK_PINNED), and
> > > Complete this transaction before shutdown fs. To make sure
> > > nlink not be zero. If the nlink of a directory are zero, it may
> > > be cleaned up.
> > > Is that appropriate?
> > No, all I'm asking you to do is drop dirtying of the transaction
> > from this patch because it is a) unnecessary and b) a layering
> > violation.
> > It is unnecessary because the transaction will almost always be
> > dirty before we get to xfs_droplink(). In the cases where it isn't
> > dirty (e.g. xfs_remove() on a directory) we explicitly check that
> > nlink == 2 before proceeding to call xfs_droplink(). Hence we can't
> > actually get to xfs_droplink() with a clean transaction, and so
> If the corrupted inode is a parent directory, when remove its
> subdirectory, the parent's nlink will be decreased to 0.  But the
> transaction of subdirectory removing is not dirty (There are not
> check about the parent directory). In this situation, the transaction
> will be failed and the filesystem will be alive.

Yes, and that's perfectly fine. The transaction cancelling code has
worked this way for the past 20 years or so...

Indeed, you said your customer wants the system to stay alive if possible,
right? Well, so do we.

If the parent directory has a bogus nlink count, and that prevents
us from removing items from the directory, then as long as we
haven't dirtied anything and we can return a -EFSCORRUPTED error to
userspace to say the unlink failed and we don't have to shut the
filesystem down. All we now have is a directory that has objects in
it that can't be removed....

For a higher level perspective, we only need to shut the filesystem
down if we cannot safely back out of the modification operation that
was requested. Whilst the transaction is clean, we can safely return
errors to userspace and continue operation because everything in
memory and on disk is still consistent, even if we have found a
corruption in non-crtical the metadata. Just returning an error to
userspace can't make the problem any worse.

This also is how we treat corruption that is found during read
operations - we return -EFSCORRUPTED to userspace because something in
the directory or inode we were trying to read from was corrupted. We
do not need to shut down the filesystem because there is
no risk of making things worse or the in-memory filesystem state
getting out of sync with the on-disk state.

It is only when we are trying to modify something that corruption
becomes a problem with fatal consequences. Once we've made a
modification, the in-memory state is different to the on-disk state
and whilst we are in that state any corruption we discover becomes
fatal. That is because there is no way to reconcile the changes
we've already made in memory with what is on-disk - we don't know
that the in-memory changes are good because we tripped over
corruption, and so we must not propagate bad in-memory state and
metadata to disk over the top of what may be still be uncorrupted
metadata on disk.

This "in memory matches on disk" state is effectively what the dirty
flag in the transaction tracks, and it's done as part of the normal
running of a transaction as items are tagged for logging. Marking a
transaction dirty that has nothign tagged for logging is actually an
incorrect state; we may handle it correctly, but it should never
actually occur and we should definitely not be open coding dirtying
of transactions to create this state.

IOWs, the transaction modification error handling paths already do
the right thing according to the state carried by the transaction at
the time the error was encountered.

-Dave.

Cheng Lin Sept. 18, 2023, 6:34 a.m. UTC | #7

> On Mon, Sep 18, 2023 at 11:44:53AM +0800, cheng.lin130@zte.com.cn wrote:
> > > On Fri, Sep 15, 2023 at 05:50:56PM +0800, cheng.lin130@zte.com.cn wrote:
> > > > > On Wed, Sep 13, 2023 at 05:44:45PM +0800, cheng.lin130@zte.com.cn wrote:
> > > > > > From: Cheng Lin <cheng.lin130@zte.com.cn>
> > > > > >
> > > > > > When abnormal drop_nlink are detected on the inode,
> > > > > > shutdown filesystem, to avoid corruption propagation.
> > > > > >
> > > > > > Signed-off-by: Cheng Lin <cheng.lin130@zte.com.cn>
> > > > > > ---
> > > > > >  fs/xfs/xfs_inode.c | 9 +++++++++
> > > > > >  1 file changed, 9 insertions(+)
> > > > > >
> > > > > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > > > > index 9e62cc500..40cc106ae 100644
> > > > > > --- a/fs/xfs/xfs_inode.c
> > > > > > +++ b/fs/xfs/xfs_inode.c
> > > > > > @@ -919,6 +919,15 @@ xfs_droplink(
> > > > > >      xfs_trans_t *tp,
> > > > > >      xfs_inode_t *ip)
> > > > > >  {
> > > > > > +
> > > > > > +    if (VFS_I(ip)->i_nlink == 0) {
> > > > > > +        xfs_alert(ip->i_mount,
> > > > > > +              "%s: Deleting inode %llu with no links.",
> > > > > > +              __func__, ip->i_ino);
> > > > > > +        tp->t_flags |= XFS_TRANS_DIRTY;
> > > > > Marking the transaction dirty is not necessary.
> > > > > Otherwise this seems fine.
> > > > Another strategy:
> > > > Set nlink to an invalid value(like XFS_NLINK_PINNED), and
> > > > Complete this transaction before shutdown fs. To make sure
> > > > nlink not be zero. If the nlink of a directory are zero, it may
> > > > be cleaned up.
> > > > Is that appropriate?
> > > No, all I'm asking you to do is drop dirtying of the transaction
> > > from this patch because it is a) unnecessary and b) a layering
> > > violation.
> > > It is unnecessary because the transaction will almost always be
> > > dirty before we get to xfs_droplink(). In the cases where it isn't
> > > dirty (e.g. xfs_remove() on a directory) we explicitly check that
> > > nlink == 2 before proceeding to call xfs_droplink(). Hence we can't
> > > actually get to xfs_droplink() with a clean transaction, and so
> > If the corrupted inode is a parent directory, when remove its
> > subdirectory, the parent's nlink will be decreased to 0.  But the
> > transaction of subdirectory removing is not dirty (There are not
> > check about the parent directory). In this situation, the transaction
> > will be failed and the filesystem will be alive.
> Yes, and that's perfectly fine. The transaction cancelling code has
> worked this way for the past 20 years or so...
> Indeed, you said your customer wants the system to stay alive if possible,
> right? Well, so do we.
> If the parent directory has a bogus nlink count, and that prevents
> us from removing items from the directory, then as long as we
> haven't dirtied anything and we can return a -EFSCORRUPTED error to
> userspace to say the unlink failed and we don't have to shut the
> filesystem down. All we now have is a directory that has objects in
> it that can't be removed....
> For a higher level perspective, we only need to shut the filesystem
> down if we cannot safely back out of the modification operation that
> was requested. Whilst the transaction is clean, we can safely return
> errors to userspace and continue operation because everything in
> memory and on disk is still consistent, even if we have found a
> corruption in non-crtical the metadata. Just returning an error to
> userspace can't make the problem any worse.
> This also is how we treat corruption that is found during read
> operations - we return -EFSCORRUPTED to userspace because something in
> the directory or inode we were trying to read from was corrupted. We
> do not need to shut down the filesystem because there is
> no risk of making things worse or the in-memory filesystem state
> getting out of sync with the on-disk state.
> It is only when we are trying to modify something that corruption
> becomes a problem with fatal consequences. Once we've made a
> modification, the in-memory state is different to the on-disk state
> and whilst we are in that state any corruption we discover becomes
> fatal. That is because there is no way to reconcile the changes
> we've already made in memory with what is on-disk - we don't know
> that the in-memory changes are good because we tripped over
> corruption, and so we must not propagate bad in-memory state and
> metadata to disk over the top of what may be still be uncorrupted
> metadata on disk.
> This "in memory matches on disk" state is effectively what the dirty
> flag in the transaction tracks, and it's done as part of the normal
> running of a transaction as items are tagged for logging. Marking a
> transaction dirty that has nothign tagged for logging is actually an
> incorrect state; we may handle it correctly, but it should never
> actually occur and we should definitely not be open coding dirtying
> of transactions to create this state.
> IOWs, the transaction modification error handling paths already do
> the right thing according to the state carried by the transaction at
> the time the error was encountered.
Thanks for your detailed explainations, I get it.
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com

Darrick J. Wong Sept. 19, 2023, 3:33 a.m. UTC | #8

On Mon, Sep 18, 2023 at 03:48:38PM +1000, Dave Chinner wrote:
> On Mon, Sep 18, 2023 at 11:44:53AM +0800, cheng.lin130@zte.com.cn wrote:
> > > On Fri, Sep 15, 2023 at 05:50:56PM +0800, cheng.lin130@zte.com.cn wrote:
> > > > > On Wed, Sep 13, 2023 at 05:44:45PM +0800, cheng.lin130@zte.com.cn wrote:
> > > > > > From: Cheng Lin <cheng.lin130@zte.com.cn>
> > > > > >
> > > > > > When abnormal drop_nlink are detected on the inode,
> > > > > > shutdown filesystem, to avoid corruption propagation.
> > > > > >
> > > > > > Signed-off-by: Cheng Lin <cheng.lin130@zte.com.cn>
> > > > > > ---
> > > > > >  fs/xfs/xfs_inode.c | 9 +++++++++
> > > > > >  1 file changed, 9 insertions(+)
> > > > > >
> > > > > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > > > > index 9e62cc500..40cc106ae 100644
> > > > > > --- a/fs/xfs/xfs_inode.c
> > > > > > +++ b/fs/xfs/xfs_inode.c
> > > > > > @@ -919,6 +919,15 @@ xfs_droplink(
> > > > > >      xfs_trans_t *tp,
> > > > > >      xfs_inode_t *ip)
> > > > > >  {
> > > > > > +
> > > > > > +    if (VFS_I(ip)->i_nlink == 0) {
> > > > > > +        xfs_alert(ip->i_mount,
> > > > > > +              "%s: Deleting inode %llu with no links.",
> > > > > > +              __func__, ip->i_ino);
> > > > > > +        tp->t_flags |= XFS_TRANS_DIRTY;
> > > > > Marking the transaction dirty is not necessary.
> > > > > Otherwise this seems fine.
> > > > Another strategy:
> > > > Set nlink to an invalid value(like XFS_NLINK_PINNED), and
> > > > Complete this transaction before shutdown fs. To make sure
> > > > nlink not be zero. If the nlink of a directory are zero, it may
> > > > be cleaned up.
> > > > Is that appropriate?
> > > No, all I'm asking you to do is drop dirtying of the transaction
> > > from this patch because it is a) unnecessary and b) a layering
> > > violation.
> > > It is unnecessary because the transaction will almost always be
> > > dirty before we get to xfs_droplink(). In the cases where it isn't
> > > dirty (e.g. xfs_remove() on a directory) we explicitly check that
> > > nlink == 2 before proceeding to call xfs_droplink(). Hence we can't
> > > actually get to xfs_droplink() with a clean transaction, and so
> > If the corrupted inode is a parent directory, when remove its
> > subdirectory, the parent's nlink will be decreased to 0.  But the
> > transaction of subdirectory removing is not dirty (There are not
> > check about the parent directory). In this situation, the transaction
> > will be failed and the filesystem will be alive.
> 
> Yes, and that's perfectly fine. The transaction cancelling code has
> worked this way for the past 20 years or so...
> 
> Indeed, you said your customer wants the system to stay alive if possible,
> right? Well, so do we.
> 
> If the parent directory has a bogus nlink count, and that prevents
> us from removing items from the directory, then as long as we
> haven't dirtied anything and we can return a -EFSCORRUPTED error to
> userspace to say the unlink failed and we don't have to shut the
> filesystem down. All we now have is a directory that has objects in
> it that can't be removed....
> 
> For a higher level perspective, we only need to shut the filesystem
> down if we cannot safely back out of the modification operation that
> was requested. Whilst the transaction is clean, we can safely return
> errors to userspace and continue operation because everything in
> memory and on disk is still consistent, even if we have found a
> corruption in non-crtical the metadata. Just returning an error to
> userspace can't make the problem any worse.
> 
> This also is how we treat corruption that is found during read
> operations - we return -EFSCORRUPTED to userspace because something in
> the directory or inode we were trying to read from was corrupted. We
> do not need to shut down the filesystem because there is
> no risk of making things worse or the in-memory filesystem state
> getting out of sync with the on-disk state.
> 
> It is only when we are trying to modify something that corruption
> becomes a problem with fatal consequences. Once we've made a
> modification, the in-memory state is different to the on-disk state
> and whilst we are in that state any corruption we discover becomes
> fatal. That is because there is no way to reconcile the changes
> we've already made in memory with what is on-disk - we don't know
> that the in-memory changes are good because we tripped over
> corruption, and so we must not propagate bad in-memory state and
> metadata to disk over the top of what may be still be uncorrupted
> metadata on disk.

It'd be a massive effort, but wouldn't it be fun if one could attach
defer ops to a transaction that updated incore state on commit but
otherwise never appeared on disk?

Let me cogitate on that during part 2 of vacation...

--D

> This "in memory matches on disk" state is effectively what the dirty
> flag in the transaction tracks, and it's done as part of the normal
> running of a transaction as items are tagged for logging. Marking a
> transaction dirty that has nothign tagged for logging is actually an
> incorrect state; we may handle it correctly, but it should never
> actually occur and we should definitely not be open coding dirtying
> of transactions to create this state.
> 
> IOWs, the transaction modification error handling paths already do
> the right thing according to the state carried by the transaction at
> the time the error was encountered.
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Dave Chinner Sept. 20, 2023, 5:53 a.m. UTC | #9

On Mon, Sep 18, 2023 at 08:33:35PM -0700, Darrick J. Wong wrote:
> On Mon, Sep 18, 2023 at 03:48:38PM +1000, Dave Chinner wrote:
> > It is only when we are trying to modify something that corruption
> > becomes a problem with fatal consequences. Once we've made a
> > modification, the in-memory state is different to the on-disk state
> > and whilst we are in that state any corruption we discover becomes
> > fatal. That is because there is no way to reconcile the changes
> > we've already made in memory with what is on-disk - we don't know
> > that the in-memory changes are good because we tripped over
> > corruption, and so we must not propagate bad in-memory state and
> > metadata to disk over the top of what may be still be uncorrupted
> > metadata on disk.
> 
> It'd be a massive effort, but wouldn't it be fun if one could attach
> defer ops to a transaction that updated incore state on commit but
> otherwise never appeared on disk?
>
> Let me cogitate on that during part 2 of vacation...

Sure, I'm interested to see what you might come up with.

My thoughts on rollback of dirty transactions come from a different
perspective.

Conceptually being able to roll back individual transactions isn't
that difficult. All it takes is a bit more memory and CPU - when we
join the item to the transaction we take a copy of the item we are
about to modify.

If we then cancel a dirty transaction, we then roll back all the
dirty items to their original state before we unlock them.  This
works fine for all the on-disk stuff we track in log items.

I have vague thoughts about how this could potentially be tied into
the shadow buffers we already use for keeping a delta copy of all
the committed in-memory changes in the CIL that we haven't yet
committed to the journal - that's actually the entire delta between
what is on disk and what we've changed prior to the current
transaction we are cancelling.

Hence, in theory, a rollback for a dirty log item is simply "read it
from disk again, copy the CIL shadow buffer delta into it".

However, the complexity comes with trying to roll back associated
in-memory state changes that we don't track as log items.  e.g.
incore extent list changes, in memory inode flag state (e.g.
XFS_ISTALE), etc. that's where all the hard problems to solve lie, I
think.

Another problem is how do we rollback from the middle of an intent
(defer ops) chain? We have to complete that chain for things to end
up consistent on disk, so we can't just cancel the current
transaction and say we are done and everything is clean.  Maybe
that's what you are thinking of here - each chain has an "undo"
intent chain that can roll back all the changes already made?

Cheers,

Dave.

Darrick J. Wong Oct. 3, 2023, 5:33 p.m. UTC | #10

On Wed, Sep 20, 2023 at 03:53:40PM +1000, Dave Chinner wrote:
> On Mon, Sep 18, 2023 at 08:33:35PM -0700, Darrick J. Wong wrote:
> > On Mon, Sep 18, 2023 at 03:48:38PM +1000, Dave Chinner wrote:
> > > It is only when we are trying to modify something that corruption
> > > becomes a problem with fatal consequences. Once we've made a
> > > modification, the in-memory state is different to the on-disk state
> > > and whilst we are in that state any corruption we discover becomes
> > > fatal. That is because there is no way to reconcile the changes
> > > we've already made in memory with what is on-disk - we don't know
> > > that the in-memory changes are good because we tripped over
> > > corruption, and so we must not propagate bad in-memory state and
> > > metadata to disk over the top of what may be still be uncorrupted
> > > metadata on disk.
> > 
> > It'd be a massive effort, but wouldn't it be fun if one could attach
> > defer ops to a transaction that updated incore state on commit but
> > otherwise never appeared on disk?
> >
> > Let me cogitate on that during part 2 of vacation...
> 
> Sure, I'm interested to see what you might come up with.
> 
> My thoughts on rollback of dirty transactions come from a different
> perspective.
> 
> Conceptually being able to roll back individual transactions isn't
> that difficult. All it takes is a bit more memory and CPU - when we
> join the item to the transaction we take a copy of the item we are
> about to modify.
> 
> If we then cancel a dirty transaction, we then roll back all the
> dirty items to their original state before we unlock them.  This
> works fine for all the on-disk stuff we track in log items.
> 
> I have vague thoughts about how this could potentially be tied into
> the shadow buffers we already use for keeping a delta copy of all
> the committed in-memory changes in the CIL that we haven't yet
> committed to the journal - that's actually the entire delta between
> what is on disk and what we've changed prior to the current
> transaction we are cancelling.
> 
> Hence, in theory, a rollback for a dirty log item is simply "read it
> from disk again, copy the CIL shadow buffer delta into it".

<nod> That's more or less the same as what I was thinking.

> However, the complexity comes with trying to roll back associated
> in-memory state changes that we don't track as log items.  e.g.
> incore extent list changes, in memory inode flag state (e.g.
> XFS_ISTALE), etc. that's where all the hard problems to solve lie, I
> think.

Yeah.  I was thinking that each of those incore state changes could be
implemented as a defer_ops that have NOP ->create_intent and
->create_done functions.  The ->finish_item would actually update the
incore structure.  This would be a very large project, and I'm not sure
that it wouldn't be easier to snapshot the xfs_inode fields themselves,
similar to how inode log items snapshot xfs_dinode fields.

(Snapshotting probably doesn't work for the more complex incore
inode structures.)

Kent has been wrangling with this problem for a while in bcachefs and I
think he's actually gotten the rollbacks to work more or less correctly.
He told me that it was a significant restructuring of his codebase even
though *everything* is tracked in btrees and the cursor abstraction
there is more robust than xfs.

> Another problem is how do we rollback from the middle of an intent
> (defer ops) chain? We have to complete that chain for things to end
> up consistent on disk, so we can't just cancel the current
> transaction and say we are done and everything is clean.  Maybe
> that's what you are thinking of here - each chain has an "undo"
> intent chain that can roll back all the changes already made?

Yes.  Every time we call ->finish_item on a log intent item, we also log
a new intent item that undoes whatever that step did.  These items we'll
call "log undo intent" items, and put them on a separate list, e.g.
tp->t_undoops.  If the chain completes successfully then the last step
is to abort everything on t_undoops to release all that memory.

If the chain does not succeed, then we'd abort the intents on t_dfops,
splice t_undoops onto t_dfops, and call xfs_defer_finish to write the
log undo intent items to disk and finish them.  If /that/ fails then we
have to shutdown.

I think this also means that buffer updates that are logged from a
->finish_item function should not be cancelled per above, since the undo
intent item will take care of that.  That would be easy if btree updates
made by an efi/cui/rui items used ordered buffers instead of logging
them directly like we do now.

For bui items, I think we'd need ordered buffers for bmbt updates and
snapshotting inode items for the inode updates themselves.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

[v3] xfs: introduce protection for drop nlink

Commit Message

Comments

Patch