diff mbox series

[RESEND,v2,01/18] xfs: Fix multi-transaction larp replay

Message ID 20220804194013.99237-2-allison.henderson@oracle.com (mailing list archive)
State New, archived
Headers show
Series Parent Pointers | expand

Commit Message

Allison Henderson Aug. 4, 2022, 7:39 p.m. UTC
Recent parent pointer testing has exposed a bug in the underlying
attr replay.  A multi transaction replay currently performs a
single step of the replay, then deferrs the rest if there is more
to do.  This causes race conditions with other attr replays that
might be recovered before the remaining deferred work has had a
chance to finish.  This can lead to interleaved set and remove
operations that may clobber the attribute fork.  Fix this by
deferring all work for any attribute operation.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
---
 fs/xfs/xfs_attr_item.c | 35 ++++++++---------------------------
 1 file changed, 8 insertions(+), 27 deletions(-)

Comments

Darrick J. Wong Aug. 9, 2022, 4:52 p.m. UTC | #1
On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison Henderson wrote:
> Recent parent pointer testing has exposed a bug in the underlying
> attr replay.  A multi transaction replay currently performs a
> single step of the replay, then deferrs the rest if there is more
> to do.  This causes race conditions with other attr replays that
> might be recovered before the remaining deferred work has had a
> chance to finish.  This can lead to interleaved set and remove
> operations that may clobber the attribute fork.  Fix this by
> deferring all work for any attribute operation.
> 
> Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
> ---
>  fs/xfs/xfs_attr_item.c | 35 ++++++++---------------------------
>  1 file changed, 8 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c
> index 5077a7ad5646..c13d724a3e13 100644
> --- a/fs/xfs/xfs_attr_item.c
> +++ b/fs/xfs/xfs_attr_item.c
> @@ -635,52 +635,33 @@ xfs_attri_item_recover(
>  		break;
>  	case XFS_ATTRI_OP_FLAGS_REMOVE:
>  		if (!xfs_inode_hasattr(args->dp))
> -			goto out;
> +			return 0;
>  		attr->xattri_dela_state = xfs_attr_init_remove_state(args);
>  		break;
>  	default:
>  		ASSERT(0);
> -		error = -EFSCORRUPTED;
> -		goto out;
> +		return -EFSCORRUPTED;
>  	}
>  
>  	xfs_init_attr_trans(args, &tres, &total);
>  	error = xfs_trans_alloc(mp, &tres, total, 0, XFS_TRANS_RESERVE, &tp);
>  	if (error)
> -		goto out;
> +		return error;
>  
>  	args->trans = tp;
>  	done_item = xfs_trans_get_attrd(tp, attrip);
> +	args->trans->t_flags |= XFS_TRANS_HAS_INTENT_DONE;
> +	set_bit(XFS_LI_DIRTY, &done_item->attrd_item.li_flags);
>  
>  	xfs_ilock(ip, XFS_ILOCK_EXCL);
>  	xfs_trans_ijoin(tp, ip, 0);
>  
> -	error = xfs_xattri_finish_update(attr, done_item);
> -	if (error == -EAGAIN) {
> -		/*
> -		 * There's more work to do, so add the intent item to this
> -		 * transaction so that we can continue it later.
> -		 */
> -		xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list);
> -		error = xfs_defer_ops_capture_and_commit(tp, capture_list);
> -		if (error)
> -			goto out_unlock;
> -
> -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -		xfs_irele(ip);
> -		return 0;
> -	}
> -	if (error) {
> -		xfs_trans_cancel(tp);
> -		goto out_unlock;
> -	}
> -
> +	xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list);

This seems a little convoluted to me.  Maybe?  Maybe not?

1. Log recovery recreates an incore xfs_attri_log_item from what it
finds in the log.

2. This function then logs an xattrd for the recovered xattri item.

3. Then it creates a new xfs_attr_intent to complete the operation.

4. Finally, it calls xfs_defer_ops_capture_and_commit, which logs a new
xattri for the intent created in step 3 and also commits the xattrd for
the first xattri.

IOWs, the only difference between before and after is that we're not
advancing one more step through the state machine as part of log
recovery.  From the perspective of the log, the recovery function merely
replaces the recovered xattri log item with a new one.

Why can't we just attach the recovered xattri to the xfs_defer_pending
that is created to point to the xfs_attr_intent that's created in step
3, and skip the xattrd?

I /think/ the answer to that question is that we might need to move the
log tail forward to free enough log space to finish the intent items, so
creating the extra xattrd/xattri (a) avoid the complexity of submitting
an incore intent item *and* a log intent item to the defer ops
machinery; and (b) avoid livelocks in log recovery.  Therefore, we
actually need to do it this way.

IOWS, I *think* this is ok, but want to see if others have differing
perspectives on how log item recovery works?

--D

>  	error = xfs_defer_ops_capture_and_commit(tp, capture_list);
> -out_unlock:
> +
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  	xfs_irele(ip);
> -out:
> -	xfs_attr_free_item(attr);
> +
>  	return error;
>  }
>  
> -- 
> 2.25.1
>
Dave Chinner Aug. 10, 2022, 1:58 a.m. UTC | #2
On Tue, Aug 09, 2022 at 09:52:55AM -0700, Darrick J. Wong wrote:
> On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison Henderson wrote:
> > Recent parent pointer testing has exposed a bug in the underlying
> > attr replay.  A multi transaction replay currently performs a
> > single step of the replay, then deferrs the rest if there is more
> > to do.

Yup.

> > This causes race conditions with other attr replays that
> > might be recovered before the remaining deferred work has had a
> > chance to finish.

What other attr replays are we racing against?  There can only be
one incomplete attr item intent/done chain per inode present in log
recovery, right?

> > This can lead to interleaved set and remove
> > operations that may clobber the attribute fork.  Fix this by
> > deferring all work for any attribute operation.

Which means this should be an impossible situation.

That is, if we crash before the final attrd DONE intent is written
to the log, it means that new attr intents for modifications made
*after* the current attr modification was completed will not be
present in the log. We have strict ordering of committed operations
in the journal, hence an operation on an inode has an incomplete
intent *must* be the last operation and the *only* incomplete intent
that is found in the journal for that inode.

Hence from an operational ordering persepective, this explanation
for issue being seen doesn't make any sense to me.  If there are
multiple incomplete attri intents then we've either got a runtime
journalling problem (a white-out issue? failing to relog the inode
in each new intent?) or a log recovery problem (failing to match
intent-done pairs correctly?), not a recovery deferral issue.

Hence I think we're still looking for the root cause of this
problem...

> > Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
> > ---
> >  fs/xfs/xfs_attr_item.c | 35 ++++++++---------------------------
> >  1 file changed, 8 insertions(+), 27 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c
> > index 5077a7ad5646..c13d724a3e13 100644
> > --- a/fs/xfs/xfs_attr_item.c
> > +++ b/fs/xfs/xfs_attr_item.c
> > @@ -635,52 +635,33 @@ xfs_attri_item_recover(
> >  		break;
> >  	case XFS_ATTRI_OP_FLAGS_REMOVE:
> >  		if (!xfs_inode_hasattr(args->dp))
> > -			goto out;
> > +			return 0;
> >  		attr->xattri_dela_state = xfs_attr_init_remove_state(args);
> >  		break;
> >  	default:
> >  		ASSERT(0);
> > -		error = -EFSCORRUPTED;
> > -		goto out;
> > +		return -EFSCORRUPTED;
> >  	}
> >  
> >  	xfs_init_attr_trans(args, &tres, &total);
> >  	error = xfs_trans_alloc(mp, &tres, total, 0, XFS_TRANS_RESERVE, &tp);
> >  	if (error)
> > -		goto out;
> > +		return error;
> >  
> >  	args->trans = tp;
> >  	done_item = xfs_trans_get_attrd(tp, attrip);
> > +	args->trans->t_flags |= XFS_TRANS_HAS_INTENT_DONE;
> > +	set_bit(XFS_LI_DIRTY, &done_item->attrd_item.li_flags);
> >  
> >  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> >  	xfs_trans_ijoin(tp, ip, 0);
> >  
> > -	error = xfs_xattri_finish_update(attr, done_item);
> > -	if (error == -EAGAIN) {
> > -		/*
> > -		 * There's more work to do, so add the intent item to this
> > -		 * transaction so that we can continue it later.
> > -		 */
> > -		xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list);
> > -		error = xfs_defer_ops_capture_and_commit(tp, capture_list);
> > -		if (error)
> > -			goto out_unlock;
> > -
> > -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > -		xfs_irele(ip);
> > -		return 0;
> > -	}
> > -	if (error) {
> > -		xfs_trans_cancel(tp);
> > -		goto out_unlock;
> > -	}
> > -
> > +	xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list);
> 
> This seems a little convoluted to me.  Maybe?  Maybe not?
> 
> 1. Log recovery recreates an incore xfs_attri_log_item from what it
> finds in the log.
> 
> 2. This function then logs an xattrd for the recovered xattri item.
> 
> 3. Then it creates a new xfs_attr_intent to complete the operation.
> 
> 4. Finally, it calls xfs_defer_ops_capture_and_commit, which logs a new
> xattri for the intent created in step 3 and also commits the xattrd for
> the first xattri.
> 
> IOWs, the only difference between before and after is that we're not
> advancing one more step through the state machine as part of log
> recovery.  From the perspective of the log, the recovery function merely
> replaces the recovered xattri log item with a new one.
> 
> Why can't we just attach the recovered xattri to the xfs_defer_pending
> that is created to point to the xfs_attr_intent that's created in step
> 3, and skip the xattrd?

Remember that attribute intents are different to all other intent
types that we have. The existing extent based intents define a
single indepedent operation that needs to be performed, and each
step of the intent chain is completely independent of the previous
step in the chain.  e.g. removing the extent from the rmap btree is
completely independent of removing it from the inode bmap btree -
all that matters is that the removal from the bmbt happens first.
The rmapbt removal can happen at any time after than, and is
completely independent of any other bmbt or rmapbt operation.
Similarly, the EFI can processed independently of all bmapbt and
rmapbt modifications, it just has to happen after those
modifications are done.

Hence if we crash during recovery, we can just restart from
where-ever we got to in the middle of the intent chains and not have
to care at all.  IOWs, eventual consistency works with these chains
because there is no dependencies between each step of the intent
chain and each step is completely independent of the other steps.

Attribute intent chains are completely different. They link steps in
a state machine together in a non-trivial, highly dependent chain.
We can't just restart the chain in the middle like we can for the
BUI->RUI->CUI->EFI chain because the on-disk attribute is in an
unknown state and recovering that exact state is .... complex.

Hence the the first step of recovery is to return the attribute we
are trying to modify back to a known state. That means we have to
perform a removal of any existing attribute under that name first.
Hence this first step should be replacing the existing attr intent
with the intent that defines the recovery operation we are going to
perform.

That means we need to translate set to replace so that cleanup is
run first, replace needs to clean up the attr under that name
regardless of whether it has the incomplete bit set on it or not.
Remove is the only operation that runs the same as at runtime, as
cleanup for remove is just repeating the remove operation from
scratch.

> I /think/ the answer to that question is that we might need to move the
> log tail forward to free enough log space to finish the intent items, so
> creating the extra xattrd/xattri (a) avoid the complexity of submitting
> an incore intent item *and* a log intent item to the defer ops
> machinery; and (b) avoid livelocks in log recovery.  Therefore, we
> actually need to do it this way.

We really need the initial operation to rewrite the intent to match
the recovery operation we are going to perform. Everything else is
secondary.

Cheers,

Dave.
Allison Henderson Aug. 10, 2022, 3:08 a.m. UTC | #3
On Tue, 2022-08-09 at 09:52 -0700, Darrick J. Wong wrote:
> On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison Henderson wrote:
> > Recent parent pointer testing has exposed a bug in the underlying
> > attr replay.  A multi transaction replay currently performs a
> > single step of the replay, then deferrs the rest if there is more
> > to do.  This causes race conditions with other attr replays that
> > might be recovered before the remaining deferred work has had a
> > chance to finish.  This can lead to interleaved set and remove
> > operations that may clobber the attribute fork.  Fix this by
> > deferring all work for any attribute operation.
> > 
> > Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
> > ---
> >  fs/xfs/xfs_attr_item.c | 35 ++++++++---------------------------
> >  1 file changed, 8 insertions(+), 27 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c
> > index 5077a7ad5646..c13d724a3e13 100644
> > --- a/fs/xfs/xfs_attr_item.c
> > +++ b/fs/xfs/xfs_attr_item.c
> > @@ -635,52 +635,33 @@ xfs_attri_item_recover(
> >  		break;
> >  	case XFS_ATTRI_OP_FLAGS_REMOVE:
> >  		if (!xfs_inode_hasattr(args->dp))
> > -			goto out;
> > +			return 0;
> >  		attr->xattri_dela_state =
> > xfs_attr_init_remove_state(args);
> >  		break;
> >  	default:
> >  		ASSERT(0);
> > -		error = -EFSCORRUPTED;
> > -		goto out;
> > +		return -EFSCORRUPTED;
> >  	}
> >  
> >  	xfs_init_attr_trans(args, &tres, &total);
> >  	error = xfs_trans_alloc(mp, &tres, total, 0, XFS_TRANS_RESERVE,
> > &tp);
> >  	if (error)
> > -		goto out;
> > +		return error;
> >  
> >  	args->trans = tp;
> >  	done_item = xfs_trans_get_attrd(tp, attrip);
> > +	args->trans->t_flags |= XFS_TRANS_HAS_INTENT_DONE;
> > +	set_bit(XFS_LI_DIRTY, &done_item->attrd_item.li_flags);
> >  
> >  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> >  	xfs_trans_ijoin(tp, ip, 0);
> >  
> > -	error = xfs_xattri_finish_update(attr, done_item);
> > -	if (error == -EAGAIN) {
> > -		/*
> > -		 * There's more work to do, so add the intent item to
> > this
> > -		 * transaction so that we can continue it later.
> > -		 */
> > -		xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr-
> > >xattri_list);
> > -		error = xfs_defer_ops_capture_and_commit(tp,
> > capture_list);
> > -		if (error)
> > -			goto out_unlock;
> > -
> > -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > -		xfs_irele(ip);
> > -		return 0;
> > -	}
> > -	if (error) {
> > -		xfs_trans_cancel(tp);
> > -		goto out_unlock;
> > -	}
> > -
> > +	xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list);
> 
> This seems a little convoluted to me.  Maybe?  Maybe not?
> 
> 1. Log recovery recreates an incore xfs_attri_log_item from what it
> finds in the log.
> 
> 2. This function then logs an xattrd for the recovered xattri item.
> 
> 3. Then it creates a new xfs_attr_intent to complete the operation.
> 
> 4. Finally, it calls xfs_defer_ops_capture_and_commit, which logs a
> new
> xattri for the intent created in step 3 and also commits the xattrd
> for
> the first xattri.
> 
> IOWs, the only difference between before and after is that we're not
> advancing one more step through the state machine as part of log
> recovery.  From the perspective of the log, the recovery function
> merely
> replaces the recovered xattri log item with a new one.
> 
> Why can't we just attach the recovered xattri to the
> xfs_defer_pending
> that is created to point to the xfs_attr_intent that's created in
> step
> 3, and skip the xattrd?
Oh, I see.  I hadnt thought of doing it that way, this was based on the
initial solution suggested to the first patch of v1 (xfs: Add larp
state XFS_DAS_CREATE_FORK).  But what you mention below also makes
sense. So I suppose if no one has any gripes then maybe it should stay
as it is then.  Thx for the reviews!

Allison

> 
> I /think/ the answer to that question is that we might need to move
> the
> log tail forward to free enough log space to finish the intent items,
> so
> creating the extra xattrd/xattri (a) avoid the complexity of
> submitting
> an incore intent item *and* a log intent item to the defer ops
> machinery; and (b) avoid livelocks in log recovery.  Therefore, we
> actually need to do it this way.
> 
> IOWS, I *think* this is ok, but want to see if others have differing
> perspectives on how log item recovery works?
> 
> --D
> 
> >  	error = xfs_defer_ops_capture_and_commit(tp, capture_list);
> > -out_unlock:
> > +
> >  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> >  	xfs_irele(ip);
> > -out:
> > -	xfs_attr_free_item(attr);
> > +
> >  	return error;
> >  }
> >  
> > -- 
> > 2.25.1
> >
Allison Henderson Aug. 10, 2022, 5:01 a.m. UTC | #4
On Wed, 2022-08-10 at 11:58 +1000, Dave Chinner wrote:
> On Tue, Aug 09, 2022 at 09:52:55AM -0700, Darrick J. Wong wrote:
> > On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison Henderson wrote:
> > > Recent parent pointer testing has exposed a bug in the underlying
> > > attr replay.  A multi transaction replay currently performs a
> > > single step of the replay, then deferrs the rest if there is more
> > > to do.
> 
> Yup.
> 
> > > This causes race conditions with other attr replays that
> > > might be recovered before the remaining deferred work has had a
> > > chance to finish.
> 
> What other attr replays are we racing against?  There can only be
> one incomplete attr item intent/done chain per inode present in log
> recovery, right?
No, a rename queues up a set and remove before committing the
transaction.  One for the new parent pointer, and another to remove the
old one.  It cant be an attr replace because technically the names are
different.

So the recovered set grows the leaf, and returns the egain, then rest
gets capture committed.  Next up is the recovered remove which pulls
out the fork, which causes problems when the rest of the set operation
resumes as a deferred operation.  Here is the link to the original
discussion, it was quite a while ago:

https://lore.kernel.org/all/Yrzw9F5aGsaldrmR@magnolia/

I hope that helps?
Allison

> 
> > > This can lead to interleaved set and remove
> > > operations that may clobber the attribute fork.  Fix this by
> > > deferring all work for any attribute operation.
> 
> Which means this should be an impossible situation.
> 
> That is, if we crash before the final attrd DONE intent is written
> to the log, it means that new attr intents for modifications made
> *after* the current attr modification was completed will not be
> present in the log. We have strict ordering of committed operations
> in the journal, hence an operation on an inode has an incomplete
> intent *must* be the last operation and the *only* incomplete intent
> that is found in the journal for that inode.
> 
> Hence from an operational ordering persepective, this explanation
> for issue being seen doesn't make any sense to me.  If there are
> multiple incomplete attri intents then we've either got a runtime
> journalling problem (a white-out issue? failing to relog the inode
> in each new intent?) or a log recovery problem (failing to match
> intent-done pairs correctly?), not a recovery deferral issue.
> 
> Hence I think we're still looking for the root cause of this
> problem...
> 
> > > Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
> > > ---
> > >  fs/xfs/xfs_attr_item.c | 35 ++++++++---------------------------
> > >  1 file changed, 8 insertions(+), 27 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c
> > > index 5077a7ad5646..c13d724a3e13 100644
> > > --- a/fs/xfs/xfs_attr_item.c
> > > +++ b/fs/xfs/xfs_attr_item.c
> > > @@ -635,52 +635,33 @@ xfs_attri_item_recover(
> > >  		break;
> > >  	case XFS_ATTRI_OP_FLAGS_REMOVE:
> > >  		if (!xfs_inode_hasattr(args->dp))
> > > -			goto out;
> > > +			return 0;
> > >  		attr->xattri_dela_state =
> > > xfs_attr_init_remove_state(args);
> > >  		break;
> > >  	default:
> > >  		ASSERT(0);
> > > -		error = -EFSCORRUPTED;
> > > -		goto out;
> > > +		return -EFSCORRUPTED;
> > >  	}
> > >  
> > >  	xfs_init_attr_trans(args, &tres, &total);
> > >  	error = xfs_trans_alloc(mp, &tres, total, 0, XFS_TRANS_RESERVE,
> > > &tp);
> > >  	if (error)
> > > -		goto out;
> > > +		return error;
> > >  
> > >  	args->trans = tp;
> > >  	done_item = xfs_trans_get_attrd(tp, attrip);
> > > +	args->trans->t_flags |= XFS_TRANS_HAS_INTENT_DONE;
> > > +	set_bit(XFS_LI_DIRTY, &done_item->attrd_item.li_flags);
> > >  
> > >  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > >  	xfs_trans_ijoin(tp, ip, 0);
> > >  
> > > -	error = xfs_xattri_finish_update(attr, done_item);
> > > -	if (error == -EAGAIN) {
> > > -		/*
> > > -		 * There's more work to do, so add the intent item to
> > > this
> > > -		 * transaction so that we can continue it later.
> > > -		 */
> > > -		xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr-
> > > >xattri_list);
> > > -		error = xfs_defer_ops_capture_and_commit(tp,
> > > capture_list);
> > > -		if (error)
> > > -			goto out_unlock;
> > > -
> > > -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > -		xfs_irele(ip);
> > > -		return 0;
> > > -	}
> > > -	if (error) {
> > > -		xfs_trans_cancel(tp);
> > > -		goto out_unlock;
> > > -	}
> > > -
> > > +	xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list);
> > 
> > This seems a little convoluted to me.  Maybe?  Maybe not?
> > 
> > 1. Log recovery recreates an incore xfs_attri_log_item from what it
> > finds in the log.
> > 
> > 2. This function then logs an xattrd for the recovered xattri item.
> > 
> > 3. Then it creates a new xfs_attr_intent to complete the operation.
> > 
> > 4. Finally, it calls xfs_defer_ops_capture_and_commit, which logs a
> > new
> > xattri for the intent created in step 3 and also commits the xattrd
> > for
> > the first xattri.
> > 
> > IOWs, the only difference between before and after is that we're
> > not
> > advancing one more step through the state machine as part of log
> > recovery.  From the perspective of the log, the recovery function
> > merely
> > replaces the recovered xattri log item with a new one.
> > 
> > Why can't we just attach the recovered xattri to the
> > xfs_defer_pending
> > that is created to point to the xfs_attr_intent that's created in
> > step
> > 3, and skip the xattrd?
> 
> Remember that attribute intents are different to all other intent
> types that we have. The existing extent based intents define a
> single indepedent operation that needs to be performed, and each
> step of the intent chain is completely independent of the previous
> step in the chain.  e.g. removing the extent from the rmap btree is
> completely independent of removing it from the inode bmap btree -
> all that matters is that the removal from the bmbt happens first.
> The rmapbt removal can happen at any time after than, and is
> completely independent of any other bmbt or rmapbt operation.
> Similarly, the EFI can processed independently of all bmapbt and
> rmapbt modifications, it just has to happen after those
> modifications are done.
> 
> Hence if we crash during recovery, we can just restart from
> where-ever we got to in the middle of the intent chains and not have
> to care at all.  IOWs, eventual consistency works with these chains
> because there is no dependencies between each step of the intent
> chain and each step is completely independent of the other steps.
> 
> Attribute intent chains are completely different. They link steps in
> a state machine together in a non-trivial, highly dependent chain.
> We can't just restart the chain in the middle like we can for the
> BUI->RUI->CUI->EFI chain because the on-disk attribute is in an
> unknown state and recovering that exact state is .... complex.
> 
> Hence the the first step of recovery is to return the attribute we
> are trying to modify back to a known state. That means we have to
> perform a removal of any existing attribute under that name first.
> Hence this first step should be replacing the existing attr intent
> with the intent that defines the recovery operation we are going to
> perform.
> 
> That means we need to translate set to replace so that cleanup is
> run first, replace needs to clean up the attr under that name
> regardless of whether it has the incomplete bit set on it or not.
> Remove is the only operation that runs the same as at runtime, as
> cleanup for remove is just repeating the remove operation from
> scratch.
> 
> > I /think/ the answer to that question is that we might need to move
> > the
> > log tail forward to free enough log space to finish the intent
> > items, so
> > creating the extra xattrd/xattri (a) avoid the complexity of
> > submitting
> > an incore intent item *and* a log intent item to the defer ops
> > machinery; and (b) avoid livelocks in log recovery.  Therefore, we
> > actually need to do it this way.
> 
> We really need the initial operation to rewrite the intent to match
> the recovery operation we are going to perform. Everything else is
> secondary.
> 
> Cheers,
> 
> Dave.
Dave Chinner Aug. 10, 2022, 6:12 a.m. UTC | #5
On Tue, Aug 09, 2022 at 10:01:49PM -0700, Alli wrote:
> On Wed, 2022-08-10 at 11:58 +1000, Dave Chinner wrote:
> > On Tue, Aug 09, 2022 at 09:52:55AM -0700, Darrick J. Wong wrote:
> > > On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison Henderson wrote:
> > > > Recent parent pointer testing has exposed a bug in the underlying
> > > > attr replay.  A multi transaction replay currently performs a
> > > > single step of the replay, then deferrs the rest if there is more
> > > > to do.
> > 
> > Yup.
> > 
> > > > This causes race conditions with other attr replays that
> > > > might be recovered before the remaining deferred work has had a
> > > > chance to finish.
> > 
> > What other attr replays are we racing against?  There can only be
> > one incomplete attr item intent/done chain per inode present in log
> > recovery, right?
> No, a rename queues up a set and remove before committing the
> transaction.  One for the new parent pointer, and another to remove the
> old one.

Ah. That really needs to be described in the commit message -
changing from "single intent chain per object" to "multiple
concurrent independent and unserialised intent chains per object" is
a pretty important design rule change...

The whole point of intents is to allow complex, multi-stage
operations on a single object to be sequenced in a tightly
controlled manner. They weren't intended to be run as concurrent
lines of modification on single items; if you need to do two
modifications on an object, the intent chain ties the two
modifications together into a single whole.

One of the reasons I rewrote the attr state machine for LARP was to
enable new multiple attr operation chains to be easily build from
the entry points the state machien provides. Parent attr rename
needs a new intent chain to be built, not run multiple independent
intent chains for each modification.

> It cant be an attr replace because technically the names are
> different.

I disagree - we have all the pieces we need in the state machine
already, we just need to define separate attr names for the
remove and insert steps in the attr intent.

That is, the "replace" operation we execute when an attr set
overwrites the value is "technically" a "replace value" operation,
but we actually implement it as a "replace entire attribute"
operation.

Without LARP, we do that overwrite in independent steps via an
intermediate INCOMPLETE state to allow two xattrs of the same name
to exist in the attr tree at the same time. IOWs, the attr value
overwrite is effectively a "set-swap-remove" operation on two
entirely independent xattrs, ensuring that if we crash we always
have either the old or new xattr visible.

With LARP, we can remove the original attr first, thereby avoiding
the need for two versions of the xattr to exist in the tree in the
first place. However, we have to do these two operations as a pair
of linked independent operations. The intent chain provides the
linking, and requires us to log the name and the value of the attr
that we are overwriting in the intent. Hence we can always recover
the modification to completion no matter where in the operation we
fail.

When it comes to a parent attr rename operation, we are effectively
doing two linked operations - remove the old attr, set the new attr
- on different attributes. Implementation wise, it is exactly the
same sequence as a "replace value" operation, except for the fact
that the new attr we add has a different name.

Hence the only real difference between the existing "attr replace"
and the intent chain we need for "parent attr rename" is that we
have to log two attr names instead of one. Basically, we have a new
XFS_ATTRI_OP_FLAGS... type for this, and that's what tells us that
we are operating on two different attributes instead of just one.

The recovery operation becomes slightly different - we have to run a
remove on the old, then a replace on the new - so there a little bit
of new code needed to manage that in the state machine.

These, however, are just small tweaks on the existing replace attr
operation, and there should be little difference in performance or
overhead between a "replace value" and a "replace entire xattr"
operation as they are largely the same runtime operation for LARP.

> So the recovered set grows the leaf, and returns the egain, then rest
> gets capture committed.  Next up is the recovered remove which pulls
> out the fork, which causes problems when the rest of the set operation
> resumes as a deferred operation.

Yup, and all this goes away when we build the right intent chain for
replacing a parent attr rename....

Cheers,

Dave.
Darrick J. Wong Aug. 10, 2022, 3:52 p.m. UTC | #6
On Wed, Aug 10, 2022 at 04:12:58PM +1000, Dave Chinner wrote:
> On Tue, Aug 09, 2022 at 10:01:49PM -0700, Alli wrote:
> > On Wed, 2022-08-10 at 11:58 +1000, Dave Chinner wrote:
> > > On Tue, Aug 09, 2022 at 09:52:55AM -0700, Darrick J. Wong wrote:
> > > > On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison Henderson wrote:
> > > > > Recent parent pointer testing has exposed a bug in the underlying
> > > > > attr replay.  A multi transaction replay currently performs a
> > > > > single step of the replay, then deferrs the rest if there is more
> > > > > to do.
> > > 
> > > Yup.
> > > 
> > > > > This causes race conditions with other attr replays that
> > > > > might be recovered before the remaining deferred work has had a
> > > > > chance to finish.
> > > 
> > > What other attr replays are we racing against?  There can only be
> > > one incomplete attr item intent/done chain per inode present in log
> > > recovery, right?
> > No, a rename queues up a set and remove before committing the
> > transaction.  One for the new parent pointer, and another to remove the
> > old one.
> 
> Ah. That really needs to be described in the commit message -
> changing from "single intent chain per object" to "multiple
> concurrent independent and unserialised intent chains per object" is
> a pretty important design rule change...
> 
> The whole point of intents is to allow complex, multi-stage
> operations on a single object to be sequenced in a tightly
> controlled manner. They weren't intended to be run as concurrent
> lines of modification on single items; if you need to do two
> modifications on an object, the intent chain ties the two
> modifications together into a single whole.

Back when I made the suggestion that resulted in this patch, I was
pondering why it is that (say) atomic swapext didn't suffer from these
recovery problems, and I realized that for any given inode, you can only
have one ongoing swapext operation at a time.  That's why recovery of
swapext operations works fine, whereas pptr recovery has this quirk.

At the time, my thought process was more narrowly focused on making log
recovery mimic runtime more closely.  I didn't make the connection
between this problem and the other open question I had (see the bottom)
about how to fix pptr attrs when rebuilding a directory.

> One of the reasons I rewrote the attr state machine for LARP was to
> enable new multiple attr operation chains to be easily build from
> the entry points the state machien provides. Parent attr rename
> needs a new intent chain to be built, not run multiple independent
> intent chains for each modification.
> 
> > It cant be an attr replace because technically the names are
> > different.
> 
> I disagree - we have all the pieces we need in the state machine
> already, we just need to define separate attr names for the
> remove and insert steps in the attr intent.
> 
> That is, the "replace" operation we execute when an attr set
> overwrites the value is "technically" a "replace value" operation,
> but we actually implement it as a "replace entire attribute"
> operation.

OH.  Right.  I forgot that ATTR_REPLACE=="replace entire attr".

If I'm understanding this right, that means that the xfs_rename patch
ought to detect the situation where there's an existing dirent in the
target directory, and do something along the lines of:

	} else { /* target_ip != NULL */
		xfs_dir_replace(...);

		xfs_parent_defer_replace(tp, new_parent_ptr, target_dp,
				old_diroffset, target_name,
				new_diroffset);

		xfs_trans_ichgtime(...);

Where the xfs_parent_defer_replace operation does an ATTR_REPLACE to
switch:

(target_dp_ino, target_gen, old_diroffset) == <dontcare>

to this:

(target_dp_ino, target_gen, new_diroffset) == target_name

except, I think we have to log the old name in addition to the new name,
because userspace ATTR_REPLACE operations don't allow name changes?

I guess this also implies that xfs_dir_replace will pass out the offset
of the old name, in addition to the offset of the new name.

> Without LARP, we do that overwrite in independent steps via an
> intermediate INCOMPLETE state to allow two xattrs of the same name
> to exist in the attr tree at the same time. IOWs, the attr value
> overwrite is effectively a "set-swap-remove" operation on two
> entirely independent xattrs, ensuring that if we crash we always
> have either the old or new xattr visible.
> 
> With LARP, we can remove the original attr first, thereby avoiding
> the need for two versions of the xattr to exist in the tree in the
> first place. However, we have to do these two operations as a pair
> of linked independent operations. The intent chain provides the
> linking, and requires us to log the name and the value of the attr
> that we are overwriting in the intent. Hence we can always recover
> the modification to completion no matter where in the operation we
> fail.
> 
> When it comes to a parent attr rename operation, we are effectively
> doing two linked operations - remove the old attr, set the new attr
> - on different attributes. Implementation wise, it is exactly the
> same sequence as a "replace value" operation, except for the fact
> that the new attr we add has a different name.
> 
> Hence the only real difference between the existing "attr replace"
> and the intent chain we need for "parent attr rename" is that we
> have to log two attr names instead of one. Basically, we have a new
> XFS_ATTRI_OP_FLAGS... type for this, and that's what tells us that
> we are operating on two different attributes instead of just one.

This answers my earlier question: Yes, and yes.

> The recovery operation becomes slightly different - we have to run a
> remove on the old, then a replace on the new - so there a little bit
> of new code needed to manage that in the state machine.
> 
> These, however, are just small tweaks on the existing replace attr
> operation, and there should be little difference in performance or
> overhead between a "replace value" and a "replace entire xattr"
> operation as they are largely the same runtime operation for LARP.
> 
> > So the recovered set grows the leaf, and returns the egain, then rest
> > gets capture committed.  Next up is the recovered remove which pulls
> > out the fork, which causes problems when the rest of the set operation
> > resumes as a deferred operation.
> 
> Yup, and all this goes away when we build the right intent chain for
> replacing a parent attr rename....

Funnily enough, just last week I had thought that online repair was
going to require the ability to replace an entire xattr...

https://djwong.org/docs/xfs-online-fsck-design/#parent-pointers

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Allison Henderson Aug. 10, 2022, 7:28 p.m. UTC | #7
On Wed, 2022-08-10 at 08:52 -0700, Darrick J. Wong wrote:
> On Wed, Aug 10, 2022 at 04:12:58PM +1000, Dave Chinner wrote:
> > On Tue, Aug 09, 2022 at 10:01:49PM -0700, Alli wrote:
> > > On Wed, 2022-08-10 at 11:58 +1000, Dave Chinner wrote:
> > > > On Tue, Aug 09, 2022 at 09:52:55AM -0700, Darrick J. Wong
> > > > wrote:
> > > > > On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison Henderson
> > > > > wrote:
> > > > > > Recent parent pointer testing has exposed a bug in the
> > > > > > underlying
> > > > > > attr replay.  A multi transaction replay currently performs
> > > > > > a
> > > > > > single step of the replay, then deferrs the rest if there
> > > > > > is more
> > > > > > to do.
> > > > 
> > > > Yup.
> > > > 
> > > > > > This causes race conditions with other attr replays that
> > > > > > might be recovered before the remaining deferred work has
> > > > > > had a
> > > > > > chance to finish.
> > > > 
> > > > What other attr replays are we racing against?  There can only
> > > > be
> > > > one incomplete attr item intent/done chain per inode present in
> > > > log
> > > > recovery, right?
> > > No, a rename queues up a set and remove before committing the
> > > transaction.  One for the new parent pointer, and another to
> > > remove the
> > > old one.
> > 
> > Ah. That really needs to be described in the commit message -
> > changing from "single intent chain per object" to "multiple
> > concurrent independent and unserialised intent chains per object"
> > is
> > a pretty important design rule change...
> > 
> > The whole point of intents is to allow complex, multi-stage
> > operations on a single object to be sequenced in a tightly
> > controlled manner. They weren't intended to be run as concurrent
> > lines of modification on single items; if you need to do two
> > modifications on an object, the intent chain ties the two
> > modifications together into a single whole.
> 
> Back when I made the suggestion that resulted in this patch, I was
> pondering why it is that (say) atomic swapext didn't suffer from
> these
> recovery problems, and I realized that for any given inode, you can
> only
> have one ongoing swapext operation at a time.  That's why recovery of
> swapext operations works fine, whereas pptr recovery has this quirk.
> 
> At the time, my thought process was more narrowly focused on making
> log
> recovery mimic runtime more closely.  I didn't make the connection
> between this problem and the other open question I had (see the
> bottom)
> about how to fix pptr attrs when rebuilding a directory.
> 
> > One of the reasons I rewrote the attr state machine for LARP was to
> > enable new multiple attr operation chains to be easily build from
> > the entry points the state machien provides. Parent attr rename
> > needs a new intent chain to be built, not run multiple independent
> > intent chains for each modification.
> > 
> > > It cant be an attr replace because technically the names are
> > > different.
> > 
> > I disagree - we have all the pieces we need in the state machine
> > already, we just need to define separate attr names for the
> > remove and insert steps in the attr intent.
> > 
> > That is, the "replace" operation we execute when an attr set
> > overwrites the value is "technically" a "replace value" operation,
> > but we actually implement it as a "replace entire attribute"
> > operation.
> 
> OH.  Right.  I forgot that ATTR_REPLACE=="replace entire attr".
> 
> If I'm understanding this right, that means that the xfs_rename patch
> ought to detect the situation where there's an existing dirent in the
> target directory, and do something along the lines of:
> 
> 	} else { /* target_ip != NULL */
> 		xfs_dir_replace(...);
> 
> 		xfs_parent_defer_replace(tp, new_parent_ptr, target_dp,
> 				old_diroffset, target_name,
> 				new_diroffset);
> 
> 		xfs_trans_ichgtime(...);
> 
> Where the xfs_parent_defer_replace operation does an ATTR_REPLACE to
> switch:
> 
> (target_dp_ino, target_gen, old_diroffset) == <dontcare>
> 
> to this:
> 
> (target_dp_ino, target_gen, new_diroffset) == target_name
> 
> except, I think we have to log the old name in addition to the new
> name,
> because userspace ATTR_REPLACE operations don't allow name changes?
> 
> I guess this also implies that xfs_dir_replace will pass out the
> offset
> of the old name, in addition to the offset of the new name.
> 
> > Without LARP, we do that overwrite in independent steps via an
> > intermediate INCOMPLETE state to allow two xattrs of the same name
> > to exist in the attr tree at the same time. IOWs, the attr value
> > overwrite is effectively a "set-swap-remove" operation on two
> > entirely independent xattrs, ensuring that if we crash we always
> > have either the old or new xattr visible.
> > 
> > With LARP, we can remove the original attr first, thereby avoiding
> > the need for two versions of the xattr to exist in the tree in the
> > first place. However, we have to do these two operations as a pair
> > of linked independent operations. The intent chain provides the
> > linking, and requires us to log the name and the value of the attr
> > that we are overwriting in the intent. Hence we can always recover
> > the modification to completion no matter where in the operation we
> > fail.
> > 
> > When it comes to a parent attr rename operation, we are effectively
> > doing two linked operations - remove the old attr, set the new attr
> > - on different attributes. Implementation wise, it is exactly the
> > same sequence as a "replace value" operation, except for the fact
> > that the new attr we add has a different name.
> > 
> > Hence the only real difference between the existing "attr replace"
> > and the intent chain we need for "parent attr rename" is that we
> > have to log two attr names instead of one. Basically, we have a new
> > XFS_ATTRI_OP_FLAGS... type for this, and that's what tells us that
> > we are operating on two different attributes instead of just one.
> 
> This answers my earlier question: Yes, and yes.

I see, alrighty then, I'll see if I can put together a new
XFS_ATTRI_OP_FLAGS type that carries both the old and new name.  That
sounds like it should work.  Thanks for all the feed back!

Allison


> 
> > The recovery operation becomes slightly different - we have to run
> > a
> > remove on the old, then a replace on the new - so there a little
> > bit
> > of new code needed to manage that in the state machine.
> > 
> > These, however, are just small tweaks on the existing replace attr
> > operation, and there should be little difference in performance or
> > overhead between a "replace value" and a "replace entire xattr"
> > operation as they are largely the same runtime operation for LARP.
> > 
> > > So the recovered set grows the leaf, and returns the egain, then
> > > rest
> > > gets capture committed.  Next up is the recovered remove which
> > > pulls
> > > out the fork, which causes problems when the rest of the set
> > > operation
> > > resumes as a deferred operation.
> > 
> > Yup, and all this goes away when we build the right intent chain
> > for
> > replacing a parent attr rename....
> 
> Funnily enough, just last week I had thought that online repair was
> going to require the ability to replace an entire xattr...
> 
> https://urldefense.com/v3/__https://djwong.org/docs/xfs-online-fsck-design/*parent-pointers__;Iw!!ACWV5N9M2RV99hQ!MA2KfxWZLMTj_fdJoFnvZhLIgOGsGlIclRVE39DFME755VnvyX4VqsQGM6GfBDnDXKkfAcFjdv2oENaXepic$ 
> 
> --D
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
Allison Henderson Aug. 12, 2022, 1:55 a.m. UTC | #8
On Wed, 2022-08-10 at 16:12 +1000, Dave Chinner wrote:
> On Tue, Aug 09, 2022 at 10:01:49PM -0700, Alli wrote:
> > On Wed, 2022-08-10 at 11:58 +1000, Dave Chinner wrote:
> > > On Tue, Aug 09, 2022 at 09:52:55AM -0700, Darrick J. Wong wrote:
> > > > On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison Henderson
> > > > wrote:
> > > > > Recent parent pointer testing has exposed a bug in the
> > > > > underlying
> > > > > attr replay.  A multi transaction replay currently performs a
> > > > > single step of the replay, then deferrs the rest if there is
> > > > > more
> > > > > to do.
> > > 
> > > Yup.
> > > 
> > > > > This causes race conditions with other attr replays that
> > > > > might be recovered before the remaining deferred work has had
> > > > > a
> > > > > chance to finish.
> > > 
> > > What other attr replays are we racing against?  There can only be
> > > one incomplete attr item intent/done chain per inode present in
> > > log
> > > recovery, right?
> > No, a rename queues up a set and remove before committing the
> > transaction.  One for the new parent pointer, and another to remove
> > the
> > old one.
> 
> Ah. That really needs to be described in the commit message -
> changing from "single intent chain per object" to "multiple
> concurrent independent and unserialised intent chains per object" is
> a pretty important design rule change...
> 
> The whole point of intents is to allow complex, multi-stage
> operations on a single object to be sequenced in a tightly
> controlled manner. They weren't intended to be run as concurrent
> lines of modification on single items; if you need to do two
> modifications on an object, the intent chain ties the two
> modifications together into a single whole.
> 
> One of the reasons I rewrote the attr state machine for LARP was to
> enable new multiple attr operation chains to be easily build from
> the entry points the state machien provides. Parent attr rename
> needs a new intent chain to be built, not run multiple independent
> intent chains for each modification.
> 
> > It cant be an attr replace because technically the names are
> > different.
> 
> I disagree - we have all the pieces we need in the state machine
> already, we just need to define separate attr names for the
> remove and insert steps in the attr intent.
> 
> That is, the "replace" operation we execute when an attr set
> overwrites the value is "technically" a "replace value" operation,
> but we actually implement it as a "replace entire attribute"
> operation.
> 
> Without LARP, we do that overwrite in independent steps via an
> intermediate INCOMPLETE state to allow two xattrs of the same name
> to exist in the attr tree at the same time. IOWs, the attr value
> overwrite is effectively a "set-swap-remove" operation on two
> entirely independent xattrs, ensuring that if we crash we always
> have either the old or new xattr visible.
> 
> With LARP, we can remove the original attr first, thereby avoiding
> the need for two versions of the xattr to exist in the tree in the
> first place. However, we have to do these two operations as a pair
> of linked independent operations. The intent chain provides the
> linking, and requires us to log the name and the value of the attr
> that we are overwriting in the intent. Hence we can always recover
> the modification to completion no matter where in the operation we
> fail.
> 
> When it comes to a parent attr rename operation, we are effectively
> doing two linked operations - remove the old attr, set the new attr
> - on different attributes. Implementation wise, it is exactly the
> same sequence as a "replace value" operation, except for the fact
> that the new attr we add has a different name.
> 
> Hence the only real difference between the existing "attr replace"
> and the intent chain we need for "parent attr rename" is that we
> have to log two attr names instead of one. 

To be clear, this would imply expanding xfs_attri_log_format to have
another alfi_new_name_len feild and another iovec for the attr intent
right?  Does that cause issues to change the on disk log layout after
the original has merged?  Or is that ok for things that are still
experimental? Thanks!

Allison

> Basically, we have a new
> XFS_ATTRI_OP_FLAGS... type for this, and that's what tells us that
> we are operating on two different attributes instead of just one.
> 
> The recovery operation becomes slightly different - we have to run a
> remove on the old, then a replace on the new - so there a little bit
> of new code needed to manage that in the state machine.
> 
> These, however, are just small tweaks on the existing replace attr
> operation, and there should be little difference in performance or
> overhead between a "replace value" and a "replace entire xattr"
> operation as they are largely the same runtime operation for LARP.
> 
> > So the recovered set grows the leaf, and returns the egain, then
> > rest
> > gets capture committed.  Next up is the recovered remove which
> > pulls
> > out the fork, which causes problems when the rest of the set
> > operation
> > resumes as a deferred operation.
> 
> Yup, and all this goes away when we build the right intent chain for
> replacing a parent attr rename....
> 
> Cheers,
> 
> Dave.
Darrick J. Wong Aug. 12, 2022, 3:05 a.m. UTC | #9
On Thu, Aug 11, 2022 at 06:55:16PM -0700, Alli wrote:
> On Wed, 2022-08-10 at 16:12 +1000, Dave Chinner wrote:
> > On Tue, Aug 09, 2022 at 10:01:49PM -0700, Alli wrote:
> > > On Wed, 2022-08-10 at 11:58 +1000, Dave Chinner wrote:
> > > > On Tue, Aug 09, 2022 at 09:52:55AM -0700, Darrick J. Wong wrote:
> > > > > On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison Henderson
> > > > > wrote:
> > > > > > Recent parent pointer testing has exposed a bug in the
> > > > > > underlying
> > > > > > attr replay.  A multi transaction replay currently performs a
> > > > > > single step of the replay, then deferrs the rest if there is
> > > > > > more
> > > > > > to do.
> > > > 
> > > > Yup.
> > > > 
> > > > > > This causes race conditions with other attr replays that
> > > > > > might be recovered before the remaining deferred work has had
> > > > > > a
> > > > > > chance to finish.
> > > > 
> > > > What other attr replays are we racing against?  There can only be
> > > > one incomplete attr item intent/done chain per inode present in
> > > > log
> > > > recovery, right?
> > > No, a rename queues up a set and remove before committing the
> > > transaction.  One for the new parent pointer, and another to remove
> > > the
> > > old one.
> > 
> > Ah. That really needs to be described in the commit message -
> > changing from "single intent chain per object" to "multiple
> > concurrent independent and unserialised intent chains per object" is
> > a pretty important design rule change...
> > 
> > The whole point of intents is to allow complex, multi-stage
> > operations on a single object to be sequenced in a tightly
> > controlled manner. They weren't intended to be run as concurrent
> > lines of modification on single items; if you need to do two
> > modifications on an object, the intent chain ties the two
> > modifications together into a single whole.
> > 
> > One of the reasons I rewrote the attr state machine for LARP was to
> > enable new multiple attr operation chains to be easily build from
> > the entry points the state machien provides. Parent attr rename
> > needs a new intent chain to be built, not run multiple independent
> > intent chains for each modification.
> > 
> > > It cant be an attr replace because technically the names are
> > > different.
> > 
> > I disagree - we have all the pieces we need in the state machine
> > already, we just need to define separate attr names for the
> > remove and insert steps in the attr intent.
> > 
> > That is, the "replace" operation we execute when an attr set
> > overwrites the value is "technically" a "replace value" operation,
> > but we actually implement it as a "replace entire attribute"
> > operation.
> > 
> > Without LARP, we do that overwrite in independent steps via an
> > intermediate INCOMPLETE state to allow two xattrs of the same name
> > to exist in the attr tree at the same time. IOWs, the attr value
> > overwrite is effectively a "set-swap-remove" operation on two
> > entirely independent xattrs, ensuring that if we crash we always
> > have either the old or new xattr visible.
> > 
> > With LARP, we can remove the original attr first, thereby avoiding
> > the need for two versions of the xattr to exist in the tree in the
> > first place. However, we have to do these two operations as a pair
> > of linked independent operations. The intent chain provides the
> > linking, and requires us to log the name and the value of the attr
> > that we are overwriting in the intent. Hence we can always recover
> > the modification to completion no matter where in the operation we
> > fail.
> > 
> > When it comes to a parent attr rename operation, we are effectively
> > doing two linked operations - remove the old attr, set the new attr
> > - on different attributes. Implementation wise, it is exactly the
> > same sequence as a "replace value" operation, except for the fact
> > that the new attr we add has a different name.
> > 
> > Hence the only real difference between the existing "attr replace"
> > and the intent chain we need for "parent attr rename" is that we
> > have to log two attr names instead of one. 
> 
> To be clear, this would imply expanding xfs_attri_log_format to have
> another alfi_new_name_len feild and another iovec for the attr intent
> right?  Does that cause issues to change the on disk log layout after
> the original has merged?  Or is that ok for things that are still
> experimental? Thanks!

XFS_SB_FEAT_INCOMPAT_PARENT should protect against that, since the
userspace xattr api does not support replacing an attr's name, only the
value.

--D

> Allison
> 
> > Basically, we have a new
> > XFS_ATTRI_OP_FLAGS... type for this, and that's what tells us that
> > we are operating on two different attributes instead of just one.
> > 
> > The recovery operation becomes slightly different - we have to run a
> > remove on the old, then a replace on the new - so there a little bit
> > of new code needed to manage that in the state machine.
> > 
> > These, however, are just small tweaks on the existing replace attr
> > operation, and there should be little difference in performance or
> > overhead between a "replace value" and a "replace entire xattr"
> > operation as they are largely the same runtime operation for LARP.
> > 
> > > So the recovered set grows the leaf, and returns the egain, then
> > > rest
> > > gets capture committed.  Next up is the recovered remove which
> > > pulls
> > > out the fork, which causes problems when the rest of the set
> > > operation
> > > resumes as a deferred operation.
> > 
> > Yup, and all this goes away when we build the right intent chain for
> > replacing a parent attr rename....
> > 
> > Cheers,
> > 
> > Dave.
>
Dave Chinner Aug. 16, 2022, 12:54 a.m. UTC | #10
On Thu, Aug 11, 2022 at 06:55:16PM -0700, Alli wrote:
> On Wed, 2022-08-10 at 16:12 +1000, Dave Chinner wrote:
> > On Tue, Aug 09, 2022 at 10:01:49PM -0700, Alli wrote:
> > > On Wed, 2022-08-10 at 11:58 +1000, Dave Chinner wrote:
> > > > On Tue, Aug 09, 2022 at 09:52:55AM -0700, Darrick J. Wong wrote:
> > > > > On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison Henderson
> > > > > wrote:
> > > > > > Recent parent pointer testing has exposed a bug in the
> > > > > > underlying
> > > > > > attr replay.  A multi transaction replay currently performs a
> > > > > > single step of the replay, then deferrs the rest if there is
> > > > > > more
> > > > > > to do.
> > > > 
> > > > Yup.
> > > > 
> > > > > > This causes race conditions with other attr replays that
> > > > > > might be recovered before the remaining deferred work has had
> > > > > > a
> > > > > > chance to finish.
> > > > 
> > > > What other attr replays are we racing against?  There can only be
> > > > one incomplete attr item intent/done chain per inode present in
> > > > log
> > > > recovery, right?
> > > No, a rename queues up a set and remove before committing the
> > > transaction.  One for the new parent pointer, and another to remove
> > > the
> > > old one.
> > 
> > Ah. That really needs to be described in the commit message -
> > changing from "single intent chain per object" to "multiple
> > concurrent independent and unserialised intent chains per object" is
> > a pretty important design rule change...
> > 
> > The whole point of intents is to allow complex, multi-stage
> > operations on a single object to be sequenced in a tightly
> > controlled manner. They weren't intended to be run as concurrent
> > lines of modification on single items; if you need to do two
> > modifications on an object, the intent chain ties the two
> > modifications together into a single whole.
> > 
> > One of the reasons I rewrote the attr state machine for LARP was to
> > enable new multiple attr operation chains to be easily build from
> > the entry points the state machien provides. Parent attr rename
> > needs a new intent chain to be built, not run multiple independent
> > intent chains for each modification.
> > 
> > > It cant be an attr replace because technically the names are
> > > different.
> > 
> > I disagree - we have all the pieces we need in the state machine
> > already, we just need to define separate attr names for the
> > remove and insert steps in the attr intent.
> > 
> > That is, the "replace" operation we execute when an attr set
> > overwrites the value is "technically" a "replace value" operation,
> > but we actually implement it as a "replace entire attribute"
> > operation.
> > 
> > Without LARP, we do that overwrite in independent steps via an
> > intermediate INCOMPLETE state to allow two xattrs of the same name
> > to exist in the attr tree at the same time. IOWs, the attr value
> > overwrite is effectively a "set-swap-remove" operation on two
> > entirely independent xattrs, ensuring that if we crash we always
> > have either the old or new xattr visible.
> > 
> > With LARP, we can remove the original attr first, thereby avoiding
> > the need for two versions of the xattr to exist in the tree in the
> > first place. However, we have to do these two operations as a pair
> > of linked independent operations. The intent chain provides the
> > linking, and requires us to log the name and the value of the attr
> > that we are overwriting in the intent. Hence we can always recover
> > the modification to completion no matter where in the operation we
> > fail.
> > 
> > When it comes to a parent attr rename operation, we are effectively
> > doing two linked operations - remove the old attr, set the new attr
> > - on different attributes. Implementation wise, it is exactly the
> > same sequence as a "replace value" operation, except for the fact
> > that the new attr we add has a different name.
> > 
> > Hence the only real difference between the existing "attr replace"
> > and the intent chain we need for "parent attr rename" is that we
> > have to log two attr names instead of one. 
> 
> To be clear, this would imply expanding xfs_attri_log_format to have
> another alfi_new_name_len feild and another iovec for the attr intent
> right?  Does that cause issues to change the on disk log layout after
> the original has merged?  Or is that ok for things that are still
> experimental? Thanks!

I think we can get away with this quite easily without breaking the
existing experimental code.

struct xfs_attri_log_format {
        uint16_t        alfi_type;      /* attri log item type */
        uint16_t        alfi_size;      /* size of this item */
        uint32_t        __pad;          /* pad to 64 bit aligned */
        uint64_t        alfi_id;        /* attri identifier */
        uint64_t        alfi_ino;       /* the inode for this attr operation */
        uint32_t        alfi_op_flags;  /* marks the op as a set or remove */
        uint32_t        alfi_name_len;  /* attr name length */
        uint32_t        alfi_value_len; /* attr value length */
        uint32_t        alfi_attr_filter;/* attr filter flags */
};

We have a padding field in there that is currently all zeros. Let's
make that a count of the number of {name, value} tuples that are
appended to the format. i.e.

struct xfs_attri_log_name {
        uint32_t        alfi_op_flags;  /* marks the op as a set or remove */
        uint32_t        alfi_name_len;  /* attr name length */
        uint32_t        alfi_value_len; /* attr value length */
        uint32_t        alfi_attr_filter;/* attr filter flags */
};

struct xfs_attri_log_format {
        uint16_t        alfi_type;      /* attri log item type */
        uint16_t        alfi_size;      /* size of this item */
	uint8_t		alfi_attr_cnt;	/* count of name/val pairs */
        uint8_t		__pad1;          /* pad to 64 bit aligned */
        uint16_t	__pad2;          /* pad to 64 bit aligned */
        uint64_t        alfi_id;        /* attri identifier */
        uint64_t        alfi_ino;       /* the inode for this attr operation */
	struct xfs_attri_log_name alfi_attr[]; /* attrs to operate on */
};

Basically, the size and shape of the structure has not changed, and
if alfi_attr_cnt == 0 we just treat it as if alfi_attr_cnt == 1 as
the backwards compat code for the existing code.

And then we just have as many followup regions for name/val pairs
as are defined by the alfi_attr_cnt and alfi_attr[] parts of the
structure. Each attr can have a different operation performed on
them, and they can have different filters applied so they can exist
in different namespaces, too.

SO I don't think we need a new on-disk feature bit for this
enhancement - it definitely comes under the heading of "this stuff
is experimental, this is the sort of early structure revision that
EXPERIMENTAL is supposed to cover....

Cheers,

Dave.
Darrick J. Wong Aug. 16, 2022, 5:07 a.m. UTC | #11
On Tue, Aug 16, 2022 at 10:54:38AM +1000, Dave Chinner wrote:
> On Thu, Aug 11, 2022 at 06:55:16PM -0700, Alli wrote:
> > On Wed, 2022-08-10 at 16:12 +1000, Dave Chinner wrote:
> > > On Tue, Aug 09, 2022 at 10:01:49PM -0700, Alli wrote:
> > > > On Wed, 2022-08-10 at 11:58 +1000, Dave Chinner wrote:
> > > > > On Tue, Aug 09, 2022 at 09:52:55AM -0700, Darrick J. Wong wrote:
> > > > > > On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison Henderson
> > > > > > wrote:
> > > > > > > Recent parent pointer testing has exposed a bug in the
> > > > > > > underlying
> > > > > > > attr replay.  A multi transaction replay currently performs a
> > > > > > > single step of the replay, then deferrs the rest if there is
> > > > > > > more
> > > > > > > to do.
> > > > > 
> > > > > Yup.
> > > > > 
> > > > > > > This causes race conditions with other attr replays that
> > > > > > > might be recovered before the remaining deferred work has had
> > > > > > > a
> > > > > > > chance to finish.
> > > > > 
> > > > > What other attr replays are we racing against?  There can only be
> > > > > one incomplete attr item intent/done chain per inode present in
> > > > > log
> > > > > recovery, right?
> > > > No, a rename queues up a set and remove before committing the
> > > > transaction.  One for the new parent pointer, and another to remove
> > > > the
> > > > old one.
> > > 
> > > Ah. That really needs to be described in the commit message -
> > > changing from "single intent chain per object" to "multiple
> > > concurrent independent and unserialised intent chains per object" is
> > > a pretty important design rule change...
> > > 
> > > The whole point of intents is to allow complex, multi-stage
> > > operations on a single object to be sequenced in a tightly
> > > controlled manner. They weren't intended to be run as concurrent
> > > lines of modification on single items; if you need to do two
> > > modifications on an object, the intent chain ties the two
> > > modifications together into a single whole.
> > > 
> > > One of the reasons I rewrote the attr state machine for LARP was to
> > > enable new multiple attr operation chains to be easily build from
> > > the entry points the state machien provides. Parent attr rename
> > > needs a new intent chain to be built, not run multiple independent
> > > intent chains for each modification.
> > > 
> > > > It cant be an attr replace because technically the names are
> > > > different.
> > > 
> > > I disagree - we have all the pieces we need in the state machine
> > > already, we just need to define separate attr names for the
> > > remove and insert steps in the attr intent.
> > > 
> > > That is, the "replace" operation we execute when an attr set
> > > overwrites the value is "technically" a "replace value" operation,
> > > but we actually implement it as a "replace entire attribute"
> > > operation.
> > > 
> > > Without LARP, we do that overwrite in independent steps via an
> > > intermediate INCOMPLETE state to allow two xattrs of the same name
> > > to exist in the attr tree at the same time. IOWs, the attr value
> > > overwrite is effectively a "set-swap-remove" operation on two
> > > entirely independent xattrs, ensuring that if we crash we always
> > > have either the old or new xattr visible.
> > > 
> > > With LARP, we can remove the original attr first, thereby avoiding
> > > the need for two versions of the xattr to exist in the tree in the
> > > first place. However, we have to do these two operations as a pair
> > > of linked independent operations. The intent chain provides the
> > > linking, and requires us to log the name and the value of the attr
> > > that we are overwriting in the intent. Hence we can always recover
> > > the modification to completion no matter where in the operation we
> > > fail.
> > > 
> > > When it comes to a parent attr rename operation, we are effectively
> > > doing two linked operations - remove the old attr, set the new attr
> > > - on different attributes. Implementation wise, it is exactly the
> > > same sequence as a "replace value" operation, except for the fact
> > > that the new attr we add has a different name.
> > > 
> > > Hence the only real difference between the existing "attr replace"
> > > and the intent chain we need for "parent attr rename" is that we
> > > have to log two attr names instead of one. 
> > 
> > To be clear, this would imply expanding xfs_attri_log_format to have
> > another alfi_new_name_len feild and another iovec for the attr intent
> > right?  Does that cause issues to change the on disk log layout after
> > the original has merged?  Or is that ok for things that are still
> > experimental? Thanks!
> 
> I think we can get away with this quite easily without breaking the
> existing experimental code.
> 
> struct xfs_attri_log_format {
>         uint16_t        alfi_type;      /* attri log item type */
>         uint16_t        alfi_size;      /* size of this item */
>         uint32_t        __pad;          /* pad to 64 bit aligned */
>         uint64_t        alfi_id;        /* attri identifier */
>         uint64_t        alfi_ino;       /* the inode for this attr operation */
>         uint32_t        alfi_op_flags;  /* marks the op as a set or remove */
>         uint32_t        alfi_name_len;  /* attr name length */
>         uint32_t        alfi_value_len; /* attr value length */
>         uint32_t        alfi_attr_filter;/* attr filter flags */
> };
> 
> We have a padding field in there that is currently all zeros. Let's
> make that a count of the number of {name, value} tuples that are
> appended to the format. i.e.
> 
> struct xfs_attri_log_name {
>         uint32_t        alfi_op_flags;  /* marks the op as a set or remove */
>         uint32_t        alfi_name_len;  /* attr name length */
>         uint32_t        alfi_value_len; /* attr value length */
>         uint32_t        alfi_attr_filter;/* attr filter flags */
> };
> 
> struct xfs_attri_log_format {
>         uint16_t        alfi_type;      /* attri log item type */
>         uint16_t        alfi_size;      /* size of this item */
> 	uint8_t		alfi_attr_cnt;	/* count of name/val pairs */
>         uint8_t		__pad1;          /* pad to 64 bit aligned */
>         uint16_t	__pad2;          /* pad to 64 bit aligned */
>         uint64_t        alfi_id;        /* attri identifier */
>         uint64_t        alfi_ino;       /* the inode for this attr operation */
> 	struct xfs_attri_log_name alfi_attr[]; /* attrs to operate on */
> };
> 
> Basically, the size and shape of the structure has not changed, and
> if alfi_attr_cnt == 0 we just treat it as if alfi_attr_cnt == 1 as
> the backwards compat code for the existing code.
> 
> And then we just have as many followup regions for name/val pairs
> as are defined by the alfi_attr_cnt and alfi_attr[] parts of the
> structure. Each attr can have a different operation performed on
> them, and they can have different filters applied so they can exist
> in different namespaces, too.
> 
> SO I don't think we need a new on-disk feature bit for this
> enhancement - it definitely comes under the heading of "this stuff
> is experimental, this is the sort of early structure revision that
> EXPERIMENTAL is supposed to cover....

You might even callit "alfi_extra_names" to avoid the "0 means 1" stuff.
;)

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Allison Henderson Aug. 16, 2022, 8:41 p.m. UTC | #12
On Mon, 2022-08-15 at 22:07 -0700, Darrick J. Wong wrote:
> On Tue, Aug 16, 2022 at 10:54:38AM +1000, Dave Chinner wrote:
> > On Thu, Aug 11, 2022 at 06:55:16PM -0700, Alli wrote:
> > > On Wed, 2022-08-10 at 16:12 +1000, Dave Chinner wrote:
> > > > On Tue, Aug 09, 2022 at 10:01:49PM -0700, Alli wrote:
> > > > > On Wed, 2022-08-10 at 11:58 +1000, Dave Chinner wrote:
> > > > > > On Tue, Aug 09, 2022 at 09:52:55AM -0700, Darrick J. Wong
> > > > > > wrote:
> > > > > > > On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison
> > > > > > > Henderson
> > > > > > > wrote:
> > > > > > > > Recent parent pointer testing has exposed a bug in the
> > > > > > > > underlying
> > > > > > > > attr replay.  A multi transaction replay currently
> > > > > > > > performs a
> > > > > > > > single step of the replay, then deferrs the rest if
> > > > > > > > there is
> > > > > > > > more
> > > > > > > > to do.
> > > > > > 
> > > > > > Yup.
> > > > > > 
> > > > > > > > This causes race conditions with other attr replays
> > > > > > > > that
> > > > > > > > might be recovered before the remaining deferred work
> > > > > > > > has had
> > > > > > > > a
> > > > > > > > chance to finish.
> > > > > > 
> > > > > > What other attr replays are we racing against?  There can
> > > > > > only be
> > > > > > one incomplete attr item intent/done chain per inode
> > > > > > present in
> > > > > > log
> > > > > > recovery, right?
> > > > > No, a rename queues up a set and remove before committing the
> > > > > transaction.  One for the new parent pointer, and another to
> > > > > remove
> > > > > the
> > > > > old one.
> > > > 
> > > > Ah. That really needs to be described in the commit message -
> > > > changing from "single intent chain per object" to "multiple
> > > > concurrent independent and unserialised intent chains per
> > > > object" is
> > > > a pretty important design rule change...
> > > > 
> > > > The whole point of intents is to allow complex, multi-stage
> > > > operations on a single object to be sequenced in a tightly
> > > > controlled manner. They weren't intended to be run as
> > > > concurrent
> > > > lines of modification on single items; if you need to do two
> > > > modifications on an object, the intent chain ties the two
> > > > modifications together into a single whole.
> > > > 
> > > > One of the reasons I rewrote the attr state machine for LARP
> > > > was to
> > > > enable new multiple attr operation chains to be easily build
> > > > from
> > > > the entry points the state machien provides. Parent attr rename
> > > > needs a new intent chain to be built, not run multiple
> > > > independent
> > > > intent chains for each modification.
> > > > 
> > > > > It cant be an attr replace because technically the names are
> > > > > different.
> > > > 
> > > > I disagree - we have all the pieces we need in the state
> > > > machine
> > > > already, we just need to define separate attr names for the
> > > > remove and insert steps in the attr intent.
> > > > 
> > > > That is, the "replace" operation we execute when an attr set
> > > > overwrites the value is "technically" a "replace value"
> > > > operation,
> > > > but we actually implement it as a "replace entire attribute"
> > > > operation.
> > > > 
> > > > Without LARP, we do that overwrite in independent steps via an
> > > > intermediate INCOMPLETE state to allow two xattrs of the same
> > > > name
> > > > to exist in the attr tree at the same time. IOWs, the attr
> > > > value
> > > > overwrite is effectively a "set-swap-remove" operation on two
> > > > entirely independent xattrs, ensuring that if we crash we
> > > > always
> > > > have either the old or new xattr visible.
> > > > 
> > > > With LARP, we can remove the original attr first, thereby
> > > > avoiding
> > > > the need for two versions of the xattr to exist in the tree in
> > > > the
> > > > first place. However, we have to do these two operations as a
> > > > pair
> > > > of linked independent operations. The intent chain provides the
> > > > linking, and requires us to log the name and the value of the
> > > > attr
> > > > that we are overwriting in the intent. Hence we can always
> > > > recover
> > > > the modification to completion no matter where in the operation
> > > > we
> > > > fail.
> > > > 
> > > > When it comes to a parent attr rename operation, we are
> > > > effectively
> > > > doing two linked operations - remove the old attr, set the new
> > > > attr
> > > > - on different attributes. Implementation wise, it is exactly
> > > > the
> > > > same sequence as a "replace value" operation, except for the
> > > > fact
> > > > that the new attr we add has a different name.
> > > > 
> > > > Hence the only real difference between the existing "attr
> > > > replace"
> > > > and the intent chain we need for "parent attr rename" is that
> > > > we
> > > > have to log two attr names instead of one. 
> > > 
> > > To be clear, this would imply expanding xfs_attri_log_format to
> > > have
> > > another alfi_new_name_len feild and another iovec for the attr
> > > intent
> > > right?  Does that cause issues to change the on disk log layout
> > > after
> > > the original has merged?  Or is that ok for things that are still
> > > experimental? Thanks!
> > 
> > I think we can get away with this quite easily without breaking the
> > existing experimental code.
> > 
> > struct xfs_attri_log_format {
> >         uint16_t        alfi_type;      /* attri log item type */
> >         uint16_t        alfi_size;      /* size of this item */
> >         uint32_t        __pad;          /* pad to 64 bit aligned */
> >         uint64_t        alfi_id;        /* attri identifier */
> >         uint64_t        alfi_ino;       /* the inode for this attr
> > operation */
> >         uint32_t        alfi_op_flags;  /* marks the op as a set or
> > remove */
> >         uint32_t        alfi_name_len;  /* attr name length */
> >         uint32_t        alfi_value_len; /* attr value length */
> >         uint32_t        alfi_attr_filter;/* attr filter flags */
> > };
> > 
> > We have a padding field in there that is currently all zeros. Let's
> > make that a count of the number of {name, value} tuples that are
> > appended to the format. i.e.
> > 
> > struct xfs_attri_log_name {
> >         uint32_t        alfi_op_flags;  /* marks the op as a set or
> > remove */
> >         uint32_t        alfi_name_len;  /* attr name length */
> >         uint32_t        alfi_value_len; /* attr value length */
> >         uint32_t        alfi_attr_filter;/* attr filter flags */
> > };
> > 
> > struct xfs_attri_log_format {
> >         uint16_t        alfi_type;      /* attri log item type */
> >         uint16_t        alfi_size;      /* size of this item */
> > 	uint8_t		alfi_attr_cnt;	/* count of name/val pairs
> > */
> >         uint8_t		__pad1;          /* pad to 64 bit
> > aligned */
> >         uint16_t	__pad2;          /* pad to 64 bit aligned */
> >         uint64_t        alfi_id;        /* attri identifier */
> >         uint64_t        alfi_ino;       /* the inode for this attr
> > operation */
> > 	struct xfs_attri_log_name alfi_attr[]; /* attrs to operate on
> > */
> > };
> > 
> > Basically, the size and shape of the structure has not changed, and
> > if alfi_attr_cnt == 0 we just treat it as if alfi_attr_cnt == 1 as
> > the backwards compat code for the existing code.
> > 
> > And then we just have as many followup regions for name/val pairs
> > as are defined by the alfi_attr_cnt and alfi_attr[] parts of the
> > structure. Each attr can have a different operation performed on
> > them, and they can have different filters applied so they can exist
> > in different namespaces, too.
> > 
> > SO I don't think we need a new on-disk feature bit for this
> > enhancement - it definitely comes under the heading of "this stuff
> > is experimental, this is the sort of early structure revision that
> > EXPERIMENTAL is supposed to cover....
> 
> You might even callit "alfi_extra_names" to avoid the "0 means 1"
> stuff.
> ;)
> 
> --D

Oh, I just noticed these comments this morning when I sent out the new
attri/d patch.  I'll add this changes to v2.  Please let me know if
there's anything else you'd like me to change from the v1.  Thx!

Allison

> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
Allison Henderson Aug. 19, 2022, 1:05 a.m. UTC | #13
On Tue, 2022-08-16 at 13:41 -0700, Alli wrote:
> On Mon, 2022-08-15 at 22:07 -0700, Darrick J. Wong wrote:
> > On Tue, Aug 16, 2022 at 10:54:38AM +1000, Dave Chinner wrote:
> > > On Thu, Aug 11, 2022 at 06:55:16PM -0700, Alli wrote:
> > > > On Wed, 2022-08-10 at 16:12 +1000, Dave Chinner wrote:
> > > > > On Tue, Aug 09, 2022 at 10:01:49PM -0700, Alli wrote:
> > > > > > On Wed, 2022-08-10 at 11:58 +1000, Dave Chinner wrote:
> > > > > > > On Tue, Aug 09, 2022 at 09:52:55AM -0700, Darrick J. Wong
> > > > > > > wrote:
> > > > > > > > On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison
> > > > > > > > Henderson
> > > > > > > > wrote:
> > > > > > > > > Recent parent pointer testing has exposed a bug in
> > > > > > > > > the
> > > > > > > > > underlying
> > > > > > > > > attr replay.  A multi transaction replay currently
> > > > > > > > > performs a
> > > > > > > > > single step of the replay, then deferrs the rest if
> > > > > > > > > there is
> > > > > > > > > more
> > > > > > > > > to do.
> > > > > > > 
> > > > > > > Yup.
> > > > > > > 
> > > > > > > > > This causes race conditions with other attr replays
> > > > > > > > > that
> > > > > > > > > might be recovered before the remaining deferred work
> > > > > > > > > has had
> > > > > > > > > a
> > > > > > > > > chance to finish.
> > > > > > > 
> > > > > > > What other attr replays are we racing against?  There can
> > > > > > > only be
> > > > > > > one incomplete attr item intent/done chain per inode
> > > > > > > present in
> > > > > > > log
> > > > > > > recovery, right?
> > > > > > No, a rename queues up a set and remove before committing
> > > > > > the
> > > > > > transaction.  One for the new parent pointer, and another
> > > > > > to
> > > > > > remove
> > > > > > the
> > > > > > old one.
> > > > > 
> > > > > Ah. That really needs to be described in the commit message -
> > > > > changing from "single intent chain per object" to "multiple
> > > > > concurrent independent and unserialised intent chains per
> > > > > object" is
> > > > > a pretty important design rule change...
> > > > > 
> > > > > The whole point of intents is to allow complex, multi-stage
> > > > > operations on a single object to be sequenced in a tightly
> > > > > controlled manner. They weren't intended to be run as
> > > > > concurrent
> > > > > lines of modification on single items; if you need to do two
> > > > > modifications on an object, the intent chain ties the two
> > > > > modifications together into a single whole.
> > > > > 
> > > > > One of the reasons I rewrote the attr state machine for LARP
> > > > > was to
> > > > > enable new multiple attr operation chains to be easily build
> > > > > from
> > > > > the entry points the state machien provides. Parent attr
> > > > > rename
> > > > > needs a new intent chain to be built, not run multiple
> > > > > independent
> > > > > intent chains for each modification.
> > > > > 
> > > > > > It cant be an attr replace because technically the names
> > > > > > are
> > > > > > different.
> > > > > 
> > > > > I disagree - we have all the pieces we need in the state
> > > > > machine
> > > > > already, we just need to define separate attr names for the
> > > > > remove and insert steps in the attr intent.
> > > > > 
> > > > > That is, the "replace" operation we execute when an attr set
> > > > > overwrites the value is "technically" a "replace value"
> > > > > operation,
> > > > > but we actually implement it as a "replace entire attribute"
> > > > > operation.
> > > > > 
> > > > > Without LARP, we do that overwrite in independent steps via
> > > > > an
> > > > > intermediate INCOMPLETE state to allow two xattrs of the same
> > > > > name
> > > > > to exist in the attr tree at the same time. IOWs, the attr
> > > > > value
> > > > > overwrite is effectively a "set-swap-remove" operation on two
> > > > > entirely independent xattrs, ensuring that if we crash we
> > > > > always
> > > > > have either the old or new xattr visible.
> > > > > 
> > > > > With LARP, we can remove the original attr first, thereby
> > > > > avoiding
> > > > > the need for two versions of the xattr to exist in the tree
> > > > > in
> > > > > the
> > > > > first place. However, we have to do these two operations as a
> > > > > pair
> > > > > of linked independent operations. The intent chain provides
> > > > > the
> > > > > linking, and requires us to log the name and the value of the
> > > > > attr
> > > > > that we are overwriting in the intent. Hence we can always
> > > > > recover
> > > > > the modification to completion no matter where in the
> > > > > operation
> > > > > we
> > > > > fail.
> > > > > 
> > > > > When it comes to a parent attr rename operation, we are
> > > > > effectively
> > > > > doing two linked operations - remove the old attr, set the
> > > > > new
> > > > > attr
> > > > > - on different attributes. Implementation wise, it is exactly
> > > > > the
> > > > > same sequence as a "replace value" operation, except for the
> > > > > fact
> > > > > that the new attr we add has a different name.
> > > > > 
> > > > > Hence the only real difference between the existing "attr
> > > > > replace"
> > > > > and the intent chain we need for "parent attr rename" is that
> > > > > we
> > > > > have to log two attr names instead of one. 
> > > > 
> > > > To be clear, this would imply expanding xfs_attri_log_format to
> > > > have
> > > > another alfi_new_name_len feild and another iovec for the attr
> > > > intent
> > > > right?  Does that cause issues to change the on disk log layout
> > > > after
> > > > the original has merged?  Or is that ok for things that are
> > > > still
> > > > experimental? Thanks!
> > > 
> > > I think we can get away with this quite easily without breaking
> > > the
> > > existing experimental code.
> > > 
> > > struct xfs_attri_log_format {
> > >         uint16_t        alfi_type;      /* attri log item type */
> > >         uint16_t        alfi_size;      /* size of this item */
> > >         uint32_t        __pad;          /* pad to 64 bit aligned
> > > */
> > >         uint64_t        alfi_id;        /* attri identifier */
> > >         uint64_t        alfi_ino;       /* the inode for this
> > > attr
> > > operation */
> > >         uint32_t        alfi_op_flags;  /* marks the op as a set
> > > or
> > > remove */
> > >         uint32_t        alfi_name_len;  /* attr name length */
> > >         uint32_t        alfi_value_len; /* attr value length */
> > >         uint32_t        alfi_attr_filter;/* attr filter flags */
> > > };
> > > 
> > > We have a padding field in there that is currently all zeros.
> > > Let's
> > > make that a count of the number of {name, value} tuples that are
> > > appended to the format. i.e.
> > > 
> > > struct xfs_attri_log_name {
> > >         uint32_t        alfi_op_flags;  /* marks the op as a set
> > > or
> > > remove */
> > >         uint32_t        alfi_name_len;  /* attr name length */
> > >         uint32_t        alfi_value_len; /* attr value length */
> > >         uint32_t        alfi_attr_filter;/* attr filter flags */
> > > };
> > > 
> > > struct xfs_attri_log_format {
> > >         uint16_t        alfi_type;      /* attri log item type */
> > >         uint16_t        alfi_size;      /* size of this item */
> > > 	uint8_t		alfi_attr_cnt;	/* count of name/val
> > > pairs
> > > */
> > >         uint8_t		__pad1;          /* pad to 64 bit
> > > aligned */
> > >         uint16_t	__pad2;          /* pad to 64 bit aligned */
> > >         uint64_t        alfi_id;        /* attri identifier */
> > >         uint64_t        alfi_ino;       /* the inode for this
> > > attr
> > > operation */
> > > 	struct xfs_attri_log_name alfi_attr[]; /* attrs to operate on
> > > */
> > > };
> > > 
> > > Basically, the size and shape of the structure has not changed,
> > > and
> > > if alfi_attr_cnt == 0 we just treat it as if alfi_attr_cnt == 1
> > > as
> > > the backwards compat code for the existing code.
> > > 
> > > And then we just have as many followup regions for name/val pairs
> > > as are defined by the alfi_attr_cnt and alfi_attr[] parts of the
> > > structure. Each attr can have a different operation performed on
> > > them, and they can have different filters applied so they can
> > > exist
> > > in different namespaces, too.
> > > 
> > > SO I don't think we need a new on-disk feature bit for this
> > > enhancement - it definitely comes under the heading of "this
> > > stuff
> > > is experimental, this is the sort of early structure revision
> > > that
> > > EXPERIMENTAL is supposed to cover....
> > 
> > You might even callit "alfi_extra_names" to avoid the "0 means 1"
> > stuff.
> > ;)
> > 
> > --D
> 
> Oh, I just noticed these comments this morning when I sent out the
> new
> attri/d patch.  I'll add this changes to v2.  Please let me know if
> there's anything else you'd like me to change from the v1.  Thx!
> 
> Allison

Ok, so I am part way through coding this up, and I'm getting this
feeling like this is not going to work out very well due to the size
checks for the log formats:

root@garnet:/home/achender/work_area/xfs-linux# git diff
fs/xfs/libxfs/xfs_log_format.h fs/xfs/xfs_ondisk.h
diff --git a/fs/xfs/libxfs/xfs_log_format.h
b/fs/xfs/libxfs/xfs_log_format.h
index f1ff52ebb982..5a4e700f32fc 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -922,6 +922,13 @@ struct xfs_icreate_log {
                                         XFS_ATTR_PARENT | \
                                         XFS_ATTR_INCOMPLETE)
 
+struct xfs_attri_log_name {
+       uint32_t        alfi_op_flags;  /* marks the op as a set or
remove */
+       uint32_t        alfi_name_len;  /* attr name length */
+       uint32_t        alfi_value_len; /* attr value length */
+       uint32_t        alfi_attr_filter;/* attr filter flags */
+};
+
 /*
  * This is the structure used to lay out an attr log item in the
  * log.
@@ -929,14 +936,12 @@ struct xfs_icreate_log {
 struct xfs_attri_log_format {
        uint16_t        alfi_type;      /* attri log item type */
        uint16_t        alfi_size;      /* size of this item */
-       uint32_t        __pad;          /* pad to 64 bit aligned */
+       uint8_t         alfi_extra_names;/* count of name/val pairs */
+       uint8_t         __pad1;         /* pad to 64 bit aligned */
+       uint16_t        __pad2;         /* pad to 64 bit aligned */
        uint64_t        alfi_id;        /* attri identifier */
        uint64_t        alfi_ino;       /* the inode for this attr
operation */
-       uint32_t        alfi_op_flags;  /* marks the op as a set or
remove */
-       uint32_t        alfi_name_len;  /* attr name length */
-       uint32_t        alfi_value_len; /* attr value length */
-       uint32_t        alfi_attr_filter;/* attr filter flags */
+       struct xfs_attri_log_name alfi_attr[]; /* attrs to operate on
*/
 };
 
 struct xfs_attrd_log_format {
diff --git a/fs/xfs/xfs_ondisk.h b/fs/xfs/xfs_ondisk.h
index 3e7f7eaa5b96..c040eeb88def 100644
--- a/fs/xfs/xfs_ondisk.h
+++ b/fs/xfs/xfs_ondisk.h
@@ -132,7 +132,7 @@ xfs_check_ondisk_structs(void)
        XFS_CHECK_STRUCT_SIZE(struct xfs_inode_log_format,      56);
        XFS_CHECK_STRUCT_SIZE(struct xfs_qoff_logformat,        20);
        XFS_CHECK_STRUCT_SIZE(struct xfs_trans_header,          16);
-       XFS_CHECK_STRUCT_SIZE(struct xfs_attri_log_format,      48);
+       XFS_CHECK_STRUCT_SIZE(struct xfs_attri_log_format,      24);
        XFS_CHECK_STRUCT_SIZE(struct xfs_attrd_log_format,      16);
 
        /* parent pointer ioctls */
root@garnet:/home/achender/work_area/xfs-linux# 



If the on disk size check thinks the format is 24 bytes, and then we
surprise pack an array of structs after it, isnt that going to run over
the next item?  I think anything dynamic like this has to be an nvec.
 Maybe we leave the existing alfi_* as they are so the size doesnt
change, and then if we have a value in alfi_extra_names, then we have
an extra nvec that has the array in it.  I think that would work.

FWIW, an alternate solution would be to use the pad for a second name
length, and then we get a patch that's very similar to the one I sent
out last Tues, but backward compatible.  Though it does eat the
remaining pad and wouldn't be as flexible, I cant think of an attr op
that would need more than two names either?

Let me know what people think.  Thanks!
Allison


> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com
Darrick J. Wong Aug. 23, 2022, 3:07 p.m. UTC | #14
On Thu, Aug 18, 2022 at 06:05:54PM -0700, Alli wrote:
> On Tue, 2022-08-16 at 13:41 -0700, Alli wrote:
> > On Mon, 2022-08-15 at 22:07 -0700, Darrick J. Wong wrote:
> > > On Tue, Aug 16, 2022 at 10:54:38AM +1000, Dave Chinner wrote:
> > > > On Thu, Aug 11, 2022 at 06:55:16PM -0700, Alli wrote:
> > > > > On Wed, 2022-08-10 at 16:12 +1000, Dave Chinner wrote:
> > > > > > On Tue, Aug 09, 2022 at 10:01:49PM -0700, Alli wrote:
> > > > > > > On Wed, 2022-08-10 at 11:58 +1000, Dave Chinner wrote:
> > > > > > > > On Tue, Aug 09, 2022 at 09:52:55AM -0700, Darrick J. Wong
> > > > > > > > wrote:
> > > > > > > > > On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison
> > > > > > > > > Henderson
> > > > > > > > > wrote:
> > > > > > > > > > Recent parent pointer testing has exposed a bug in
> > > > > > > > > > the
> > > > > > > > > > underlying
> > > > > > > > > > attr replay.  A multi transaction replay currently
> > > > > > > > > > performs a
> > > > > > > > > > single step of the replay, then deferrs the rest if
> > > > > > > > > > there is
> > > > > > > > > > more
> > > > > > > > > > to do.
> > > > > > > > 
> > > > > > > > Yup.
> > > > > > > > 
> > > > > > > > > > This causes race conditions with other attr replays
> > > > > > > > > > that
> > > > > > > > > > might be recovered before the remaining deferred work
> > > > > > > > > > has had
> > > > > > > > > > a
> > > > > > > > > > chance to finish.
> > > > > > > > 
> > > > > > > > What other attr replays are we racing against?  There can
> > > > > > > > only be
> > > > > > > > one incomplete attr item intent/done chain per inode
> > > > > > > > present in
> > > > > > > > log
> > > > > > > > recovery, right?
> > > > > > > No, a rename queues up a set and remove before committing
> > > > > > > the
> > > > > > > transaction.  One for the new parent pointer, and another
> > > > > > > to
> > > > > > > remove
> > > > > > > the
> > > > > > > old one.
> > > > > > 
> > > > > > Ah. That really needs to be described in the commit message -
> > > > > > changing from "single intent chain per object" to "multiple
> > > > > > concurrent independent and unserialised intent chains per
> > > > > > object" is
> > > > > > a pretty important design rule change...
> > > > > > 
> > > > > > The whole point of intents is to allow complex, multi-stage
> > > > > > operations on a single object to be sequenced in a tightly
> > > > > > controlled manner. They weren't intended to be run as
> > > > > > concurrent
> > > > > > lines of modification on single items; if you need to do two
> > > > > > modifications on an object, the intent chain ties the two
> > > > > > modifications together into a single whole.
> > > > > > 
> > > > > > One of the reasons I rewrote the attr state machine for LARP
> > > > > > was to
> > > > > > enable new multiple attr operation chains to be easily build
> > > > > > from
> > > > > > the entry points the state machien provides. Parent attr
> > > > > > rename
> > > > > > needs a new intent chain to be built, not run multiple
> > > > > > independent
> > > > > > intent chains for each modification.
> > > > > > 
> > > > > > > It cant be an attr replace because technically the names
> > > > > > > are
> > > > > > > different.
> > > > > > 
> > > > > > I disagree - we have all the pieces we need in the state
> > > > > > machine
> > > > > > already, we just need to define separate attr names for the
> > > > > > remove and insert steps in the attr intent.
> > > > > > 
> > > > > > That is, the "replace" operation we execute when an attr set
> > > > > > overwrites the value is "technically" a "replace value"
> > > > > > operation,
> > > > > > but we actually implement it as a "replace entire attribute"
> > > > > > operation.
> > > > > > 
> > > > > > Without LARP, we do that overwrite in independent steps via
> > > > > > an
> > > > > > intermediate INCOMPLETE state to allow two xattrs of the same
> > > > > > name
> > > > > > to exist in the attr tree at the same time. IOWs, the attr
> > > > > > value
> > > > > > overwrite is effectively a "set-swap-remove" operation on two
> > > > > > entirely independent xattrs, ensuring that if we crash we
> > > > > > always
> > > > > > have either the old or new xattr visible.
> > > > > > 
> > > > > > With LARP, we can remove the original attr first, thereby
> > > > > > avoiding
> > > > > > the need for two versions of the xattr to exist in the tree
> > > > > > in
> > > > > > the
> > > > > > first place. However, we have to do these two operations as a
> > > > > > pair
> > > > > > of linked independent operations. The intent chain provides
> > > > > > the
> > > > > > linking, and requires us to log the name and the value of the
> > > > > > attr
> > > > > > that we are overwriting in the intent. Hence we can always
> > > > > > recover
> > > > > > the modification to completion no matter where in the
> > > > > > operation
> > > > > > we
> > > > > > fail.
> > > > > > 
> > > > > > When it comes to a parent attr rename operation, we are
> > > > > > effectively
> > > > > > doing two linked operations - remove the old attr, set the
> > > > > > new
> > > > > > attr
> > > > > > - on different attributes. Implementation wise, it is exactly
> > > > > > the
> > > > > > same sequence as a "replace value" operation, except for the
> > > > > > fact
> > > > > > that the new attr we add has a different name.
> > > > > > 
> > > > > > Hence the only real difference between the existing "attr
> > > > > > replace"
> > > > > > and the intent chain we need for "parent attr rename" is that
> > > > > > we
> > > > > > have to log two attr names instead of one. 
> > > > > 
> > > > > To be clear, this would imply expanding xfs_attri_log_format to
> > > > > have
> > > > > another alfi_new_name_len feild and another iovec for the attr
> > > > > intent
> > > > > right?  Does that cause issues to change the on disk log layout
> > > > > after
> > > > > the original has merged?  Or is that ok for things that are
> > > > > still
> > > > > experimental? Thanks!
> > > > 
> > > > I think we can get away with this quite easily without breaking
> > > > the
> > > > existing experimental code.
> > > > 
> > > > struct xfs_attri_log_format {
> > > >         uint16_t        alfi_type;      /* attri log item type */
> > > >         uint16_t        alfi_size;      /* size of this item */
> > > >         uint32_t        __pad;          /* pad to 64 bit aligned
> > > > */
> > > >         uint64_t        alfi_id;        /* attri identifier */
> > > >         uint64_t        alfi_ino;       /* the inode for this
> > > > attr
> > > > operation */
> > > >         uint32_t        alfi_op_flags;  /* marks the op as a set
> > > > or
> > > > remove */
> > > >         uint32_t        alfi_name_len;  /* attr name length */
> > > >         uint32_t        alfi_value_len; /* attr value length */
> > > >         uint32_t        alfi_attr_filter;/* attr filter flags */
> > > > };
> > > > 
> > > > We have a padding field in there that is currently all zeros.
> > > > Let's
> > > > make that a count of the number of {name, value} tuples that are
> > > > appended to the format. i.e.
> > > > 
> > > > struct xfs_attri_log_name {
> > > >         uint32_t        alfi_op_flags;  /* marks the op as a set
> > > > or
> > > > remove */
> > > >         uint32_t        alfi_name_len;  /* attr name length */
> > > >         uint32_t        alfi_value_len; /* attr value length */
> > > >         uint32_t        alfi_attr_filter;/* attr filter flags */
> > > > };
> > > > 
> > > > struct xfs_attri_log_format {
> > > >         uint16_t        alfi_type;      /* attri log item type */
> > > >         uint16_t        alfi_size;      /* size of this item */
> > > > 	uint8_t		alfi_attr_cnt;	/* count of name/val
> > > > pairs
> > > > */
> > > >         uint8_t		__pad1;          /* pad to 64 bit
> > > > aligned */
> > > >         uint16_t	__pad2;          /* pad to 64 bit aligned */
> > > >         uint64_t        alfi_id;        /* attri identifier */
> > > >         uint64_t        alfi_ino;       /* the inode for this
> > > > attr
> > > > operation */
> > > > 	struct xfs_attri_log_name alfi_attr[]; /* attrs to operate on
> > > > */
> > > > };
> > > > 
> > > > Basically, the size and shape of the structure has not changed,
> > > > and
> > > > if alfi_attr_cnt == 0 we just treat it as if alfi_attr_cnt == 1
> > > > as
> > > > the backwards compat code for the existing code.
> > > > 
> > > > And then we just have as many followup regions for name/val pairs
> > > > as are defined by the alfi_attr_cnt and alfi_attr[] parts of the
> > > > structure. Each attr can have a different operation performed on
> > > > them, and they can have different filters applied so they can
> > > > exist
> > > > in different namespaces, too.
> > > > 
> > > > SO I don't think we need a new on-disk feature bit for this
> > > > enhancement - it definitely comes under the heading of "this
> > > > stuff
> > > > is experimental, this is the sort of early structure revision
> > > > that
> > > > EXPERIMENTAL is supposed to cover....
> > > 
> > > You might even callit "alfi_extra_names" to avoid the "0 means 1"
> > > stuff.
> > > ;)
> > > 
> > > --D
> > 
> > Oh, I just noticed these comments this morning when I sent out the
> > new
> > attri/d patch.  I'll add this changes to v2.  Please let me know if
> > there's anything else you'd like me to change from the v1.  Thx!
> > 
> > Allison
> 
> Ok, so I am part way through coding this up, and I'm getting this
> feeling like this is not going to work out very well due to the size
> checks for the log formats:
> 
> root@garnet:/home/achender/work_area/xfs-linux# git diff
> fs/xfs/libxfs/xfs_log_format.h fs/xfs/xfs_ondisk.h
> diff --git a/fs/xfs/libxfs/xfs_log_format.h
> b/fs/xfs/libxfs/xfs_log_format.h
> index f1ff52ebb982..5a4e700f32fc 100644
> --- a/fs/xfs/libxfs/xfs_log_format.h
> +++ b/fs/xfs/libxfs/xfs_log_format.h
> @@ -922,6 +922,13 @@ struct xfs_icreate_log {
>                                          XFS_ATTR_PARENT | \
>                                          XFS_ATTR_INCOMPLETE)
>  
> +struct xfs_attri_log_name {
> +       uint32_t        alfi_op_flags;  /* marks the op as a set or
> remove */
> +       uint32_t        alfi_name_len;  /* attr name length */
> +       uint32_t        alfi_value_len; /* attr value length */
> +       uint32_t        alfi_attr_filter;/* attr filter flags */
> +};
> +
>  /*
>   * This is the structure used to lay out an attr log item in the
>   * log.
> @@ -929,14 +936,12 @@ struct xfs_icreate_log {
>  struct xfs_attri_log_format {
>         uint16_t        alfi_type;      /* attri log item type */
>         uint16_t        alfi_size;      /* size of this item */
> -       uint32_t        __pad;          /* pad to 64 bit aligned */
> +       uint8_t         alfi_extra_names;/* count of name/val pairs */
> +       uint8_t         __pad1;         /* pad to 64 bit aligned */
> +       uint16_t        __pad2;         /* pad to 64 bit aligned */
>         uint64_t        alfi_id;        /* attri identifier */
>         uint64_t        alfi_ino;       /* the inode for this attr
> operation */
> -       uint32_t        alfi_op_flags;  /* marks the op as a set or
> remove */
> -       uint32_t        alfi_name_len;  /* attr name length */
> -       uint32_t        alfi_value_len; /* attr value length */
> -       uint32_t        alfi_attr_filter;/* attr filter flags */
> +       struct xfs_attri_log_name alfi_attr[]; /* attrs to operate on

What's the length of this VLA?  1 for a normal SET or REPLACE
operation, and 2 for the "rename and replace value" operation?

If so, why do we need two xfs_attri_log_name structures?  The old value
is unimportant, so we only need one alfi_value_len per operation.  Each
xfs_attri_log_format only describes one change, so it only needs one
alfi_op_flags per op.

For now I also don't think attributes should be able to jump namespaces,
so we'd only need one alfi_attr_filter per op as well.

*lightbulb comes on*  Oops, I think I led you astray with my unfortunate
comment. :(

IOWs, the only change to struct xfs_attri_log_format is:

-       uint32_t        __pad;          /* pad to 64 bit aligned */
+       uint32_t        alfi_new_namelen;/* new attr name length */

and the rest of the changes in "[PATCH] xfs: Add new name to attri/d"
are more or less fine as is.

I'll go reply to that before I get back to Dave's log accounting stuff.

--D

> */
>  };
>  
>  struct xfs_attrd_log_format {
> diff --git a/fs/xfs/xfs_ondisk.h b/fs/xfs/xfs_ondisk.h
> index 3e7f7eaa5b96..c040eeb88def 100644
> --- a/fs/xfs/xfs_ondisk.h
> +++ b/fs/xfs/xfs_ondisk.h
> @@ -132,7 +132,7 @@ xfs_check_ondisk_structs(void)
>         XFS_CHECK_STRUCT_SIZE(struct xfs_inode_log_format,      56);
>         XFS_CHECK_STRUCT_SIZE(struct xfs_qoff_logformat,        20);
>         XFS_CHECK_STRUCT_SIZE(struct xfs_trans_header,          16);
> -       XFS_CHECK_STRUCT_SIZE(struct xfs_attri_log_format,      48);
> +       XFS_CHECK_STRUCT_SIZE(struct xfs_attri_log_format,      24);
>         XFS_CHECK_STRUCT_SIZE(struct xfs_attrd_log_format,      16);
>  
>         /* parent pointer ioctls */
> root@garnet:/home/achender/work_area/xfs-linux# 
> 
> 
> 
> If the on disk size check thinks the format is 24 bytes, and then we
> surprise pack an array of structs after it, isnt that going to run over
> the next item?  I think anything dynamic like this has to be an nvec.
>  Maybe we leave the existing alfi_* as they are so the size doesnt
> change, and then if we have a value in alfi_extra_names, then we have
> an extra nvec that has the array in it.  I think that would work.
> 
> FWIW, an alternate solution would be to use the pad for a second name
> length, and then we get a patch that's very similar to the one I sent
> out last Tues, but backward compatible.  Though it does eat the
> remaining pad and wouldn't be as flexible, I cant think of an attr op
> that would need more than two names either?
> 
> Let me know what people think.  Thanks!
> Allison
> 
> 
> > > > Cheers,
> > > > 
> > > > Dave.
> > > > -- 
> > > > Dave Chinner
> > > > david@fromorbit.com
>
Allison Henderson Aug. 24, 2022, 6:47 p.m. UTC | #15
On Tue, 2022-08-23 at 08:07 -0700, Darrick J. Wong wrote:
> On Thu, Aug 18, 2022 at 06:05:54PM -0700, Alli wrote:
> > On Tue, 2022-08-16 at 13:41 -0700, Alli wrote:
> > > On Mon, 2022-08-15 at 22:07 -0700, Darrick J. Wong wrote:
> > > > On Tue, Aug 16, 2022 at 10:54:38AM +1000, Dave Chinner wrote:
> > > > > On Thu, Aug 11, 2022 at 06:55:16PM -0700, Alli wrote:
> > > > > > On Wed, 2022-08-10 at 16:12 +1000, Dave Chinner wrote:
> > > > > > > On Tue, Aug 09, 2022 at 10:01:49PM -0700, Alli wrote:
> > > > > > > > On Wed, 2022-08-10 at 11:58 +1000, Dave Chinner wrote:
> > > > > > > > > On Tue, Aug 09, 2022 at 09:52:55AM -0700, Darrick J.
> > > > > > > > > Wong
> > > > > > > > > wrote:
> > > > > > > > > > On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison
> > > > > > > > > > Henderson
> > > > > > > > > > wrote:
> > > > > > > > > > > Recent parent pointer testing has exposed a bug
> > > > > > > > > > > in
> > > > > > > > > > > the
> > > > > > > > > > > underlying
> > > > > > > > > > > attr replay.  A multi transaction replay
> > > > > > > > > > > currently
> > > > > > > > > > > performs a
> > > > > > > > > > > single step of the replay, then deferrs the rest
> > > > > > > > > > > if
> > > > > > > > > > > there is
> > > > > > > > > > > more
> > > > > > > > > > > to do.
> > > > > > > > > 
> > > > > > > > > Yup.
> > > > > > > > > 
> > > > > > > > > > > This causes race conditions with other attr
> > > > > > > > > > > replays
> > > > > > > > > > > that
> > > > > > > > > > > might be recovered before the remaining deferred
> > > > > > > > > > > work
> > > > > > > > > > > has had
> > > > > > > > > > > a
> > > > > > > > > > > chance to finish.
> > > > > > > > > 
> > > > > > > > > What other attr replays are we racing against?  There
> > > > > > > > > can
> > > > > > > > > only be
> > > > > > > > > one incomplete attr item intent/done chain per inode
> > > > > > > > > present in
> > > > > > > > > log
> > > > > > > > > recovery, right?
> > > > > > > > No, a rename queues up a set and remove before
> > > > > > > > committing
> > > > > > > > the
> > > > > > > > transaction.  One for the new parent pointer, and
> > > > > > > > another
> > > > > > > > to
> > > > > > > > remove
> > > > > > > > the
> > > > > > > > old one.
> > > > > > > 
> > > > > > > Ah. That really needs to be described in the commit
> > > > > > > message -
> > > > > > > changing from "single intent chain per object" to
> > > > > > > "multiple
> > > > > > > concurrent independent and unserialised intent chains per
> > > > > > > object" is
> > > > > > > a pretty important design rule change...
> > > > > > > 
> > > > > > > The whole point of intents is to allow complex, multi-
> > > > > > > stage
> > > > > > > operations on a single object to be sequenced in a
> > > > > > > tightly
> > > > > > > controlled manner. They weren't intended to be run as
> > > > > > > concurrent
> > > > > > > lines of modification on single items; if you need to do
> > > > > > > two
> > > > > > > modifications on an object, the intent chain ties the two
> > > > > > > modifications together into a single whole.
> > > > > > > 
> > > > > > > One of the reasons I rewrote the attr state machine for
> > > > > > > LARP
> > > > > > > was to
> > > > > > > enable new multiple attr operation chains to be easily
> > > > > > > build
> > > > > > > from
> > > > > > > the entry points the state machien provides. Parent attr
> > > > > > > rename
> > > > > > > needs a new intent chain to be built, not run multiple
> > > > > > > independent
> > > > > > > intent chains for each modification.
> > > > > > > 
> > > > > > > > It cant be an attr replace because technically the
> > > > > > > > names
> > > > > > > > are
> > > > > > > > different.
> > > > > > > 
> > > > > > > I disagree - we have all the pieces we need in the state
> > > > > > > machine
> > > > > > > already, we just need to define separate attr names for
> > > > > > > the
> > > > > > > remove and insert steps in the attr intent.
> > > > > > > 
> > > > > > > That is, the "replace" operation we execute when an attr
> > > > > > > set
> > > > > > > overwrites the value is "technically" a "replace value"
> > > > > > > operation,
> > > > > > > but we actually implement it as a "replace entire
> > > > > > > attribute"
> > > > > > > operation.
> > > > > > > 
> > > > > > > Without LARP, we do that overwrite in independent steps
> > > > > > > via
> > > > > > > an
> > > > > > > intermediate INCOMPLETE state to allow two xattrs of the
> > > > > > > same
> > > > > > > name
> > > > > > > to exist in the attr tree at the same time. IOWs, the
> > > > > > > attr
> > > > > > > value
> > > > > > > overwrite is effectively a "set-swap-remove" operation on
> > > > > > > two
> > > > > > > entirely independent xattrs, ensuring that if we crash we
> > > > > > > always
> > > > > > > have either the old or new xattr visible.
> > > > > > > 
> > > > > > > With LARP, we can remove the original attr first, thereby
> > > > > > > avoiding
> > > > > > > the need for two versions of the xattr to exist in the
> > > > > > > tree
> > > > > > > in
> > > > > > > the
> > > > > > > first place. However, we have to do these two operations
> > > > > > > as a
> > > > > > > pair
> > > > > > > of linked independent operations. The intent chain
> > > > > > > provides
> > > > > > > the
> > > > > > > linking, and requires us to log the name and the value of
> > > > > > > the
> > > > > > > attr
> > > > > > > that we are overwriting in the intent. Hence we can
> > > > > > > always
> > > > > > > recover
> > > > > > > the modification to completion no matter where in the
> > > > > > > operation
> > > > > > > we
> > > > > > > fail.
> > > > > > > 
> > > > > > > When it comes to a parent attr rename operation, we are
> > > > > > > effectively
> > > > > > > doing two linked operations - remove the old attr, set
> > > > > > > the
> > > > > > > new
> > > > > > > attr
> > > > > > > - on different attributes. Implementation wise, it is
> > > > > > > exactly
> > > > > > > the
> > > > > > > same sequence as a "replace value" operation, except for
> > > > > > > the
> > > > > > > fact
> > > > > > > that the new attr we add has a different name.
> > > > > > > 
> > > > > > > Hence the only real difference between the existing "attr
> > > > > > > replace"
> > > > > > > and the intent chain we need for "parent attr rename" is
> > > > > > > that
> > > > > > > we
> > > > > > > have to log two attr names instead of one. 
> > > > > > 
> > > > > > To be clear, this would imply expanding
> > > > > > xfs_attri_log_format to
> > > > > > have
> > > > > > another alfi_new_name_len feild and another iovec for the
> > > > > > attr
> > > > > > intent
> > > > > > right?  Does that cause issues to change the on disk log
> > > > > > layout
> > > > > > after
> > > > > > the original has merged?  Or is that ok for things that are
> > > > > > still
> > > > > > experimental? Thanks!
> > > > > 
> > > > > I think we can get away with this quite easily without
> > > > > breaking
> > > > > the
> > > > > existing experimental code.
> > > > > 
> > > > > struct xfs_attri_log_format {
> > > > >         uint16_t        alfi_type;      /* attri log item
> > > > > type */
> > > > >         uint16_t        alfi_size;      /* size of this item
> > > > > */
> > > > >         uint32_t        __pad;          /* pad to 64 bit
> > > > > aligned
> > > > > */
> > > > >         uint64_t        alfi_id;        /* attri identifier
> > > > > */
> > > > >         uint64_t        alfi_ino;       /* the inode for this
> > > > > attr
> > > > > operation */
> > > > >         uint32_t        alfi_op_flags;  /* marks the op as a
> > > > > set
> > > > > or
> > > > > remove */
> > > > >         uint32_t        alfi_name_len;  /* attr name length
> > > > > */
> > > > >         uint32_t        alfi_value_len; /* attr value length
> > > > > */
> > > > >         uint32_t        alfi_attr_filter;/* attr filter flags
> > > > > */
> > > > > };
> > > > > 
> > > > > We have a padding field in there that is currently all zeros.
> > > > > Let's
> > > > > make that a count of the number of {name, value} tuples that
> > > > > are
> > > > > appended to the format. i.e.
> > > > > 
> > > > > struct xfs_attri_log_name {
> > > > >         uint32_t        alfi_op_flags;  /* marks the op as a
> > > > > set
> > > > > or
> > > > > remove */
> > > > >         uint32_t        alfi_name_len;  /* attr name length
> > > > > */
> > > > >         uint32_t        alfi_value_len; /* attr value length
> > > > > */
> > > > >         uint32_t        alfi_attr_filter;/* attr filter flags
> > > > > */
> > > > > };
> > > > > 
> > > > > struct xfs_attri_log_format {
> > > > >         uint16_t        alfi_type;      /* attri log item
> > > > > type */
> > > > >         uint16_t        alfi_size;      /* size of this item
> > > > > */
> > > > > 	uint8_t		alfi_attr_cnt;	/* count of name/val
> > > > > pairs
> > > > > */
> > > > >         uint8_t		__pad1;          /* pad to 64
> > > > > bit
> > > > > aligned */
> > > > >         uint16_t	__pad2;          /* pad to 64 bit
> > > > > aligned */
> > > > >         uint64_t        alfi_id;        /* attri identifier
> > > > > */
> > > > >         uint64_t        alfi_ino;       /* the inode for this
> > > > > attr
> > > > > operation */
> > > > > 	struct xfs_attri_log_name alfi_attr[]; /* attrs to
> > > > > operate on
> > > > > */
> > > > > };
> > > > > 
> > > > > Basically, the size and shape of the structure has not
> > > > > changed,
> > > > > and
> > > > > if alfi_attr_cnt == 0 we just treat it as if alfi_attr_cnt ==
> > > > > 1
> > > > > as
> > > > > the backwards compat code for the existing code.
> > > > > 
> > > > > And then we just have as many followup regions for name/val
> > > > > pairs
> > > > > as are defined by the alfi_attr_cnt and alfi_attr[] parts of
> > > > > the
> > > > > structure. Each attr can have a different operation performed
> > > > > on
> > > > > them, and they can have different filters applied so they can
> > > > > exist
> > > > > in different namespaces, too.
> > > > > 
> > > > > SO I don't think we need a new on-disk feature bit for this
> > > > > enhancement - it definitely comes under the heading of "this
> > > > > stuff
> > > > > is experimental, this is the sort of early structure revision
> > > > > that
> > > > > EXPERIMENTAL is supposed to cover....
> > > > 
> > > > You might even callit "alfi_extra_names" to avoid the "0 means
> > > > 1"
> > > > stuff.
> > > > ;)
> > > > 
> > > > --D
> > > 
> > > Oh, I just noticed these comments this morning when I sent out
> > > the
> > > new
> > > attri/d patch.  I'll add this changes to v2.  Please let me know
> > > if
> > > there's anything else you'd like me to change from the v1.  Thx!
> > > 
> > > Allison
> > 
> > Ok, so I am part way through coding this up, and I'm getting this
> > feeling like this is not going to work out very well due to the
> > size
> > checks for the log formats:
> > 
> > root@garnet:/home/achender/work_area/xfs-linux# git diff
> > fs/xfs/libxfs/xfs_log_format.h fs/xfs/xfs_ondisk.h
> > diff --git a/fs/xfs/libxfs/xfs_log_format.h
> > b/fs/xfs/libxfs/xfs_log_format.h
> > index f1ff52ebb982..5a4e700f32fc 100644
> > --- a/fs/xfs/libxfs/xfs_log_format.h
> > +++ b/fs/xfs/libxfs/xfs_log_format.h
> > @@ -922,6 +922,13 @@ struct xfs_icreate_log {
> >                                          XFS_ATTR_PARENT | \
> >                                          XFS_ATTR_INCOMPLETE)
> >  
> > +struct xfs_attri_log_name {
> > +       uint32_t        alfi_op_flags;  /* marks the op as a set or
> > remove */
> > +       uint32_t        alfi_name_len;  /* attr name length */
> > +       uint32_t        alfi_value_len; /* attr value length */
> > +       uint32_t        alfi_attr_filter;/* attr filter flags */
> > +};
> > +
> >  /*
> >   * This is the structure used to lay out an attr log item in the
> >   * log.
> > @@ -929,14 +936,12 @@ struct xfs_icreate_log {
> >  struct xfs_attri_log_format {
> >         uint16_t        alfi_type;      /* attri log item type */
> >         uint16_t        alfi_size;      /* size of this item */
> > -       uint32_t        __pad;          /* pad to 64 bit aligned */
> > +       uint8_t         alfi_extra_names;/* count of name/val pairs
> > */
> > +       uint8_t         __pad1;         /* pad to 64 bit aligned */
> > +       uint16_t        __pad2;         /* pad to 64 bit aligned */
> >         uint64_t        alfi_id;        /* attri identifier */
> >         uint64_t        alfi_ino;       /* the inode for this attr
> > operation */
> > -       uint32_t        alfi_op_flags;  /* marks the op as a set or
> > remove */
> > -       uint32_t        alfi_name_len;  /* attr name length */
> > -       uint32_t        alfi_value_len; /* attr value length */
> > -       uint32_t        alfi_attr_filter;/* attr filter flags */
> > +       struct xfs_attri_log_name alfi_attr[]; /* attrs to operate
> > on
> 
> What's the length of this VLA?  1 for a normal SET or REPLACE
> operation, and 2 for the "rename and replace value" operation?
> 
> If so, why do we need two xfs_attri_log_name structures?  The old
> value
> is unimportant, so we only need one alfi_value_len per
> operation.  Each
> xfs_attri_log_format only describes one change, so it only needs one
> alfi_op_flags per op.
> 
> For now I also don't think attributes should be able to jump
> namespaces,
> so we'd only need one alfi_attr_filter per op as well.
> 
> *lightbulb comes on*  Oops, I think I led you astray with my
> unfortunate
> comment. :(
> 
> IOWs, the only change to struct xfs_attri_log_format is:
> 
> -       uint32_t        __pad;          /* pad to 64 bit aligned */
> +       uint32_t        alfi_new_namelen;/* new attr name length */
> 
> and the rest of the changes in "[PATCH] xfs: Add new name to attri/d"
> are more or less fine as is.
> 
> I'll go reply to that before I get back to Dave's log accounting
> stuff.
> 
> --D
Alrighty, I think thats the simplest solution for now.  Will switch to
that thread....

> 
> > */
> >  };
> >  
> >  struct xfs_attrd_log_format {
> > diff --git a/fs/xfs/xfs_ondisk.h b/fs/xfs/xfs_ondisk.h
> > index 3e7f7eaa5b96..c040eeb88def 100644
> > --- a/fs/xfs/xfs_ondisk.h
> > +++ b/fs/xfs/xfs_ondisk.h
> > @@ -132,7 +132,7 @@ xfs_check_ondisk_structs(void)
> >         XFS_CHECK_STRUCT_SIZE(struct
> > xfs_inode_log_format,      56);
> >         XFS_CHECK_STRUCT_SIZE(struct
> > xfs_qoff_logformat,        20);
> >         XFS_CHECK_STRUCT_SIZE(struct
> > xfs_trans_header,          16);
> > -       XFS_CHECK_STRUCT_SIZE(struct
> > xfs_attri_log_format,      48);
> > +       XFS_CHECK_STRUCT_SIZE(struct
> > xfs_attri_log_format,      24);
> >         XFS_CHECK_STRUCT_SIZE(struct
> > xfs_attrd_log_format,      16);
> >  
> >         /* parent pointer ioctls */
> > root@garnet:/home/achender/work_area/xfs-linux# 
> > 
> > 
> > 
> > If the on disk size check thinks the format is 24 bytes, and then
> > we
> > surprise pack an array of structs after it, isnt that going to run
> > over
> > the next item?  I think anything dynamic like this has to be an
> > nvec.
> >  Maybe we leave the existing alfi_* as they are so the size doesnt
> > change, and then if we have a value in alfi_extra_names, then we
> > have
> > an extra nvec that has the array in it.  I think that would work.
> > 
> > FWIW, an alternate solution would be to use the pad for a second
> > name
> > length, and then we get a patch that's very similar to the one I
> > sent
> > out last Tues, but backward compatible.  Though it does eat the
> > remaining pad and wouldn't be as flexible, I cant think of an attr
> > op
> > that would need more than two names either?
> > 
> > Let me know what people think.  Thanks!
> > Allison
> > 
> > 
> > > > > Cheers,
> > > > > 
> > > > > Dave.
> > > > > -- 
> > > > > Dave Chinner
> > > > > david@fromorbit.com
diff mbox series

Patch

diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c
index 5077a7ad5646..c13d724a3e13 100644
--- a/fs/xfs/xfs_attr_item.c
+++ b/fs/xfs/xfs_attr_item.c
@@ -635,52 +635,33 @@  xfs_attri_item_recover(
 		break;
 	case XFS_ATTRI_OP_FLAGS_REMOVE:
 		if (!xfs_inode_hasattr(args->dp))
-			goto out;
+			return 0;
 		attr->xattri_dela_state = xfs_attr_init_remove_state(args);
 		break;
 	default:
 		ASSERT(0);
-		error = -EFSCORRUPTED;
-		goto out;
+		return -EFSCORRUPTED;
 	}
 
 	xfs_init_attr_trans(args, &tres, &total);
 	error = xfs_trans_alloc(mp, &tres, total, 0, XFS_TRANS_RESERVE, &tp);
 	if (error)
-		goto out;
+		return error;
 
 	args->trans = tp;
 	done_item = xfs_trans_get_attrd(tp, attrip);
+	args->trans->t_flags |= XFS_TRANS_HAS_INTENT_DONE;
+	set_bit(XFS_LI_DIRTY, &done_item->attrd_item.li_flags);
 
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	xfs_trans_ijoin(tp, ip, 0);
 
-	error = xfs_xattri_finish_update(attr, done_item);
-	if (error == -EAGAIN) {
-		/*
-		 * There's more work to do, so add the intent item to this
-		 * transaction so that we can continue it later.
-		 */
-		xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list);
-		error = xfs_defer_ops_capture_and_commit(tp, capture_list);
-		if (error)
-			goto out_unlock;
-
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
-		xfs_irele(ip);
-		return 0;
-	}
-	if (error) {
-		xfs_trans_cancel(tp);
-		goto out_unlock;
-	}
-
+	xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list);
 	error = xfs_defer_ops_capture_and_commit(tp, capture_list);
-out_unlock:
+
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	xfs_irele(ip);
-out:
-	xfs_attr_free_item(attr);
+
 	return error;
 }