Message ID | 20220804194013.99237-2-allison.henderson@oracle.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Parent Pointers | expand |
On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison Henderson wrote: > Recent parent pointer testing has exposed a bug in the underlying > attr replay. A multi transaction replay currently performs a > single step of the replay, then deferrs the rest if there is more > to do. This causes race conditions with other attr replays that > might be recovered before the remaining deferred work has had a > chance to finish. This can lead to interleaved set and remove > operations that may clobber the attribute fork. Fix this by > deferring all work for any attribute operation. > > Signed-off-by: Allison Henderson <allison.henderson@oracle.com> > --- > fs/xfs/xfs_attr_item.c | 35 ++++++++--------------------------- > 1 file changed, 8 insertions(+), 27 deletions(-) > > diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c > index 5077a7ad5646..c13d724a3e13 100644 > --- a/fs/xfs/xfs_attr_item.c > +++ b/fs/xfs/xfs_attr_item.c > @@ -635,52 +635,33 @@ xfs_attri_item_recover( > break; > case XFS_ATTRI_OP_FLAGS_REMOVE: > if (!xfs_inode_hasattr(args->dp)) > - goto out; > + return 0; > attr->xattri_dela_state = xfs_attr_init_remove_state(args); > break; > default: > ASSERT(0); > - error = -EFSCORRUPTED; > - goto out; > + return -EFSCORRUPTED; > } > > xfs_init_attr_trans(args, &tres, &total); > error = xfs_trans_alloc(mp, &tres, total, 0, XFS_TRANS_RESERVE, &tp); > if (error) > - goto out; > + return error; > > args->trans = tp; > done_item = xfs_trans_get_attrd(tp, attrip); > + args->trans->t_flags |= XFS_TRANS_HAS_INTENT_DONE; > + set_bit(XFS_LI_DIRTY, &done_item->attrd_item.li_flags); > > xfs_ilock(ip, XFS_ILOCK_EXCL); > xfs_trans_ijoin(tp, ip, 0); > > - error = xfs_xattri_finish_update(attr, done_item); > - if (error == -EAGAIN) { > - /* > - * There's more work to do, so add the intent item to this > - * transaction so that we can continue it later. > - */ > - xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list); > - error = xfs_defer_ops_capture_and_commit(tp, capture_list); > - if (error) > - goto out_unlock; > - > - xfs_iunlock(ip, XFS_ILOCK_EXCL); > - xfs_irele(ip); > - return 0; > - } > - if (error) { > - xfs_trans_cancel(tp); > - goto out_unlock; > - } > - > + xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list); This seems a little convoluted to me. Maybe? Maybe not? 1. Log recovery recreates an incore xfs_attri_log_item from what it finds in the log. 2. This function then logs an xattrd for the recovered xattri item. 3. Then it creates a new xfs_attr_intent to complete the operation. 4. Finally, it calls xfs_defer_ops_capture_and_commit, which logs a new xattri for the intent created in step 3 and also commits the xattrd for the first xattri. IOWs, the only difference between before and after is that we're not advancing one more step through the state machine as part of log recovery. From the perspective of the log, the recovery function merely replaces the recovered xattri log item with a new one. Why can't we just attach the recovered xattri to the xfs_defer_pending that is created to point to the xfs_attr_intent that's created in step 3, and skip the xattrd? I /think/ the answer to that question is that we might need to move the log tail forward to free enough log space to finish the intent items, so creating the extra xattrd/xattri (a) avoid the complexity of submitting an incore intent item *and* a log intent item to the defer ops machinery; and (b) avoid livelocks in log recovery. Therefore, we actually need to do it this way. IOWS, I *think* this is ok, but want to see if others have differing perspectives on how log item recovery works? --D > error = xfs_defer_ops_capture_and_commit(tp, capture_list); > -out_unlock: > + > xfs_iunlock(ip, XFS_ILOCK_EXCL); > xfs_irele(ip); > -out: > - xfs_attr_free_item(attr); > + > return error; > } > > -- > 2.25.1 >
On Tue, Aug 09, 2022 at 09:52:55AM -0700, Darrick J. Wong wrote: > On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison Henderson wrote: > > Recent parent pointer testing has exposed a bug in the underlying > > attr replay. A multi transaction replay currently performs a > > single step of the replay, then deferrs the rest if there is more > > to do. Yup. > > This causes race conditions with other attr replays that > > might be recovered before the remaining deferred work has had a > > chance to finish. What other attr replays are we racing against? There can only be one incomplete attr item intent/done chain per inode present in log recovery, right? > > This can lead to interleaved set and remove > > operations that may clobber the attribute fork. Fix this by > > deferring all work for any attribute operation. Which means this should be an impossible situation. That is, if we crash before the final attrd DONE intent is written to the log, it means that new attr intents for modifications made *after* the current attr modification was completed will not be present in the log. We have strict ordering of committed operations in the journal, hence an operation on an inode has an incomplete intent *must* be the last operation and the *only* incomplete intent that is found in the journal for that inode. Hence from an operational ordering persepective, this explanation for issue being seen doesn't make any sense to me. If there are multiple incomplete attri intents then we've either got a runtime journalling problem (a white-out issue? failing to relog the inode in each new intent?) or a log recovery problem (failing to match intent-done pairs correctly?), not a recovery deferral issue. Hence I think we're still looking for the root cause of this problem... > > Signed-off-by: Allison Henderson <allison.henderson@oracle.com> > > --- > > fs/xfs/xfs_attr_item.c | 35 ++++++++--------------------------- > > 1 file changed, 8 insertions(+), 27 deletions(-) > > > > diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c > > index 5077a7ad5646..c13d724a3e13 100644 > > --- a/fs/xfs/xfs_attr_item.c > > +++ b/fs/xfs/xfs_attr_item.c > > @@ -635,52 +635,33 @@ xfs_attri_item_recover( > > break; > > case XFS_ATTRI_OP_FLAGS_REMOVE: > > if (!xfs_inode_hasattr(args->dp)) > > - goto out; > > + return 0; > > attr->xattri_dela_state = xfs_attr_init_remove_state(args); > > break; > > default: > > ASSERT(0); > > - error = -EFSCORRUPTED; > > - goto out; > > + return -EFSCORRUPTED; > > } > > > > xfs_init_attr_trans(args, &tres, &total); > > error = xfs_trans_alloc(mp, &tres, total, 0, XFS_TRANS_RESERVE, &tp); > > if (error) > > - goto out; > > + return error; > > > > args->trans = tp; > > done_item = xfs_trans_get_attrd(tp, attrip); > > + args->trans->t_flags |= XFS_TRANS_HAS_INTENT_DONE; > > + set_bit(XFS_LI_DIRTY, &done_item->attrd_item.li_flags); > > > > xfs_ilock(ip, XFS_ILOCK_EXCL); > > xfs_trans_ijoin(tp, ip, 0); > > > > - error = xfs_xattri_finish_update(attr, done_item); > > - if (error == -EAGAIN) { > > - /* > > - * There's more work to do, so add the intent item to this > > - * transaction so that we can continue it later. > > - */ > > - xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list); > > - error = xfs_defer_ops_capture_and_commit(tp, capture_list); > > - if (error) > > - goto out_unlock; > > - > > - xfs_iunlock(ip, XFS_ILOCK_EXCL); > > - xfs_irele(ip); > > - return 0; > > - } > > - if (error) { > > - xfs_trans_cancel(tp); > > - goto out_unlock; > > - } > > - > > + xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list); > > This seems a little convoluted to me. Maybe? Maybe not? > > 1. Log recovery recreates an incore xfs_attri_log_item from what it > finds in the log. > > 2. This function then logs an xattrd for the recovered xattri item. > > 3. Then it creates a new xfs_attr_intent to complete the operation. > > 4. Finally, it calls xfs_defer_ops_capture_and_commit, which logs a new > xattri for the intent created in step 3 and also commits the xattrd for > the first xattri. > > IOWs, the only difference between before and after is that we're not > advancing one more step through the state machine as part of log > recovery. From the perspective of the log, the recovery function merely > replaces the recovered xattri log item with a new one. > > Why can't we just attach the recovered xattri to the xfs_defer_pending > that is created to point to the xfs_attr_intent that's created in step > 3, and skip the xattrd? Remember that attribute intents are different to all other intent types that we have. The existing extent based intents define a single indepedent operation that needs to be performed, and each step of the intent chain is completely independent of the previous step in the chain. e.g. removing the extent from the rmap btree is completely independent of removing it from the inode bmap btree - all that matters is that the removal from the bmbt happens first. The rmapbt removal can happen at any time after than, and is completely independent of any other bmbt or rmapbt operation. Similarly, the EFI can processed independently of all bmapbt and rmapbt modifications, it just has to happen after those modifications are done. Hence if we crash during recovery, we can just restart from where-ever we got to in the middle of the intent chains and not have to care at all. IOWs, eventual consistency works with these chains because there is no dependencies between each step of the intent chain and each step is completely independent of the other steps. Attribute intent chains are completely different. They link steps in a state machine together in a non-trivial, highly dependent chain. We can't just restart the chain in the middle like we can for the BUI->RUI->CUI->EFI chain because the on-disk attribute is in an unknown state and recovering that exact state is .... complex. Hence the the first step of recovery is to return the attribute we are trying to modify back to a known state. That means we have to perform a removal of any existing attribute under that name first. Hence this first step should be replacing the existing attr intent with the intent that defines the recovery operation we are going to perform. That means we need to translate set to replace so that cleanup is run first, replace needs to clean up the attr under that name regardless of whether it has the incomplete bit set on it or not. Remove is the only operation that runs the same as at runtime, as cleanup for remove is just repeating the remove operation from scratch. > I /think/ the answer to that question is that we might need to move the > log tail forward to free enough log space to finish the intent items, so > creating the extra xattrd/xattri (a) avoid the complexity of submitting > an incore intent item *and* a log intent item to the defer ops > machinery; and (b) avoid livelocks in log recovery. Therefore, we > actually need to do it this way. We really need the initial operation to rewrite the intent to match the recovery operation we are going to perform. Everything else is secondary. Cheers, Dave.
On Tue, 2022-08-09 at 09:52 -0700, Darrick J. Wong wrote: > On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison Henderson wrote: > > Recent parent pointer testing has exposed a bug in the underlying > > attr replay. A multi transaction replay currently performs a > > single step of the replay, then deferrs the rest if there is more > > to do. This causes race conditions with other attr replays that > > might be recovered before the remaining deferred work has had a > > chance to finish. This can lead to interleaved set and remove > > operations that may clobber the attribute fork. Fix this by > > deferring all work for any attribute operation. > > > > Signed-off-by: Allison Henderson <allison.henderson@oracle.com> > > --- > > fs/xfs/xfs_attr_item.c | 35 ++++++++--------------------------- > > 1 file changed, 8 insertions(+), 27 deletions(-) > > > > diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c > > index 5077a7ad5646..c13d724a3e13 100644 > > --- a/fs/xfs/xfs_attr_item.c > > +++ b/fs/xfs/xfs_attr_item.c > > @@ -635,52 +635,33 @@ xfs_attri_item_recover( > > break; > > case XFS_ATTRI_OP_FLAGS_REMOVE: > > if (!xfs_inode_hasattr(args->dp)) > > - goto out; > > + return 0; > > attr->xattri_dela_state = > > xfs_attr_init_remove_state(args); > > break; > > default: > > ASSERT(0); > > - error = -EFSCORRUPTED; > > - goto out; > > + return -EFSCORRUPTED; > > } > > > > xfs_init_attr_trans(args, &tres, &total); > > error = xfs_trans_alloc(mp, &tres, total, 0, XFS_TRANS_RESERVE, > > &tp); > > if (error) > > - goto out; > > + return error; > > > > args->trans = tp; > > done_item = xfs_trans_get_attrd(tp, attrip); > > + args->trans->t_flags |= XFS_TRANS_HAS_INTENT_DONE; > > + set_bit(XFS_LI_DIRTY, &done_item->attrd_item.li_flags); > > > > xfs_ilock(ip, XFS_ILOCK_EXCL); > > xfs_trans_ijoin(tp, ip, 0); > > > > - error = xfs_xattri_finish_update(attr, done_item); > > - if (error == -EAGAIN) { > > - /* > > - * There's more work to do, so add the intent item to > > this > > - * transaction so that we can continue it later. > > - */ > > - xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr- > > >xattri_list); > > - error = xfs_defer_ops_capture_and_commit(tp, > > capture_list); > > - if (error) > > - goto out_unlock; > > - > > - xfs_iunlock(ip, XFS_ILOCK_EXCL); > > - xfs_irele(ip); > > - return 0; > > - } > > - if (error) { > > - xfs_trans_cancel(tp); > > - goto out_unlock; > > - } > > - > > + xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list); > > This seems a little convoluted to me. Maybe? Maybe not? > > 1. Log recovery recreates an incore xfs_attri_log_item from what it > finds in the log. > > 2. This function then logs an xattrd for the recovered xattri item. > > 3. Then it creates a new xfs_attr_intent to complete the operation. > > 4. Finally, it calls xfs_defer_ops_capture_and_commit, which logs a > new > xattri for the intent created in step 3 and also commits the xattrd > for > the first xattri. > > IOWs, the only difference between before and after is that we're not > advancing one more step through the state machine as part of log > recovery. From the perspective of the log, the recovery function > merely > replaces the recovered xattri log item with a new one. > > Why can't we just attach the recovered xattri to the > xfs_defer_pending > that is created to point to the xfs_attr_intent that's created in > step > 3, and skip the xattrd? Oh, I see. I hadnt thought of doing it that way, this was based on the initial solution suggested to the first patch of v1 (xfs: Add larp state XFS_DAS_CREATE_FORK). But what you mention below also makes sense. So I suppose if no one has any gripes then maybe it should stay as it is then. Thx for the reviews! Allison > > I /think/ the answer to that question is that we might need to move > the > log tail forward to free enough log space to finish the intent items, > so > creating the extra xattrd/xattri (a) avoid the complexity of > submitting > an incore intent item *and* a log intent item to the defer ops > machinery; and (b) avoid livelocks in log recovery. Therefore, we > actually need to do it this way. > > IOWS, I *think* this is ok, but want to see if others have differing > perspectives on how log item recovery works? > > --D > > > error = xfs_defer_ops_capture_and_commit(tp, capture_list); > > -out_unlock: > > + > > xfs_iunlock(ip, XFS_ILOCK_EXCL); > > xfs_irele(ip); > > -out: > > - xfs_attr_free_item(attr); > > + > > return error; > > } > > > > -- > > 2.25.1 > >
On Wed, 2022-08-10 at 11:58 +1000, Dave Chinner wrote: > On Tue, Aug 09, 2022 at 09:52:55AM -0700, Darrick J. Wong wrote: > > On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison Henderson wrote: > > > Recent parent pointer testing has exposed a bug in the underlying > > > attr replay. A multi transaction replay currently performs a > > > single step of the replay, then deferrs the rest if there is more > > > to do. > > Yup. > > > > This causes race conditions with other attr replays that > > > might be recovered before the remaining deferred work has had a > > > chance to finish. > > What other attr replays are we racing against? There can only be > one incomplete attr item intent/done chain per inode present in log > recovery, right? No, a rename queues up a set and remove before committing the transaction. One for the new parent pointer, and another to remove the old one. It cant be an attr replace because technically the names are different. So the recovered set grows the leaf, and returns the egain, then rest gets capture committed. Next up is the recovered remove which pulls out the fork, which causes problems when the rest of the set operation resumes as a deferred operation. Here is the link to the original discussion, it was quite a while ago: https://lore.kernel.org/all/Yrzw9F5aGsaldrmR@magnolia/ I hope that helps? Allison > > > > This can lead to interleaved set and remove > > > operations that may clobber the attribute fork. Fix this by > > > deferring all work for any attribute operation. > > Which means this should be an impossible situation. > > That is, if we crash before the final attrd DONE intent is written > to the log, it means that new attr intents for modifications made > *after* the current attr modification was completed will not be > present in the log. We have strict ordering of committed operations > in the journal, hence an operation on an inode has an incomplete > intent *must* be the last operation and the *only* incomplete intent > that is found in the journal for that inode. > > Hence from an operational ordering persepective, this explanation > for issue being seen doesn't make any sense to me. If there are > multiple incomplete attri intents then we've either got a runtime > journalling problem (a white-out issue? failing to relog the inode > in each new intent?) or a log recovery problem (failing to match > intent-done pairs correctly?), not a recovery deferral issue. > > Hence I think we're still looking for the root cause of this > problem... > > > > Signed-off-by: Allison Henderson <allison.henderson@oracle.com> > > > --- > > > fs/xfs/xfs_attr_item.c | 35 ++++++++--------------------------- > > > 1 file changed, 8 insertions(+), 27 deletions(-) > > > > > > diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c > > > index 5077a7ad5646..c13d724a3e13 100644 > > > --- a/fs/xfs/xfs_attr_item.c > > > +++ b/fs/xfs/xfs_attr_item.c > > > @@ -635,52 +635,33 @@ xfs_attri_item_recover( > > > break; > > > case XFS_ATTRI_OP_FLAGS_REMOVE: > > > if (!xfs_inode_hasattr(args->dp)) > > > - goto out; > > > + return 0; > > > attr->xattri_dela_state = > > > xfs_attr_init_remove_state(args); > > > break; > > > default: > > > ASSERT(0); > > > - error = -EFSCORRUPTED; > > > - goto out; > > > + return -EFSCORRUPTED; > > > } > > > > > > xfs_init_attr_trans(args, &tres, &total); > > > error = xfs_trans_alloc(mp, &tres, total, 0, XFS_TRANS_RESERVE, > > > &tp); > > > if (error) > > > - goto out; > > > + return error; > > > > > > args->trans = tp; > > > done_item = xfs_trans_get_attrd(tp, attrip); > > > + args->trans->t_flags |= XFS_TRANS_HAS_INTENT_DONE; > > > + set_bit(XFS_LI_DIRTY, &done_item->attrd_item.li_flags); > > > > > > xfs_ilock(ip, XFS_ILOCK_EXCL); > > > xfs_trans_ijoin(tp, ip, 0); > > > > > > - error = xfs_xattri_finish_update(attr, done_item); > > > - if (error == -EAGAIN) { > > > - /* > > > - * There's more work to do, so add the intent item to > > > this > > > - * transaction so that we can continue it later. > > > - */ > > > - xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr- > > > >xattri_list); > > > - error = xfs_defer_ops_capture_and_commit(tp, > > > capture_list); > > > - if (error) > > > - goto out_unlock; > > > - > > > - xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > - xfs_irele(ip); > > > - return 0; > > > - } > > > - if (error) { > > > - xfs_trans_cancel(tp); > > > - goto out_unlock; > > > - } > > > - > > > + xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list); > > > > This seems a little convoluted to me. Maybe? Maybe not? > > > > 1. Log recovery recreates an incore xfs_attri_log_item from what it > > finds in the log. > > > > 2. This function then logs an xattrd for the recovered xattri item. > > > > 3. Then it creates a new xfs_attr_intent to complete the operation. > > > > 4. Finally, it calls xfs_defer_ops_capture_and_commit, which logs a > > new > > xattri for the intent created in step 3 and also commits the xattrd > > for > > the first xattri. > > > > IOWs, the only difference between before and after is that we're > > not > > advancing one more step through the state machine as part of log > > recovery. From the perspective of the log, the recovery function > > merely > > replaces the recovered xattri log item with a new one. > > > > Why can't we just attach the recovered xattri to the > > xfs_defer_pending > > that is created to point to the xfs_attr_intent that's created in > > step > > 3, and skip the xattrd? > > Remember that attribute intents are different to all other intent > types that we have. The existing extent based intents define a > single indepedent operation that needs to be performed, and each > step of the intent chain is completely independent of the previous > step in the chain. e.g. removing the extent from the rmap btree is > completely independent of removing it from the inode bmap btree - > all that matters is that the removal from the bmbt happens first. > The rmapbt removal can happen at any time after than, and is > completely independent of any other bmbt or rmapbt operation. > Similarly, the EFI can processed independently of all bmapbt and > rmapbt modifications, it just has to happen after those > modifications are done. > > Hence if we crash during recovery, we can just restart from > where-ever we got to in the middle of the intent chains and not have > to care at all. IOWs, eventual consistency works with these chains > because there is no dependencies between each step of the intent > chain and each step is completely independent of the other steps. > > Attribute intent chains are completely different. They link steps in > a state machine together in a non-trivial, highly dependent chain. > We can't just restart the chain in the middle like we can for the > BUI->RUI->CUI->EFI chain because the on-disk attribute is in an > unknown state and recovering that exact state is .... complex. > > Hence the the first step of recovery is to return the attribute we > are trying to modify back to a known state. That means we have to > perform a removal of any existing attribute under that name first. > Hence this first step should be replacing the existing attr intent > with the intent that defines the recovery operation we are going to > perform. > > That means we need to translate set to replace so that cleanup is > run first, replace needs to clean up the attr under that name > regardless of whether it has the incomplete bit set on it or not. > Remove is the only operation that runs the same as at runtime, as > cleanup for remove is just repeating the remove operation from > scratch. > > > I /think/ the answer to that question is that we might need to move > > the > > log tail forward to free enough log space to finish the intent > > items, so > > creating the extra xattrd/xattri (a) avoid the complexity of > > submitting > > an incore intent item *and* a log intent item to the defer ops > > machinery; and (b) avoid livelocks in log recovery. Therefore, we > > actually need to do it this way. > > We really need the initial operation to rewrite the intent to match > the recovery operation we are going to perform. Everything else is > secondary. > > Cheers, > > Dave.
On Tue, Aug 09, 2022 at 10:01:49PM -0700, Alli wrote: > On Wed, 2022-08-10 at 11:58 +1000, Dave Chinner wrote: > > On Tue, Aug 09, 2022 at 09:52:55AM -0700, Darrick J. Wong wrote: > > > On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison Henderson wrote: > > > > Recent parent pointer testing has exposed a bug in the underlying > > > > attr replay. A multi transaction replay currently performs a > > > > single step of the replay, then deferrs the rest if there is more > > > > to do. > > > > Yup. > > > > > > This causes race conditions with other attr replays that > > > > might be recovered before the remaining deferred work has had a > > > > chance to finish. > > > > What other attr replays are we racing against? There can only be > > one incomplete attr item intent/done chain per inode present in log > > recovery, right? > No, a rename queues up a set and remove before committing the > transaction. One for the new parent pointer, and another to remove the > old one. Ah. That really needs to be described in the commit message - changing from "single intent chain per object" to "multiple concurrent independent and unserialised intent chains per object" is a pretty important design rule change... The whole point of intents is to allow complex, multi-stage operations on a single object to be sequenced in a tightly controlled manner. They weren't intended to be run as concurrent lines of modification on single items; if you need to do two modifications on an object, the intent chain ties the two modifications together into a single whole. One of the reasons I rewrote the attr state machine for LARP was to enable new multiple attr operation chains to be easily build from the entry points the state machien provides. Parent attr rename needs a new intent chain to be built, not run multiple independent intent chains for each modification. > It cant be an attr replace because technically the names are > different. I disagree - we have all the pieces we need in the state machine already, we just need to define separate attr names for the remove and insert steps in the attr intent. That is, the "replace" operation we execute when an attr set overwrites the value is "technically" a "replace value" operation, but we actually implement it as a "replace entire attribute" operation. Without LARP, we do that overwrite in independent steps via an intermediate INCOMPLETE state to allow two xattrs of the same name to exist in the attr tree at the same time. IOWs, the attr value overwrite is effectively a "set-swap-remove" operation on two entirely independent xattrs, ensuring that if we crash we always have either the old or new xattr visible. With LARP, we can remove the original attr first, thereby avoiding the need for two versions of the xattr to exist in the tree in the first place. However, we have to do these two operations as a pair of linked independent operations. The intent chain provides the linking, and requires us to log the name and the value of the attr that we are overwriting in the intent. Hence we can always recover the modification to completion no matter where in the operation we fail. When it comes to a parent attr rename operation, we are effectively doing two linked operations - remove the old attr, set the new attr - on different attributes. Implementation wise, it is exactly the same sequence as a "replace value" operation, except for the fact that the new attr we add has a different name. Hence the only real difference between the existing "attr replace" and the intent chain we need for "parent attr rename" is that we have to log two attr names instead of one. Basically, we have a new XFS_ATTRI_OP_FLAGS... type for this, and that's what tells us that we are operating on two different attributes instead of just one. The recovery operation becomes slightly different - we have to run a remove on the old, then a replace on the new - so there a little bit of new code needed to manage that in the state machine. These, however, are just small tweaks on the existing replace attr operation, and there should be little difference in performance or overhead between a "replace value" and a "replace entire xattr" operation as they are largely the same runtime operation for LARP. > So the recovered set grows the leaf, and returns the egain, then rest > gets capture committed. Next up is the recovered remove which pulls > out the fork, which causes problems when the rest of the set operation > resumes as a deferred operation. Yup, and all this goes away when we build the right intent chain for replacing a parent attr rename.... Cheers, Dave.
On Wed, Aug 10, 2022 at 04:12:58PM +1000, Dave Chinner wrote: > On Tue, Aug 09, 2022 at 10:01:49PM -0700, Alli wrote: > > On Wed, 2022-08-10 at 11:58 +1000, Dave Chinner wrote: > > > On Tue, Aug 09, 2022 at 09:52:55AM -0700, Darrick J. Wong wrote: > > > > On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison Henderson wrote: > > > > > Recent parent pointer testing has exposed a bug in the underlying > > > > > attr replay. A multi transaction replay currently performs a > > > > > single step of the replay, then deferrs the rest if there is more > > > > > to do. > > > > > > Yup. > > > > > > > > This causes race conditions with other attr replays that > > > > > might be recovered before the remaining deferred work has had a > > > > > chance to finish. > > > > > > What other attr replays are we racing against? There can only be > > > one incomplete attr item intent/done chain per inode present in log > > > recovery, right? > > No, a rename queues up a set and remove before committing the > > transaction. One for the new parent pointer, and another to remove the > > old one. > > Ah. That really needs to be described in the commit message - > changing from "single intent chain per object" to "multiple > concurrent independent and unserialised intent chains per object" is > a pretty important design rule change... > > The whole point of intents is to allow complex, multi-stage > operations on a single object to be sequenced in a tightly > controlled manner. They weren't intended to be run as concurrent > lines of modification on single items; if you need to do two > modifications on an object, the intent chain ties the two > modifications together into a single whole. Back when I made the suggestion that resulted in this patch, I was pondering why it is that (say) atomic swapext didn't suffer from these recovery problems, and I realized that for any given inode, you can only have one ongoing swapext operation at a time. That's why recovery of swapext operations works fine, whereas pptr recovery has this quirk. At the time, my thought process was more narrowly focused on making log recovery mimic runtime more closely. I didn't make the connection between this problem and the other open question I had (see the bottom) about how to fix pptr attrs when rebuilding a directory. > One of the reasons I rewrote the attr state machine for LARP was to > enable new multiple attr operation chains to be easily build from > the entry points the state machien provides. Parent attr rename > needs a new intent chain to be built, not run multiple independent > intent chains for each modification. > > > It cant be an attr replace because technically the names are > > different. > > I disagree - we have all the pieces we need in the state machine > already, we just need to define separate attr names for the > remove and insert steps in the attr intent. > > That is, the "replace" operation we execute when an attr set > overwrites the value is "technically" a "replace value" operation, > but we actually implement it as a "replace entire attribute" > operation. OH. Right. I forgot that ATTR_REPLACE=="replace entire attr". If I'm understanding this right, that means that the xfs_rename patch ought to detect the situation where there's an existing dirent in the target directory, and do something along the lines of: } else { /* target_ip != NULL */ xfs_dir_replace(...); xfs_parent_defer_replace(tp, new_parent_ptr, target_dp, old_diroffset, target_name, new_diroffset); xfs_trans_ichgtime(...); Where the xfs_parent_defer_replace operation does an ATTR_REPLACE to switch: (target_dp_ino, target_gen, old_diroffset) == <dontcare> to this: (target_dp_ino, target_gen, new_diroffset) == target_name except, I think we have to log the old name in addition to the new name, because userspace ATTR_REPLACE operations don't allow name changes? I guess this also implies that xfs_dir_replace will pass out the offset of the old name, in addition to the offset of the new name. > Without LARP, we do that overwrite in independent steps via an > intermediate INCOMPLETE state to allow two xattrs of the same name > to exist in the attr tree at the same time. IOWs, the attr value > overwrite is effectively a "set-swap-remove" operation on two > entirely independent xattrs, ensuring that if we crash we always > have either the old or new xattr visible. > > With LARP, we can remove the original attr first, thereby avoiding > the need for two versions of the xattr to exist in the tree in the > first place. However, we have to do these two operations as a pair > of linked independent operations. The intent chain provides the > linking, and requires us to log the name and the value of the attr > that we are overwriting in the intent. Hence we can always recover > the modification to completion no matter where in the operation we > fail. > > When it comes to a parent attr rename operation, we are effectively > doing two linked operations - remove the old attr, set the new attr > - on different attributes. Implementation wise, it is exactly the > same sequence as a "replace value" operation, except for the fact > that the new attr we add has a different name. > > Hence the only real difference between the existing "attr replace" > and the intent chain we need for "parent attr rename" is that we > have to log two attr names instead of one. Basically, we have a new > XFS_ATTRI_OP_FLAGS... type for this, and that's what tells us that > we are operating on two different attributes instead of just one. This answers my earlier question: Yes, and yes. > The recovery operation becomes slightly different - we have to run a > remove on the old, then a replace on the new - so there a little bit > of new code needed to manage that in the state machine. > > These, however, are just small tweaks on the existing replace attr > operation, and there should be little difference in performance or > overhead between a "replace value" and a "replace entire xattr" > operation as they are largely the same runtime operation for LARP. > > > So the recovered set grows the leaf, and returns the egain, then rest > > gets capture committed. Next up is the recovered remove which pulls > > out the fork, which causes problems when the rest of the set operation > > resumes as a deferred operation. > > Yup, and all this goes away when we build the right intent chain for > replacing a parent attr rename.... Funnily enough, just last week I had thought that online repair was going to require the ability to replace an entire xattr... https://djwong.org/docs/xfs-online-fsck-design/#parent-pointers --D > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Wed, 2022-08-10 at 08:52 -0700, Darrick J. Wong wrote: > On Wed, Aug 10, 2022 at 04:12:58PM +1000, Dave Chinner wrote: > > On Tue, Aug 09, 2022 at 10:01:49PM -0700, Alli wrote: > > > On Wed, 2022-08-10 at 11:58 +1000, Dave Chinner wrote: > > > > On Tue, Aug 09, 2022 at 09:52:55AM -0700, Darrick J. Wong > > > > wrote: > > > > > On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison Henderson > > > > > wrote: > > > > > > Recent parent pointer testing has exposed a bug in the > > > > > > underlying > > > > > > attr replay. A multi transaction replay currently performs > > > > > > a > > > > > > single step of the replay, then deferrs the rest if there > > > > > > is more > > > > > > to do. > > > > > > > > Yup. > > > > > > > > > > This causes race conditions with other attr replays that > > > > > > might be recovered before the remaining deferred work has > > > > > > had a > > > > > > chance to finish. > > > > > > > > What other attr replays are we racing against? There can only > > > > be > > > > one incomplete attr item intent/done chain per inode present in > > > > log > > > > recovery, right? > > > No, a rename queues up a set and remove before committing the > > > transaction. One for the new parent pointer, and another to > > > remove the > > > old one. > > > > Ah. That really needs to be described in the commit message - > > changing from "single intent chain per object" to "multiple > > concurrent independent and unserialised intent chains per object" > > is > > a pretty important design rule change... > > > > The whole point of intents is to allow complex, multi-stage > > operations on a single object to be sequenced in a tightly > > controlled manner. They weren't intended to be run as concurrent > > lines of modification on single items; if you need to do two > > modifications on an object, the intent chain ties the two > > modifications together into a single whole. > > Back when I made the suggestion that resulted in this patch, I was > pondering why it is that (say) atomic swapext didn't suffer from > these > recovery problems, and I realized that for any given inode, you can > only > have one ongoing swapext operation at a time. That's why recovery of > swapext operations works fine, whereas pptr recovery has this quirk. > > At the time, my thought process was more narrowly focused on making > log > recovery mimic runtime more closely. I didn't make the connection > between this problem and the other open question I had (see the > bottom) > about how to fix pptr attrs when rebuilding a directory. > > > One of the reasons I rewrote the attr state machine for LARP was to > > enable new multiple attr operation chains to be easily build from > > the entry points the state machien provides. Parent attr rename > > needs a new intent chain to be built, not run multiple independent > > intent chains for each modification. > > > > > It cant be an attr replace because technically the names are > > > different. > > > > I disagree - we have all the pieces we need in the state machine > > already, we just need to define separate attr names for the > > remove and insert steps in the attr intent. > > > > That is, the "replace" operation we execute when an attr set > > overwrites the value is "technically" a "replace value" operation, > > but we actually implement it as a "replace entire attribute" > > operation. > > OH. Right. I forgot that ATTR_REPLACE=="replace entire attr". > > If I'm understanding this right, that means that the xfs_rename patch > ought to detect the situation where there's an existing dirent in the > target directory, and do something along the lines of: > > } else { /* target_ip != NULL */ > xfs_dir_replace(...); > > xfs_parent_defer_replace(tp, new_parent_ptr, target_dp, > old_diroffset, target_name, > new_diroffset); > > xfs_trans_ichgtime(...); > > Where the xfs_parent_defer_replace operation does an ATTR_REPLACE to > switch: > > (target_dp_ino, target_gen, old_diroffset) == <dontcare> > > to this: > > (target_dp_ino, target_gen, new_diroffset) == target_name > > except, I think we have to log the old name in addition to the new > name, > because userspace ATTR_REPLACE operations don't allow name changes? > > I guess this also implies that xfs_dir_replace will pass out the > offset > of the old name, in addition to the offset of the new name. > > > Without LARP, we do that overwrite in independent steps via an > > intermediate INCOMPLETE state to allow two xattrs of the same name > > to exist in the attr tree at the same time. IOWs, the attr value > > overwrite is effectively a "set-swap-remove" operation on two > > entirely independent xattrs, ensuring that if we crash we always > > have either the old or new xattr visible. > > > > With LARP, we can remove the original attr first, thereby avoiding > > the need for two versions of the xattr to exist in the tree in the > > first place. However, we have to do these two operations as a pair > > of linked independent operations. The intent chain provides the > > linking, and requires us to log the name and the value of the attr > > that we are overwriting in the intent. Hence we can always recover > > the modification to completion no matter where in the operation we > > fail. > > > > When it comes to a parent attr rename operation, we are effectively > > doing two linked operations - remove the old attr, set the new attr > > - on different attributes. Implementation wise, it is exactly the > > same sequence as a "replace value" operation, except for the fact > > that the new attr we add has a different name. > > > > Hence the only real difference between the existing "attr replace" > > and the intent chain we need for "parent attr rename" is that we > > have to log two attr names instead of one. Basically, we have a new > > XFS_ATTRI_OP_FLAGS... type for this, and that's what tells us that > > we are operating on two different attributes instead of just one. > > This answers my earlier question: Yes, and yes. I see, alrighty then, I'll see if I can put together a new XFS_ATTRI_OP_FLAGS type that carries both the old and new name. That sounds like it should work. Thanks for all the feed back! Allison > > > The recovery operation becomes slightly different - we have to run > > a > > remove on the old, then a replace on the new - so there a little > > bit > > of new code needed to manage that in the state machine. > > > > These, however, are just small tweaks on the existing replace attr > > operation, and there should be little difference in performance or > > overhead between a "replace value" and a "replace entire xattr" > > operation as they are largely the same runtime operation for LARP. > > > > > So the recovered set grows the leaf, and returns the egain, then > > > rest > > > gets capture committed. Next up is the recovered remove which > > > pulls > > > out the fork, which causes problems when the rest of the set > > > operation > > > resumes as a deferred operation. > > > > Yup, and all this goes away when we build the right intent chain > > for > > replacing a parent attr rename.... > > Funnily enough, just last week I had thought that online repair was > going to require the ability to replace an entire xattr... > > https://urldefense.com/v3/__https://djwong.org/docs/xfs-online-fsck-design/*parent-pointers__;Iw!!ACWV5N9M2RV99hQ!MA2KfxWZLMTj_fdJoFnvZhLIgOGsGlIclRVE39DFME755VnvyX4VqsQGM6GfBDnDXKkfAcFjdv2oENaXepic$ > > --D > > > Cheers, > > > > Dave. > > -- > > Dave Chinner > > david@fromorbit.com
On Wed, 2022-08-10 at 16:12 +1000, Dave Chinner wrote: > On Tue, Aug 09, 2022 at 10:01:49PM -0700, Alli wrote: > > On Wed, 2022-08-10 at 11:58 +1000, Dave Chinner wrote: > > > On Tue, Aug 09, 2022 at 09:52:55AM -0700, Darrick J. Wong wrote: > > > > On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison Henderson > > > > wrote: > > > > > Recent parent pointer testing has exposed a bug in the > > > > > underlying > > > > > attr replay. A multi transaction replay currently performs a > > > > > single step of the replay, then deferrs the rest if there is > > > > > more > > > > > to do. > > > > > > Yup. > > > > > > > > This causes race conditions with other attr replays that > > > > > might be recovered before the remaining deferred work has had > > > > > a > > > > > chance to finish. > > > > > > What other attr replays are we racing against? There can only be > > > one incomplete attr item intent/done chain per inode present in > > > log > > > recovery, right? > > No, a rename queues up a set and remove before committing the > > transaction. One for the new parent pointer, and another to remove > > the > > old one. > > Ah. That really needs to be described in the commit message - > changing from "single intent chain per object" to "multiple > concurrent independent and unserialised intent chains per object" is > a pretty important design rule change... > > The whole point of intents is to allow complex, multi-stage > operations on a single object to be sequenced in a tightly > controlled manner. They weren't intended to be run as concurrent > lines of modification on single items; if you need to do two > modifications on an object, the intent chain ties the two > modifications together into a single whole. > > One of the reasons I rewrote the attr state machine for LARP was to > enable new multiple attr operation chains to be easily build from > the entry points the state machien provides. Parent attr rename > needs a new intent chain to be built, not run multiple independent > intent chains for each modification. > > > It cant be an attr replace because technically the names are > > different. > > I disagree - we have all the pieces we need in the state machine > already, we just need to define separate attr names for the > remove and insert steps in the attr intent. > > That is, the "replace" operation we execute when an attr set > overwrites the value is "technically" a "replace value" operation, > but we actually implement it as a "replace entire attribute" > operation. > > Without LARP, we do that overwrite in independent steps via an > intermediate INCOMPLETE state to allow two xattrs of the same name > to exist in the attr tree at the same time. IOWs, the attr value > overwrite is effectively a "set-swap-remove" operation on two > entirely independent xattrs, ensuring that if we crash we always > have either the old or new xattr visible. > > With LARP, we can remove the original attr first, thereby avoiding > the need for two versions of the xattr to exist in the tree in the > first place. However, we have to do these two operations as a pair > of linked independent operations. The intent chain provides the > linking, and requires us to log the name and the value of the attr > that we are overwriting in the intent. Hence we can always recover > the modification to completion no matter where in the operation we > fail. > > When it comes to a parent attr rename operation, we are effectively > doing two linked operations - remove the old attr, set the new attr > - on different attributes. Implementation wise, it is exactly the > same sequence as a "replace value" operation, except for the fact > that the new attr we add has a different name. > > Hence the only real difference between the existing "attr replace" > and the intent chain we need for "parent attr rename" is that we > have to log two attr names instead of one. To be clear, this would imply expanding xfs_attri_log_format to have another alfi_new_name_len feild and another iovec for the attr intent right? Does that cause issues to change the on disk log layout after the original has merged? Or is that ok for things that are still experimental? Thanks! Allison > Basically, we have a new > XFS_ATTRI_OP_FLAGS... type for this, and that's what tells us that > we are operating on two different attributes instead of just one. > > The recovery operation becomes slightly different - we have to run a > remove on the old, then a replace on the new - so there a little bit > of new code needed to manage that in the state machine. > > These, however, are just small tweaks on the existing replace attr > operation, and there should be little difference in performance or > overhead between a "replace value" and a "replace entire xattr" > operation as they are largely the same runtime operation for LARP. > > > So the recovered set grows the leaf, and returns the egain, then > > rest > > gets capture committed. Next up is the recovered remove which > > pulls > > out the fork, which causes problems when the rest of the set > > operation > > resumes as a deferred operation. > > Yup, and all this goes away when we build the right intent chain for > replacing a parent attr rename.... > > Cheers, > > Dave.
On Thu, Aug 11, 2022 at 06:55:16PM -0700, Alli wrote: > On Wed, 2022-08-10 at 16:12 +1000, Dave Chinner wrote: > > On Tue, Aug 09, 2022 at 10:01:49PM -0700, Alli wrote: > > > On Wed, 2022-08-10 at 11:58 +1000, Dave Chinner wrote: > > > > On Tue, Aug 09, 2022 at 09:52:55AM -0700, Darrick J. Wong wrote: > > > > > On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison Henderson > > > > > wrote: > > > > > > Recent parent pointer testing has exposed a bug in the > > > > > > underlying > > > > > > attr replay. A multi transaction replay currently performs a > > > > > > single step of the replay, then deferrs the rest if there is > > > > > > more > > > > > > to do. > > > > > > > > Yup. > > > > > > > > > > This causes race conditions with other attr replays that > > > > > > might be recovered before the remaining deferred work has had > > > > > > a > > > > > > chance to finish. > > > > > > > > What other attr replays are we racing against? There can only be > > > > one incomplete attr item intent/done chain per inode present in > > > > log > > > > recovery, right? > > > No, a rename queues up a set and remove before committing the > > > transaction. One for the new parent pointer, and another to remove > > > the > > > old one. > > > > Ah. That really needs to be described in the commit message - > > changing from "single intent chain per object" to "multiple > > concurrent independent and unserialised intent chains per object" is > > a pretty important design rule change... > > > > The whole point of intents is to allow complex, multi-stage > > operations on a single object to be sequenced in a tightly > > controlled manner. They weren't intended to be run as concurrent > > lines of modification on single items; if you need to do two > > modifications on an object, the intent chain ties the two > > modifications together into a single whole. > > > > One of the reasons I rewrote the attr state machine for LARP was to > > enable new multiple attr operation chains to be easily build from > > the entry points the state machien provides. Parent attr rename > > needs a new intent chain to be built, not run multiple independent > > intent chains for each modification. > > > > > It cant be an attr replace because technically the names are > > > different. > > > > I disagree - we have all the pieces we need in the state machine > > already, we just need to define separate attr names for the > > remove and insert steps in the attr intent. > > > > That is, the "replace" operation we execute when an attr set > > overwrites the value is "technically" a "replace value" operation, > > but we actually implement it as a "replace entire attribute" > > operation. > > > > Without LARP, we do that overwrite in independent steps via an > > intermediate INCOMPLETE state to allow two xattrs of the same name > > to exist in the attr tree at the same time. IOWs, the attr value > > overwrite is effectively a "set-swap-remove" operation on two > > entirely independent xattrs, ensuring that if we crash we always > > have either the old or new xattr visible. > > > > With LARP, we can remove the original attr first, thereby avoiding > > the need for two versions of the xattr to exist in the tree in the > > first place. However, we have to do these two operations as a pair > > of linked independent operations. The intent chain provides the > > linking, and requires us to log the name and the value of the attr > > that we are overwriting in the intent. Hence we can always recover > > the modification to completion no matter where in the operation we > > fail. > > > > When it comes to a parent attr rename operation, we are effectively > > doing two linked operations - remove the old attr, set the new attr > > - on different attributes. Implementation wise, it is exactly the > > same sequence as a "replace value" operation, except for the fact > > that the new attr we add has a different name. > > > > Hence the only real difference between the existing "attr replace" > > and the intent chain we need for "parent attr rename" is that we > > have to log two attr names instead of one. > > To be clear, this would imply expanding xfs_attri_log_format to have > another alfi_new_name_len feild and another iovec for the attr intent > right? Does that cause issues to change the on disk log layout after > the original has merged? Or is that ok for things that are still > experimental? Thanks! XFS_SB_FEAT_INCOMPAT_PARENT should protect against that, since the userspace xattr api does not support replacing an attr's name, only the value. --D > Allison > > > Basically, we have a new > > XFS_ATTRI_OP_FLAGS... type for this, and that's what tells us that > > we are operating on two different attributes instead of just one. > > > > The recovery operation becomes slightly different - we have to run a > > remove on the old, then a replace on the new - so there a little bit > > of new code needed to manage that in the state machine. > > > > These, however, are just small tweaks on the existing replace attr > > operation, and there should be little difference in performance or > > overhead between a "replace value" and a "replace entire xattr" > > operation as they are largely the same runtime operation for LARP. > > > > > So the recovered set grows the leaf, and returns the egain, then > > > rest > > > gets capture committed. Next up is the recovered remove which > > > pulls > > > out the fork, which causes problems when the rest of the set > > > operation > > > resumes as a deferred operation. > > > > Yup, and all this goes away when we build the right intent chain for > > replacing a parent attr rename.... > > > > Cheers, > > > > Dave. >
On Thu, Aug 11, 2022 at 06:55:16PM -0700, Alli wrote: > On Wed, 2022-08-10 at 16:12 +1000, Dave Chinner wrote: > > On Tue, Aug 09, 2022 at 10:01:49PM -0700, Alli wrote: > > > On Wed, 2022-08-10 at 11:58 +1000, Dave Chinner wrote: > > > > On Tue, Aug 09, 2022 at 09:52:55AM -0700, Darrick J. Wong wrote: > > > > > On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison Henderson > > > > > wrote: > > > > > > Recent parent pointer testing has exposed a bug in the > > > > > > underlying > > > > > > attr replay. A multi transaction replay currently performs a > > > > > > single step of the replay, then deferrs the rest if there is > > > > > > more > > > > > > to do. > > > > > > > > Yup. > > > > > > > > > > This causes race conditions with other attr replays that > > > > > > might be recovered before the remaining deferred work has had > > > > > > a > > > > > > chance to finish. > > > > > > > > What other attr replays are we racing against? There can only be > > > > one incomplete attr item intent/done chain per inode present in > > > > log > > > > recovery, right? > > > No, a rename queues up a set and remove before committing the > > > transaction. One for the new parent pointer, and another to remove > > > the > > > old one. > > > > Ah. That really needs to be described in the commit message - > > changing from "single intent chain per object" to "multiple > > concurrent independent and unserialised intent chains per object" is > > a pretty important design rule change... > > > > The whole point of intents is to allow complex, multi-stage > > operations on a single object to be sequenced in a tightly > > controlled manner. They weren't intended to be run as concurrent > > lines of modification on single items; if you need to do two > > modifications on an object, the intent chain ties the two > > modifications together into a single whole. > > > > One of the reasons I rewrote the attr state machine for LARP was to > > enable new multiple attr operation chains to be easily build from > > the entry points the state machien provides. Parent attr rename > > needs a new intent chain to be built, not run multiple independent > > intent chains for each modification. > > > > > It cant be an attr replace because technically the names are > > > different. > > > > I disagree - we have all the pieces we need in the state machine > > already, we just need to define separate attr names for the > > remove and insert steps in the attr intent. > > > > That is, the "replace" operation we execute when an attr set > > overwrites the value is "technically" a "replace value" operation, > > but we actually implement it as a "replace entire attribute" > > operation. > > > > Without LARP, we do that overwrite in independent steps via an > > intermediate INCOMPLETE state to allow two xattrs of the same name > > to exist in the attr tree at the same time. IOWs, the attr value > > overwrite is effectively a "set-swap-remove" operation on two > > entirely independent xattrs, ensuring that if we crash we always > > have either the old or new xattr visible. > > > > With LARP, we can remove the original attr first, thereby avoiding > > the need for two versions of the xattr to exist in the tree in the > > first place. However, we have to do these two operations as a pair > > of linked independent operations. The intent chain provides the > > linking, and requires us to log the name and the value of the attr > > that we are overwriting in the intent. Hence we can always recover > > the modification to completion no matter where in the operation we > > fail. > > > > When it comes to a parent attr rename operation, we are effectively > > doing two linked operations - remove the old attr, set the new attr > > - on different attributes. Implementation wise, it is exactly the > > same sequence as a "replace value" operation, except for the fact > > that the new attr we add has a different name. > > > > Hence the only real difference between the existing "attr replace" > > and the intent chain we need for "parent attr rename" is that we > > have to log two attr names instead of one. > > To be clear, this would imply expanding xfs_attri_log_format to have > another alfi_new_name_len feild and another iovec for the attr intent > right? Does that cause issues to change the on disk log layout after > the original has merged? Or is that ok for things that are still > experimental? Thanks! I think we can get away with this quite easily without breaking the existing experimental code. struct xfs_attri_log_format { uint16_t alfi_type; /* attri log item type */ uint16_t alfi_size; /* size of this item */ uint32_t __pad; /* pad to 64 bit aligned */ uint64_t alfi_id; /* attri identifier */ uint64_t alfi_ino; /* the inode for this attr operation */ uint32_t alfi_op_flags; /* marks the op as a set or remove */ uint32_t alfi_name_len; /* attr name length */ uint32_t alfi_value_len; /* attr value length */ uint32_t alfi_attr_filter;/* attr filter flags */ }; We have a padding field in there that is currently all zeros. Let's make that a count of the number of {name, value} tuples that are appended to the format. i.e. struct xfs_attri_log_name { uint32_t alfi_op_flags; /* marks the op as a set or remove */ uint32_t alfi_name_len; /* attr name length */ uint32_t alfi_value_len; /* attr value length */ uint32_t alfi_attr_filter;/* attr filter flags */ }; struct xfs_attri_log_format { uint16_t alfi_type; /* attri log item type */ uint16_t alfi_size; /* size of this item */ uint8_t alfi_attr_cnt; /* count of name/val pairs */ uint8_t __pad1; /* pad to 64 bit aligned */ uint16_t __pad2; /* pad to 64 bit aligned */ uint64_t alfi_id; /* attri identifier */ uint64_t alfi_ino; /* the inode for this attr operation */ struct xfs_attri_log_name alfi_attr[]; /* attrs to operate on */ }; Basically, the size and shape of the structure has not changed, and if alfi_attr_cnt == 0 we just treat it as if alfi_attr_cnt == 1 as the backwards compat code for the existing code. And then we just have as many followup regions for name/val pairs as are defined by the alfi_attr_cnt and alfi_attr[] parts of the structure. Each attr can have a different operation performed on them, and they can have different filters applied so they can exist in different namespaces, too. SO I don't think we need a new on-disk feature bit for this enhancement - it definitely comes under the heading of "this stuff is experimental, this is the sort of early structure revision that EXPERIMENTAL is supposed to cover.... Cheers, Dave.
On Tue, Aug 16, 2022 at 10:54:38AM +1000, Dave Chinner wrote: > On Thu, Aug 11, 2022 at 06:55:16PM -0700, Alli wrote: > > On Wed, 2022-08-10 at 16:12 +1000, Dave Chinner wrote: > > > On Tue, Aug 09, 2022 at 10:01:49PM -0700, Alli wrote: > > > > On Wed, 2022-08-10 at 11:58 +1000, Dave Chinner wrote: > > > > > On Tue, Aug 09, 2022 at 09:52:55AM -0700, Darrick J. Wong wrote: > > > > > > On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison Henderson > > > > > > wrote: > > > > > > > Recent parent pointer testing has exposed a bug in the > > > > > > > underlying > > > > > > > attr replay. A multi transaction replay currently performs a > > > > > > > single step of the replay, then deferrs the rest if there is > > > > > > > more > > > > > > > to do. > > > > > > > > > > Yup. > > > > > > > > > > > > This causes race conditions with other attr replays that > > > > > > > might be recovered before the remaining deferred work has had > > > > > > > a > > > > > > > chance to finish. > > > > > > > > > > What other attr replays are we racing against? There can only be > > > > > one incomplete attr item intent/done chain per inode present in > > > > > log > > > > > recovery, right? > > > > No, a rename queues up a set and remove before committing the > > > > transaction. One for the new parent pointer, and another to remove > > > > the > > > > old one. > > > > > > Ah. That really needs to be described in the commit message - > > > changing from "single intent chain per object" to "multiple > > > concurrent independent and unserialised intent chains per object" is > > > a pretty important design rule change... > > > > > > The whole point of intents is to allow complex, multi-stage > > > operations on a single object to be sequenced in a tightly > > > controlled manner. They weren't intended to be run as concurrent > > > lines of modification on single items; if you need to do two > > > modifications on an object, the intent chain ties the two > > > modifications together into a single whole. > > > > > > One of the reasons I rewrote the attr state machine for LARP was to > > > enable new multiple attr operation chains to be easily build from > > > the entry points the state machien provides. Parent attr rename > > > needs a new intent chain to be built, not run multiple independent > > > intent chains for each modification. > > > > > > > It cant be an attr replace because technically the names are > > > > different. > > > > > > I disagree - we have all the pieces we need in the state machine > > > already, we just need to define separate attr names for the > > > remove and insert steps in the attr intent. > > > > > > That is, the "replace" operation we execute when an attr set > > > overwrites the value is "technically" a "replace value" operation, > > > but we actually implement it as a "replace entire attribute" > > > operation. > > > > > > Without LARP, we do that overwrite in independent steps via an > > > intermediate INCOMPLETE state to allow two xattrs of the same name > > > to exist in the attr tree at the same time. IOWs, the attr value > > > overwrite is effectively a "set-swap-remove" operation on two > > > entirely independent xattrs, ensuring that if we crash we always > > > have either the old or new xattr visible. > > > > > > With LARP, we can remove the original attr first, thereby avoiding > > > the need for two versions of the xattr to exist in the tree in the > > > first place. However, we have to do these two operations as a pair > > > of linked independent operations. The intent chain provides the > > > linking, and requires us to log the name and the value of the attr > > > that we are overwriting in the intent. Hence we can always recover > > > the modification to completion no matter where in the operation we > > > fail. > > > > > > When it comes to a parent attr rename operation, we are effectively > > > doing two linked operations - remove the old attr, set the new attr > > > - on different attributes. Implementation wise, it is exactly the > > > same sequence as a "replace value" operation, except for the fact > > > that the new attr we add has a different name. > > > > > > Hence the only real difference between the existing "attr replace" > > > and the intent chain we need for "parent attr rename" is that we > > > have to log two attr names instead of one. > > > > To be clear, this would imply expanding xfs_attri_log_format to have > > another alfi_new_name_len feild and another iovec for the attr intent > > right? Does that cause issues to change the on disk log layout after > > the original has merged? Or is that ok for things that are still > > experimental? Thanks! > > I think we can get away with this quite easily without breaking the > existing experimental code. > > struct xfs_attri_log_format { > uint16_t alfi_type; /* attri log item type */ > uint16_t alfi_size; /* size of this item */ > uint32_t __pad; /* pad to 64 bit aligned */ > uint64_t alfi_id; /* attri identifier */ > uint64_t alfi_ino; /* the inode for this attr operation */ > uint32_t alfi_op_flags; /* marks the op as a set or remove */ > uint32_t alfi_name_len; /* attr name length */ > uint32_t alfi_value_len; /* attr value length */ > uint32_t alfi_attr_filter;/* attr filter flags */ > }; > > We have a padding field in there that is currently all zeros. Let's > make that a count of the number of {name, value} tuples that are > appended to the format. i.e. > > struct xfs_attri_log_name { > uint32_t alfi_op_flags; /* marks the op as a set or remove */ > uint32_t alfi_name_len; /* attr name length */ > uint32_t alfi_value_len; /* attr value length */ > uint32_t alfi_attr_filter;/* attr filter flags */ > }; > > struct xfs_attri_log_format { > uint16_t alfi_type; /* attri log item type */ > uint16_t alfi_size; /* size of this item */ > uint8_t alfi_attr_cnt; /* count of name/val pairs */ > uint8_t __pad1; /* pad to 64 bit aligned */ > uint16_t __pad2; /* pad to 64 bit aligned */ > uint64_t alfi_id; /* attri identifier */ > uint64_t alfi_ino; /* the inode for this attr operation */ > struct xfs_attri_log_name alfi_attr[]; /* attrs to operate on */ > }; > > Basically, the size and shape of the structure has not changed, and > if alfi_attr_cnt == 0 we just treat it as if alfi_attr_cnt == 1 as > the backwards compat code for the existing code. > > And then we just have as many followup regions for name/val pairs > as are defined by the alfi_attr_cnt and alfi_attr[] parts of the > structure. Each attr can have a different operation performed on > them, and they can have different filters applied so they can exist > in different namespaces, too. > > SO I don't think we need a new on-disk feature bit for this > enhancement - it definitely comes under the heading of "this stuff > is experimental, this is the sort of early structure revision that > EXPERIMENTAL is supposed to cover.... You might even callit "alfi_extra_names" to avoid the "0 means 1" stuff. ;) --D > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Mon, 2022-08-15 at 22:07 -0700, Darrick J. Wong wrote: > On Tue, Aug 16, 2022 at 10:54:38AM +1000, Dave Chinner wrote: > > On Thu, Aug 11, 2022 at 06:55:16PM -0700, Alli wrote: > > > On Wed, 2022-08-10 at 16:12 +1000, Dave Chinner wrote: > > > > On Tue, Aug 09, 2022 at 10:01:49PM -0700, Alli wrote: > > > > > On Wed, 2022-08-10 at 11:58 +1000, Dave Chinner wrote: > > > > > > On Tue, Aug 09, 2022 at 09:52:55AM -0700, Darrick J. Wong > > > > > > wrote: > > > > > > > On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison > > > > > > > Henderson > > > > > > > wrote: > > > > > > > > Recent parent pointer testing has exposed a bug in the > > > > > > > > underlying > > > > > > > > attr replay. A multi transaction replay currently > > > > > > > > performs a > > > > > > > > single step of the replay, then deferrs the rest if > > > > > > > > there is > > > > > > > > more > > > > > > > > to do. > > > > > > > > > > > > Yup. > > > > > > > > > > > > > > This causes race conditions with other attr replays > > > > > > > > that > > > > > > > > might be recovered before the remaining deferred work > > > > > > > > has had > > > > > > > > a > > > > > > > > chance to finish. > > > > > > > > > > > > What other attr replays are we racing against? There can > > > > > > only be > > > > > > one incomplete attr item intent/done chain per inode > > > > > > present in > > > > > > log > > > > > > recovery, right? > > > > > No, a rename queues up a set and remove before committing the > > > > > transaction. One for the new parent pointer, and another to > > > > > remove > > > > > the > > > > > old one. > > > > > > > > Ah. That really needs to be described in the commit message - > > > > changing from "single intent chain per object" to "multiple > > > > concurrent independent and unserialised intent chains per > > > > object" is > > > > a pretty important design rule change... > > > > > > > > The whole point of intents is to allow complex, multi-stage > > > > operations on a single object to be sequenced in a tightly > > > > controlled manner. They weren't intended to be run as > > > > concurrent > > > > lines of modification on single items; if you need to do two > > > > modifications on an object, the intent chain ties the two > > > > modifications together into a single whole. > > > > > > > > One of the reasons I rewrote the attr state machine for LARP > > > > was to > > > > enable new multiple attr operation chains to be easily build > > > > from > > > > the entry points the state machien provides. Parent attr rename > > > > needs a new intent chain to be built, not run multiple > > > > independent > > > > intent chains for each modification. > > > > > > > > > It cant be an attr replace because technically the names are > > > > > different. > > > > > > > > I disagree - we have all the pieces we need in the state > > > > machine > > > > already, we just need to define separate attr names for the > > > > remove and insert steps in the attr intent. > > > > > > > > That is, the "replace" operation we execute when an attr set > > > > overwrites the value is "technically" a "replace value" > > > > operation, > > > > but we actually implement it as a "replace entire attribute" > > > > operation. > > > > > > > > Without LARP, we do that overwrite in independent steps via an > > > > intermediate INCOMPLETE state to allow two xattrs of the same > > > > name > > > > to exist in the attr tree at the same time. IOWs, the attr > > > > value > > > > overwrite is effectively a "set-swap-remove" operation on two > > > > entirely independent xattrs, ensuring that if we crash we > > > > always > > > > have either the old or new xattr visible. > > > > > > > > With LARP, we can remove the original attr first, thereby > > > > avoiding > > > > the need for two versions of the xattr to exist in the tree in > > > > the > > > > first place. However, we have to do these two operations as a > > > > pair > > > > of linked independent operations. The intent chain provides the > > > > linking, and requires us to log the name and the value of the > > > > attr > > > > that we are overwriting in the intent. Hence we can always > > > > recover > > > > the modification to completion no matter where in the operation > > > > we > > > > fail. > > > > > > > > When it comes to a parent attr rename operation, we are > > > > effectively > > > > doing two linked operations - remove the old attr, set the new > > > > attr > > > > - on different attributes. Implementation wise, it is exactly > > > > the > > > > same sequence as a "replace value" operation, except for the > > > > fact > > > > that the new attr we add has a different name. > > > > > > > > Hence the only real difference between the existing "attr > > > > replace" > > > > and the intent chain we need for "parent attr rename" is that > > > > we > > > > have to log two attr names instead of one. > > > > > > To be clear, this would imply expanding xfs_attri_log_format to > > > have > > > another alfi_new_name_len feild and another iovec for the attr > > > intent > > > right? Does that cause issues to change the on disk log layout > > > after > > > the original has merged? Or is that ok for things that are still > > > experimental? Thanks! > > > > I think we can get away with this quite easily without breaking the > > existing experimental code. > > > > struct xfs_attri_log_format { > > uint16_t alfi_type; /* attri log item type */ > > uint16_t alfi_size; /* size of this item */ > > uint32_t __pad; /* pad to 64 bit aligned */ > > uint64_t alfi_id; /* attri identifier */ > > uint64_t alfi_ino; /* the inode for this attr > > operation */ > > uint32_t alfi_op_flags; /* marks the op as a set or > > remove */ > > uint32_t alfi_name_len; /* attr name length */ > > uint32_t alfi_value_len; /* attr value length */ > > uint32_t alfi_attr_filter;/* attr filter flags */ > > }; > > > > We have a padding field in there that is currently all zeros. Let's > > make that a count of the number of {name, value} tuples that are > > appended to the format. i.e. > > > > struct xfs_attri_log_name { > > uint32_t alfi_op_flags; /* marks the op as a set or > > remove */ > > uint32_t alfi_name_len; /* attr name length */ > > uint32_t alfi_value_len; /* attr value length */ > > uint32_t alfi_attr_filter;/* attr filter flags */ > > }; > > > > struct xfs_attri_log_format { > > uint16_t alfi_type; /* attri log item type */ > > uint16_t alfi_size; /* size of this item */ > > uint8_t alfi_attr_cnt; /* count of name/val pairs > > */ > > uint8_t __pad1; /* pad to 64 bit > > aligned */ > > uint16_t __pad2; /* pad to 64 bit aligned */ > > uint64_t alfi_id; /* attri identifier */ > > uint64_t alfi_ino; /* the inode for this attr > > operation */ > > struct xfs_attri_log_name alfi_attr[]; /* attrs to operate on > > */ > > }; > > > > Basically, the size and shape of the structure has not changed, and > > if alfi_attr_cnt == 0 we just treat it as if alfi_attr_cnt == 1 as > > the backwards compat code for the existing code. > > > > And then we just have as many followup regions for name/val pairs > > as are defined by the alfi_attr_cnt and alfi_attr[] parts of the > > structure. Each attr can have a different operation performed on > > them, and they can have different filters applied so they can exist > > in different namespaces, too. > > > > SO I don't think we need a new on-disk feature bit for this > > enhancement - it definitely comes under the heading of "this stuff > > is experimental, this is the sort of early structure revision that > > EXPERIMENTAL is supposed to cover.... > > You might even callit "alfi_extra_names" to avoid the "0 means 1" > stuff. > ;) > > --D Oh, I just noticed these comments this morning when I sent out the new attri/d patch. I'll add this changes to v2. Please let me know if there's anything else you'd like me to change from the v1. Thx! Allison > > > Cheers, > > > > Dave. > > -- > > Dave Chinner > > david@fromorbit.com
On Tue, 2022-08-16 at 13:41 -0700, Alli wrote: > On Mon, 2022-08-15 at 22:07 -0700, Darrick J. Wong wrote: > > On Tue, Aug 16, 2022 at 10:54:38AM +1000, Dave Chinner wrote: > > > On Thu, Aug 11, 2022 at 06:55:16PM -0700, Alli wrote: > > > > On Wed, 2022-08-10 at 16:12 +1000, Dave Chinner wrote: > > > > > On Tue, Aug 09, 2022 at 10:01:49PM -0700, Alli wrote: > > > > > > On Wed, 2022-08-10 at 11:58 +1000, Dave Chinner wrote: > > > > > > > On Tue, Aug 09, 2022 at 09:52:55AM -0700, Darrick J. Wong > > > > > > > wrote: > > > > > > > > On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison > > > > > > > > Henderson > > > > > > > > wrote: > > > > > > > > > Recent parent pointer testing has exposed a bug in > > > > > > > > > the > > > > > > > > > underlying > > > > > > > > > attr replay. A multi transaction replay currently > > > > > > > > > performs a > > > > > > > > > single step of the replay, then deferrs the rest if > > > > > > > > > there is > > > > > > > > > more > > > > > > > > > to do. > > > > > > > > > > > > > > Yup. > > > > > > > > > > > > > > > > This causes race conditions with other attr replays > > > > > > > > > that > > > > > > > > > might be recovered before the remaining deferred work > > > > > > > > > has had > > > > > > > > > a > > > > > > > > > chance to finish. > > > > > > > > > > > > > > What other attr replays are we racing against? There can > > > > > > > only be > > > > > > > one incomplete attr item intent/done chain per inode > > > > > > > present in > > > > > > > log > > > > > > > recovery, right? > > > > > > No, a rename queues up a set and remove before committing > > > > > > the > > > > > > transaction. One for the new parent pointer, and another > > > > > > to > > > > > > remove > > > > > > the > > > > > > old one. > > > > > > > > > > Ah. That really needs to be described in the commit message - > > > > > changing from "single intent chain per object" to "multiple > > > > > concurrent independent and unserialised intent chains per > > > > > object" is > > > > > a pretty important design rule change... > > > > > > > > > > The whole point of intents is to allow complex, multi-stage > > > > > operations on a single object to be sequenced in a tightly > > > > > controlled manner. They weren't intended to be run as > > > > > concurrent > > > > > lines of modification on single items; if you need to do two > > > > > modifications on an object, the intent chain ties the two > > > > > modifications together into a single whole. > > > > > > > > > > One of the reasons I rewrote the attr state machine for LARP > > > > > was to > > > > > enable new multiple attr operation chains to be easily build > > > > > from > > > > > the entry points the state machien provides. Parent attr > > > > > rename > > > > > needs a new intent chain to be built, not run multiple > > > > > independent > > > > > intent chains for each modification. > > > > > > > > > > > It cant be an attr replace because technically the names > > > > > > are > > > > > > different. > > > > > > > > > > I disagree - we have all the pieces we need in the state > > > > > machine > > > > > already, we just need to define separate attr names for the > > > > > remove and insert steps in the attr intent. > > > > > > > > > > That is, the "replace" operation we execute when an attr set > > > > > overwrites the value is "technically" a "replace value" > > > > > operation, > > > > > but we actually implement it as a "replace entire attribute" > > > > > operation. > > > > > > > > > > Without LARP, we do that overwrite in independent steps via > > > > > an > > > > > intermediate INCOMPLETE state to allow two xattrs of the same > > > > > name > > > > > to exist in the attr tree at the same time. IOWs, the attr > > > > > value > > > > > overwrite is effectively a "set-swap-remove" operation on two > > > > > entirely independent xattrs, ensuring that if we crash we > > > > > always > > > > > have either the old or new xattr visible. > > > > > > > > > > With LARP, we can remove the original attr first, thereby > > > > > avoiding > > > > > the need for two versions of the xattr to exist in the tree > > > > > in > > > > > the > > > > > first place. However, we have to do these two operations as a > > > > > pair > > > > > of linked independent operations. The intent chain provides > > > > > the > > > > > linking, and requires us to log the name and the value of the > > > > > attr > > > > > that we are overwriting in the intent. Hence we can always > > > > > recover > > > > > the modification to completion no matter where in the > > > > > operation > > > > > we > > > > > fail. > > > > > > > > > > When it comes to a parent attr rename operation, we are > > > > > effectively > > > > > doing two linked operations - remove the old attr, set the > > > > > new > > > > > attr > > > > > - on different attributes. Implementation wise, it is exactly > > > > > the > > > > > same sequence as a "replace value" operation, except for the > > > > > fact > > > > > that the new attr we add has a different name. > > > > > > > > > > Hence the only real difference between the existing "attr > > > > > replace" > > > > > and the intent chain we need for "parent attr rename" is that > > > > > we > > > > > have to log two attr names instead of one. > > > > > > > > To be clear, this would imply expanding xfs_attri_log_format to > > > > have > > > > another alfi_new_name_len feild and another iovec for the attr > > > > intent > > > > right? Does that cause issues to change the on disk log layout > > > > after > > > > the original has merged? Or is that ok for things that are > > > > still > > > > experimental? Thanks! > > > > > > I think we can get away with this quite easily without breaking > > > the > > > existing experimental code. > > > > > > struct xfs_attri_log_format { > > > uint16_t alfi_type; /* attri log item type */ > > > uint16_t alfi_size; /* size of this item */ > > > uint32_t __pad; /* pad to 64 bit aligned > > > */ > > > uint64_t alfi_id; /* attri identifier */ > > > uint64_t alfi_ino; /* the inode for this > > > attr > > > operation */ > > > uint32_t alfi_op_flags; /* marks the op as a set > > > or > > > remove */ > > > uint32_t alfi_name_len; /* attr name length */ > > > uint32_t alfi_value_len; /* attr value length */ > > > uint32_t alfi_attr_filter;/* attr filter flags */ > > > }; > > > > > > We have a padding field in there that is currently all zeros. > > > Let's > > > make that a count of the number of {name, value} tuples that are > > > appended to the format. i.e. > > > > > > struct xfs_attri_log_name { > > > uint32_t alfi_op_flags; /* marks the op as a set > > > or > > > remove */ > > > uint32_t alfi_name_len; /* attr name length */ > > > uint32_t alfi_value_len; /* attr value length */ > > > uint32_t alfi_attr_filter;/* attr filter flags */ > > > }; > > > > > > struct xfs_attri_log_format { > > > uint16_t alfi_type; /* attri log item type */ > > > uint16_t alfi_size; /* size of this item */ > > > uint8_t alfi_attr_cnt; /* count of name/val > > > pairs > > > */ > > > uint8_t __pad1; /* pad to 64 bit > > > aligned */ > > > uint16_t __pad2; /* pad to 64 bit aligned */ > > > uint64_t alfi_id; /* attri identifier */ > > > uint64_t alfi_ino; /* the inode for this > > > attr > > > operation */ > > > struct xfs_attri_log_name alfi_attr[]; /* attrs to operate on > > > */ > > > }; > > > > > > Basically, the size and shape of the structure has not changed, > > > and > > > if alfi_attr_cnt == 0 we just treat it as if alfi_attr_cnt == 1 > > > as > > > the backwards compat code for the existing code. > > > > > > And then we just have as many followup regions for name/val pairs > > > as are defined by the alfi_attr_cnt and alfi_attr[] parts of the > > > structure. Each attr can have a different operation performed on > > > them, and they can have different filters applied so they can > > > exist > > > in different namespaces, too. > > > > > > SO I don't think we need a new on-disk feature bit for this > > > enhancement - it definitely comes under the heading of "this > > > stuff > > > is experimental, this is the sort of early structure revision > > > that > > > EXPERIMENTAL is supposed to cover.... > > > > You might even callit "alfi_extra_names" to avoid the "0 means 1" > > stuff. > > ;) > > > > --D > > Oh, I just noticed these comments this morning when I sent out the > new > attri/d patch. I'll add this changes to v2. Please let me know if > there's anything else you'd like me to change from the v1. Thx! > > Allison Ok, so I am part way through coding this up, and I'm getting this feeling like this is not going to work out very well due to the size checks for the log formats: root@garnet:/home/achender/work_area/xfs-linux# git diff fs/xfs/libxfs/xfs_log_format.h fs/xfs/xfs_ondisk.h diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h index f1ff52ebb982..5a4e700f32fc 100644 --- a/fs/xfs/libxfs/xfs_log_format.h +++ b/fs/xfs/libxfs/xfs_log_format.h @@ -922,6 +922,13 @@ struct xfs_icreate_log { XFS_ATTR_PARENT | \ XFS_ATTR_INCOMPLETE) +struct xfs_attri_log_name { + uint32_t alfi_op_flags; /* marks the op as a set or remove */ + uint32_t alfi_name_len; /* attr name length */ + uint32_t alfi_value_len; /* attr value length */ + uint32_t alfi_attr_filter;/* attr filter flags */ +}; + /* * This is the structure used to lay out an attr log item in the * log. @@ -929,14 +936,12 @@ struct xfs_icreate_log { struct xfs_attri_log_format { uint16_t alfi_type; /* attri log item type */ uint16_t alfi_size; /* size of this item */ - uint32_t __pad; /* pad to 64 bit aligned */ + uint8_t alfi_extra_names;/* count of name/val pairs */ + uint8_t __pad1; /* pad to 64 bit aligned */ + uint16_t __pad2; /* pad to 64 bit aligned */ uint64_t alfi_id; /* attri identifier */ uint64_t alfi_ino; /* the inode for this attr operation */ - uint32_t alfi_op_flags; /* marks the op as a set or remove */ - uint32_t alfi_name_len; /* attr name length */ - uint32_t alfi_value_len; /* attr value length */ - uint32_t alfi_attr_filter;/* attr filter flags */ + struct xfs_attri_log_name alfi_attr[]; /* attrs to operate on */ }; struct xfs_attrd_log_format { diff --git a/fs/xfs/xfs_ondisk.h b/fs/xfs/xfs_ondisk.h index 3e7f7eaa5b96..c040eeb88def 100644 --- a/fs/xfs/xfs_ondisk.h +++ b/fs/xfs/xfs_ondisk.h @@ -132,7 +132,7 @@ xfs_check_ondisk_structs(void) XFS_CHECK_STRUCT_SIZE(struct xfs_inode_log_format, 56); XFS_CHECK_STRUCT_SIZE(struct xfs_qoff_logformat, 20); XFS_CHECK_STRUCT_SIZE(struct xfs_trans_header, 16); - XFS_CHECK_STRUCT_SIZE(struct xfs_attri_log_format, 48); + XFS_CHECK_STRUCT_SIZE(struct xfs_attri_log_format, 24); XFS_CHECK_STRUCT_SIZE(struct xfs_attrd_log_format, 16); /* parent pointer ioctls */ root@garnet:/home/achender/work_area/xfs-linux# If the on disk size check thinks the format is 24 bytes, and then we surprise pack an array of structs after it, isnt that going to run over the next item? I think anything dynamic like this has to be an nvec. Maybe we leave the existing alfi_* as they are so the size doesnt change, and then if we have a value in alfi_extra_names, then we have an extra nvec that has the array in it. I think that would work. FWIW, an alternate solution would be to use the pad for a second name length, and then we get a patch that's very similar to the one I sent out last Tues, but backward compatible. Though it does eat the remaining pad and wouldn't be as flexible, I cant think of an attr op that would need more than two names either? Let me know what people think. Thanks! Allison > > > Cheers, > > > > > > Dave. > > > -- > > > Dave Chinner > > > david@fromorbit.com
On Thu, Aug 18, 2022 at 06:05:54PM -0700, Alli wrote: > On Tue, 2022-08-16 at 13:41 -0700, Alli wrote: > > On Mon, 2022-08-15 at 22:07 -0700, Darrick J. Wong wrote: > > > On Tue, Aug 16, 2022 at 10:54:38AM +1000, Dave Chinner wrote: > > > > On Thu, Aug 11, 2022 at 06:55:16PM -0700, Alli wrote: > > > > > On Wed, 2022-08-10 at 16:12 +1000, Dave Chinner wrote: > > > > > > On Tue, Aug 09, 2022 at 10:01:49PM -0700, Alli wrote: > > > > > > > On Wed, 2022-08-10 at 11:58 +1000, Dave Chinner wrote: > > > > > > > > On Tue, Aug 09, 2022 at 09:52:55AM -0700, Darrick J. Wong > > > > > > > > wrote: > > > > > > > > > On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison > > > > > > > > > Henderson > > > > > > > > > wrote: > > > > > > > > > > Recent parent pointer testing has exposed a bug in > > > > > > > > > > the > > > > > > > > > > underlying > > > > > > > > > > attr replay. A multi transaction replay currently > > > > > > > > > > performs a > > > > > > > > > > single step of the replay, then deferrs the rest if > > > > > > > > > > there is > > > > > > > > > > more > > > > > > > > > > to do. > > > > > > > > > > > > > > > > Yup. > > > > > > > > > > > > > > > > > > This causes race conditions with other attr replays > > > > > > > > > > that > > > > > > > > > > might be recovered before the remaining deferred work > > > > > > > > > > has had > > > > > > > > > > a > > > > > > > > > > chance to finish. > > > > > > > > > > > > > > > > What other attr replays are we racing against? There can > > > > > > > > only be > > > > > > > > one incomplete attr item intent/done chain per inode > > > > > > > > present in > > > > > > > > log > > > > > > > > recovery, right? > > > > > > > No, a rename queues up a set and remove before committing > > > > > > > the > > > > > > > transaction. One for the new parent pointer, and another > > > > > > > to > > > > > > > remove > > > > > > > the > > > > > > > old one. > > > > > > > > > > > > Ah. That really needs to be described in the commit message - > > > > > > changing from "single intent chain per object" to "multiple > > > > > > concurrent independent and unserialised intent chains per > > > > > > object" is > > > > > > a pretty important design rule change... > > > > > > > > > > > > The whole point of intents is to allow complex, multi-stage > > > > > > operations on a single object to be sequenced in a tightly > > > > > > controlled manner. They weren't intended to be run as > > > > > > concurrent > > > > > > lines of modification on single items; if you need to do two > > > > > > modifications on an object, the intent chain ties the two > > > > > > modifications together into a single whole. > > > > > > > > > > > > One of the reasons I rewrote the attr state machine for LARP > > > > > > was to > > > > > > enable new multiple attr operation chains to be easily build > > > > > > from > > > > > > the entry points the state machien provides. Parent attr > > > > > > rename > > > > > > needs a new intent chain to be built, not run multiple > > > > > > independent > > > > > > intent chains for each modification. > > > > > > > > > > > > > It cant be an attr replace because technically the names > > > > > > > are > > > > > > > different. > > > > > > > > > > > > I disagree - we have all the pieces we need in the state > > > > > > machine > > > > > > already, we just need to define separate attr names for the > > > > > > remove and insert steps in the attr intent. > > > > > > > > > > > > That is, the "replace" operation we execute when an attr set > > > > > > overwrites the value is "technically" a "replace value" > > > > > > operation, > > > > > > but we actually implement it as a "replace entire attribute" > > > > > > operation. > > > > > > > > > > > > Without LARP, we do that overwrite in independent steps via > > > > > > an > > > > > > intermediate INCOMPLETE state to allow two xattrs of the same > > > > > > name > > > > > > to exist in the attr tree at the same time. IOWs, the attr > > > > > > value > > > > > > overwrite is effectively a "set-swap-remove" operation on two > > > > > > entirely independent xattrs, ensuring that if we crash we > > > > > > always > > > > > > have either the old or new xattr visible. > > > > > > > > > > > > With LARP, we can remove the original attr first, thereby > > > > > > avoiding > > > > > > the need for two versions of the xattr to exist in the tree > > > > > > in > > > > > > the > > > > > > first place. However, we have to do these two operations as a > > > > > > pair > > > > > > of linked independent operations. The intent chain provides > > > > > > the > > > > > > linking, and requires us to log the name and the value of the > > > > > > attr > > > > > > that we are overwriting in the intent. Hence we can always > > > > > > recover > > > > > > the modification to completion no matter where in the > > > > > > operation > > > > > > we > > > > > > fail. > > > > > > > > > > > > When it comes to a parent attr rename operation, we are > > > > > > effectively > > > > > > doing two linked operations - remove the old attr, set the > > > > > > new > > > > > > attr > > > > > > - on different attributes. Implementation wise, it is exactly > > > > > > the > > > > > > same sequence as a "replace value" operation, except for the > > > > > > fact > > > > > > that the new attr we add has a different name. > > > > > > > > > > > > Hence the only real difference between the existing "attr > > > > > > replace" > > > > > > and the intent chain we need for "parent attr rename" is that > > > > > > we > > > > > > have to log two attr names instead of one. > > > > > > > > > > To be clear, this would imply expanding xfs_attri_log_format to > > > > > have > > > > > another alfi_new_name_len feild and another iovec for the attr > > > > > intent > > > > > right? Does that cause issues to change the on disk log layout > > > > > after > > > > > the original has merged? Or is that ok for things that are > > > > > still > > > > > experimental? Thanks! > > > > > > > > I think we can get away with this quite easily without breaking > > > > the > > > > existing experimental code. > > > > > > > > struct xfs_attri_log_format { > > > > uint16_t alfi_type; /* attri log item type */ > > > > uint16_t alfi_size; /* size of this item */ > > > > uint32_t __pad; /* pad to 64 bit aligned > > > > */ > > > > uint64_t alfi_id; /* attri identifier */ > > > > uint64_t alfi_ino; /* the inode for this > > > > attr > > > > operation */ > > > > uint32_t alfi_op_flags; /* marks the op as a set > > > > or > > > > remove */ > > > > uint32_t alfi_name_len; /* attr name length */ > > > > uint32_t alfi_value_len; /* attr value length */ > > > > uint32_t alfi_attr_filter;/* attr filter flags */ > > > > }; > > > > > > > > We have a padding field in there that is currently all zeros. > > > > Let's > > > > make that a count of the number of {name, value} tuples that are > > > > appended to the format. i.e. > > > > > > > > struct xfs_attri_log_name { > > > > uint32_t alfi_op_flags; /* marks the op as a set > > > > or > > > > remove */ > > > > uint32_t alfi_name_len; /* attr name length */ > > > > uint32_t alfi_value_len; /* attr value length */ > > > > uint32_t alfi_attr_filter;/* attr filter flags */ > > > > }; > > > > > > > > struct xfs_attri_log_format { > > > > uint16_t alfi_type; /* attri log item type */ > > > > uint16_t alfi_size; /* size of this item */ > > > > uint8_t alfi_attr_cnt; /* count of name/val > > > > pairs > > > > */ > > > > uint8_t __pad1; /* pad to 64 bit > > > > aligned */ > > > > uint16_t __pad2; /* pad to 64 bit aligned */ > > > > uint64_t alfi_id; /* attri identifier */ > > > > uint64_t alfi_ino; /* the inode for this > > > > attr > > > > operation */ > > > > struct xfs_attri_log_name alfi_attr[]; /* attrs to operate on > > > > */ > > > > }; > > > > > > > > Basically, the size and shape of the structure has not changed, > > > > and > > > > if alfi_attr_cnt == 0 we just treat it as if alfi_attr_cnt == 1 > > > > as > > > > the backwards compat code for the existing code. > > > > > > > > And then we just have as many followup regions for name/val pairs > > > > as are defined by the alfi_attr_cnt and alfi_attr[] parts of the > > > > structure. Each attr can have a different operation performed on > > > > them, and they can have different filters applied so they can > > > > exist > > > > in different namespaces, too. > > > > > > > > SO I don't think we need a new on-disk feature bit for this > > > > enhancement - it definitely comes under the heading of "this > > > > stuff > > > > is experimental, this is the sort of early structure revision > > > > that > > > > EXPERIMENTAL is supposed to cover.... > > > > > > You might even callit "alfi_extra_names" to avoid the "0 means 1" > > > stuff. > > > ;) > > > > > > --D > > > > Oh, I just noticed these comments this morning when I sent out the > > new > > attri/d patch. I'll add this changes to v2. Please let me know if > > there's anything else you'd like me to change from the v1. Thx! > > > > Allison > > Ok, so I am part way through coding this up, and I'm getting this > feeling like this is not going to work out very well due to the size > checks for the log formats: > > root@garnet:/home/achender/work_area/xfs-linux# git diff > fs/xfs/libxfs/xfs_log_format.h fs/xfs/xfs_ondisk.h > diff --git a/fs/xfs/libxfs/xfs_log_format.h > b/fs/xfs/libxfs/xfs_log_format.h > index f1ff52ebb982..5a4e700f32fc 100644 > --- a/fs/xfs/libxfs/xfs_log_format.h > +++ b/fs/xfs/libxfs/xfs_log_format.h > @@ -922,6 +922,13 @@ struct xfs_icreate_log { > XFS_ATTR_PARENT | \ > XFS_ATTR_INCOMPLETE) > > +struct xfs_attri_log_name { > + uint32_t alfi_op_flags; /* marks the op as a set or > remove */ > + uint32_t alfi_name_len; /* attr name length */ > + uint32_t alfi_value_len; /* attr value length */ > + uint32_t alfi_attr_filter;/* attr filter flags */ > +}; > + > /* > * This is the structure used to lay out an attr log item in the > * log. > @@ -929,14 +936,12 @@ struct xfs_icreate_log { > struct xfs_attri_log_format { > uint16_t alfi_type; /* attri log item type */ > uint16_t alfi_size; /* size of this item */ > - uint32_t __pad; /* pad to 64 bit aligned */ > + uint8_t alfi_extra_names;/* count of name/val pairs */ > + uint8_t __pad1; /* pad to 64 bit aligned */ > + uint16_t __pad2; /* pad to 64 bit aligned */ > uint64_t alfi_id; /* attri identifier */ > uint64_t alfi_ino; /* the inode for this attr > operation */ > - uint32_t alfi_op_flags; /* marks the op as a set or > remove */ > - uint32_t alfi_name_len; /* attr name length */ > - uint32_t alfi_value_len; /* attr value length */ > - uint32_t alfi_attr_filter;/* attr filter flags */ > + struct xfs_attri_log_name alfi_attr[]; /* attrs to operate on What's the length of this VLA? 1 for a normal SET or REPLACE operation, and 2 for the "rename and replace value" operation? If so, why do we need two xfs_attri_log_name structures? The old value is unimportant, so we only need one alfi_value_len per operation. Each xfs_attri_log_format only describes one change, so it only needs one alfi_op_flags per op. For now I also don't think attributes should be able to jump namespaces, so we'd only need one alfi_attr_filter per op as well. *lightbulb comes on* Oops, I think I led you astray with my unfortunate comment. :( IOWs, the only change to struct xfs_attri_log_format is: - uint32_t __pad; /* pad to 64 bit aligned */ + uint32_t alfi_new_namelen;/* new attr name length */ and the rest of the changes in "[PATCH] xfs: Add new name to attri/d" are more or less fine as is. I'll go reply to that before I get back to Dave's log accounting stuff. --D > */ > }; > > struct xfs_attrd_log_format { > diff --git a/fs/xfs/xfs_ondisk.h b/fs/xfs/xfs_ondisk.h > index 3e7f7eaa5b96..c040eeb88def 100644 > --- a/fs/xfs/xfs_ondisk.h > +++ b/fs/xfs/xfs_ondisk.h > @@ -132,7 +132,7 @@ xfs_check_ondisk_structs(void) > XFS_CHECK_STRUCT_SIZE(struct xfs_inode_log_format, 56); > XFS_CHECK_STRUCT_SIZE(struct xfs_qoff_logformat, 20); > XFS_CHECK_STRUCT_SIZE(struct xfs_trans_header, 16); > - XFS_CHECK_STRUCT_SIZE(struct xfs_attri_log_format, 48); > + XFS_CHECK_STRUCT_SIZE(struct xfs_attri_log_format, 24); > XFS_CHECK_STRUCT_SIZE(struct xfs_attrd_log_format, 16); > > /* parent pointer ioctls */ > root@garnet:/home/achender/work_area/xfs-linux# > > > > If the on disk size check thinks the format is 24 bytes, and then we > surprise pack an array of structs after it, isnt that going to run over > the next item? I think anything dynamic like this has to be an nvec. > Maybe we leave the existing alfi_* as they are so the size doesnt > change, and then if we have a value in alfi_extra_names, then we have > an extra nvec that has the array in it. I think that would work. > > FWIW, an alternate solution would be to use the pad for a second name > length, and then we get a patch that's very similar to the one I sent > out last Tues, but backward compatible. Though it does eat the > remaining pad and wouldn't be as flexible, I cant think of an attr op > that would need more than two names either? > > Let me know what people think. Thanks! > Allison > > > > > > Cheers, > > > > > > > > Dave. > > > > -- > > > > Dave Chinner > > > > david@fromorbit.com >
On Tue, 2022-08-23 at 08:07 -0700, Darrick J. Wong wrote: > On Thu, Aug 18, 2022 at 06:05:54PM -0700, Alli wrote: > > On Tue, 2022-08-16 at 13:41 -0700, Alli wrote: > > > On Mon, 2022-08-15 at 22:07 -0700, Darrick J. Wong wrote: > > > > On Tue, Aug 16, 2022 at 10:54:38AM +1000, Dave Chinner wrote: > > > > > On Thu, Aug 11, 2022 at 06:55:16PM -0700, Alli wrote: > > > > > > On Wed, 2022-08-10 at 16:12 +1000, Dave Chinner wrote: > > > > > > > On Tue, Aug 09, 2022 at 10:01:49PM -0700, Alli wrote: > > > > > > > > On Wed, 2022-08-10 at 11:58 +1000, Dave Chinner wrote: > > > > > > > > > On Tue, Aug 09, 2022 at 09:52:55AM -0700, Darrick J. > > > > > > > > > Wong > > > > > > > > > wrote: > > > > > > > > > > On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison > > > > > > > > > > Henderson > > > > > > > > > > wrote: > > > > > > > > > > > Recent parent pointer testing has exposed a bug > > > > > > > > > > > in > > > > > > > > > > > the > > > > > > > > > > > underlying > > > > > > > > > > > attr replay. A multi transaction replay > > > > > > > > > > > currently > > > > > > > > > > > performs a > > > > > > > > > > > single step of the replay, then deferrs the rest > > > > > > > > > > > if > > > > > > > > > > > there is > > > > > > > > > > > more > > > > > > > > > > > to do. > > > > > > > > > > > > > > > > > > Yup. > > > > > > > > > > > > > > > > > > > > This causes race conditions with other attr > > > > > > > > > > > replays > > > > > > > > > > > that > > > > > > > > > > > might be recovered before the remaining deferred > > > > > > > > > > > work > > > > > > > > > > > has had > > > > > > > > > > > a > > > > > > > > > > > chance to finish. > > > > > > > > > > > > > > > > > > What other attr replays are we racing against? There > > > > > > > > > can > > > > > > > > > only be > > > > > > > > > one incomplete attr item intent/done chain per inode > > > > > > > > > present in > > > > > > > > > log > > > > > > > > > recovery, right? > > > > > > > > No, a rename queues up a set and remove before > > > > > > > > committing > > > > > > > > the > > > > > > > > transaction. One for the new parent pointer, and > > > > > > > > another > > > > > > > > to > > > > > > > > remove > > > > > > > > the > > > > > > > > old one. > > > > > > > > > > > > > > Ah. That really needs to be described in the commit > > > > > > > message - > > > > > > > changing from "single intent chain per object" to > > > > > > > "multiple > > > > > > > concurrent independent and unserialised intent chains per > > > > > > > object" is > > > > > > > a pretty important design rule change... > > > > > > > > > > > > > > The whole point of intents is to allow complex, multi- > > > > > > > stage > > > > > > > operations on a single object to be sequenced in a > > > > > > > tightly > > > > > > > controlled manner. They weren't intended to be run as > > > > > > > concurrent > > > > > > > lines of modification on single items; if you need to do > > > > > > > two > > > > > > > modifications on an object, the intent chain ties the two > > > > > > > modifications together into a single whole. > > > > > > > > > > > > > > One of the reasons I rewrote the attr state machine for > > > > > > > LARP > > > > > > > was to > > > > > > > enable new multiple attr operation chains to be easily > > > > > > > build > > > > > > > from > > > > > > > the entry points the state machien provides. Parent attr > > > > > > > rename > > > > > > > needs a new intent chain to be built, not run multiple > > > > > > > independent > > > > > > > intent chains for each modification. > > > > > > > > > > > > > > > It cant be an attr replace because technically the > > > > > > > > names > > > > > > > > are > > > > > > > > different. > > > > > > > > > > > > > > I disagree - we have all the pieces we need in the state > > > > > > > machine > > > > > > > already, we just need to define separate attr names for > > > > > > > the > > > > > > > remove and insert steps in the attr intent. > > > > > > > > > > > > > > That is, the "replace" operation we execute when an attr > > > > > > > set > > > > > > > overwrites the value is "technically" a "replace value" > > > > > > > operation, > > > > > > > but we actually implement it as a "replace entire > > > > > > > attribute" > > > > > > > operation. > > > > > > > > > > > > > > Without LARP, we do that overwrite in independent steps > > > > > > > via > > > > > > > an > > > > > > > intermediate INCOMPLETE state to allow two xattrs of the > > > > > > > same > > > > > > > name > > > > > > > to exist in the attr tree at the same time. IOWs, the > > > > > > > attr > > > > > > > value > > > > > > > overwrite is effectively a "set-swap-remove" operation on > > > > > > > two > > > > > > > entirely independent xattrs, ensuring that if we crash we > > > > > > > always > > > > > > > have either the old or new xattr visible. > > > > > > > > > > > > > > With LARP, we can remove the original attr first, thereby > > > > > > > avoiding > > > > > > > the need for two versions of the xattr to exist in the > > > > > > > tree > > > > > > > in > > > > > > > the > > > > > > > first place. However, we have to do these two operations > > > > > > > as a > > > > > > > pair > > > > > > > of linked independent operations. The intent chain > > > > > > > provides > > > > > > > the > > > > > > > linking, and requires us to log the name and the value of > > > > > > > the > > > > > > > attr > > > > > > > that we are overwriting in the intent. Hence we can > > > > > > > always > > > > > > > recover > > > > > > > the modification to completion no matter where in the > > > > > > > operation > > > > > > > we > > > > > > > fail. > > > > > > > > > > > > > > When it comes to a parent attr rename operation, we are > > > > > > > effectively > > > > > > > doing two linked operations - remove the old attr, set > > > > > > > the > > > > > > > new > > > > > > > attr > > > > > > > - on different attributes. Implementation wise, it is > > > > > > > exactly > > > > > > > the > > > > > > > same sequence as a "replace value" operation, except for > > > > > > > the > > > > > > > fact > > > > > > > that the new attr we add has a different name. > > > > > > > > > > > > > > Hence the only real difference between the existing "attr > > > > > > > replace" > > > > > > > and the intent chain we need for "parent attr rename" is > > > > > > > that > > > > > > > we > > > > > > > have to log two attr names instead of one. > > > > > > > > > > > > To be clear, this would imply expanding > > > > > > xfs_attri_log_format to > > > > > > have > > > > > > another alfi_new_name_len feild and another iovec for the > > > > > > attr > > > > > > intent > > > > > > right? Does that cause issues to change the on disk log > > > > > > layout > > > > > > after > > > > > > the original has merged? Or is that ok for things that are > > > > > > still > > > > > > experimental? Thanks! > > > > > > > > > > I think we can get away with this quite easily without > > > > > breaking > > > > > the > > > > > existing experimental code. > > > > > > > > > > struct xfs_attri_log_format { > > > > > uint16_t alfi_type; /* attri log item > > > > > type */ > > > > > uint16_t alfi_size; /* size of this item > > > > > */ > > > > > uint32_t __pad; /* pad to 64 bit > > > > > aligned > > > > > */ > > > > > uint64_t alfi_id; /* attri identifier > > > > > */ > > > > > uint64_t alfi_ino; /* the inode for this > > > > > attr > > > > > operation */ > > > > > uint32_t alfi_op_flags; /* marks the op as a > > > > > set > > > > > or > > > > > remove */ > > > > > uint32_t alfi_name_len; /* attr name length > > > > > */ > > > > > uint32_t alfi_value_len; /* attr value length > > > > > */ > > > > > uint32_t alfi_attr_filter;/* attr filter flags > > > > > */ > > > > > }; > > > > > > > > > > We have a padding field in there that is currently all zeros. > > > > > Let's > > > > > make that a count of the number of {name, value} tuples that > > > > > are > > > > > appended to the format. i.e. > > > > > > > > > > struct xfs_attri_log_name { > > > > > uint32_t alfi_op_flags; /* marks the op as a > > > > > set > > > > > or > > > > > remove */ > > > > > uint32_t alfi_name_len; /* attr name length > > > > > */ > > > > > uint32_t alfi_value_len; /* attr value length > > > > > */ > > > > > uint32_t alfi_attr_filter;/* attr filter flags > > > > > */ > > > > > }; > > > > > > > > > > struct xfs_attri_log_format { > > > > > uint16_t alfi_type; /* attri log item > > > > > type */ > > > > > uint16_t alfi_size; /* size of this item > > > > > */ > > > > > uint8_t alfi_attr_cnt; /* count of name/val > > > > > pairs > > > > > */ > > > > > uint8_t __pad1; /* pad to 64 > > > > > bit > > > > > aligned */ > > > > > uint16_t __pad2; /* pad to 64 bit > > > > > aligned */ > > > > > uint64_t alfi_id; /* attri identifier > > > > > */ > > > > > uint64_t alfi_ino; /* the inode for this > > > > > attr > > > > > operation */ > > > > > struct xfs_attri_log_name alfi_attr[]; /* attrs to > > > > > operate on > > > > > */ > > > > > }; > > > > > > > > > > Basically, the size and shape of the structure has not > > > > > changed, > > > > > and > > > > > if alfi_attr_cnt == 0 we just treat it as if alfi_attr_cnt == > > > > > 1 > > > > > as > > > > > the backwards compat code for the existing code. > > > > > > > > > > And then we just have as many followup regions for name/val > > > > > pairs > > > > > as are defined by the alfi_attr_cnt and alfi_attr[] parts of > > > > > the > > > > > structure. Each attr can have a different operation performed > > > > > on > > > > > them, and they can have different filters applied so they can > > > > > exist > > > > > in different namespaces, too. > > > > > > > > > > SO I don't think we need a new on-disk feature bit for this > > > > > enhancement - it definitely comes under the heading of "this > > > > > stuff > > > > > is experimental, this is the sort of early structure revision > > > > > that > > > > > EXPERIMENTAL is supposed to cover.... > > > > > > > > You might even callit "alfi_extra_names" to avoid the "0 means > > > > 1" > > > > stuff. > > > > ;) > > > > > > > > --D > > > > > > Oh, I just noticed these comments this morning when I sent out > > > the > > > new > > > attri/d patch. I'll add this changes to v2. Please let me know > > > if > > > there's anything else you'd like me to change from the v1. Thx! > > > > > > Allison > > > > Ok, so I am part way through coding this up, and I'm getting this > > feeling like this is not going to work out very well due to the > > size > > checks for the log formats: > > > > root@garnet:/home/achender/work_area/xfs-linux# git diff > > fs/xfs/libxfs/xfs_log_format.h fs/xfs/xfs_ondisk.h > > diff --git a/fs/xfs/libxfs/xfs_log_format.h > > b/fs/xfs/libxfs/xfs_log_format.h > > index f1ff52ebb982..5a4e700f32fc 100644 > > --- a/fs/xfs/libxfs/xfs_log_format.h > > +++ b/fs/xfs/libxfs/xfs_log_format.h > > @@ -922,6 +922,13 @@ struct xfs_icreate_log { > > XFS_ATTR_PARENT | \ > > XFS_ATTR_INCOMPLETE) > > > > +struct xfs_attri_log_name { > > + uint32_t alfi_op_flags; /* marks the op as a set or > > remove */ > > + uint32_t alfi_name_len; /* attr name length */ > > + uint32_t alfi_value_len; /* attr value length */ > > + uint32_t alfi_attr_filter;/* attr filter flags */ > > +}; > > + > > /* > > * This is the structure used to lay out an attr log item in the > > * log. > > @@ -929,14 +936,12 @@ struct xfs_icreate_log { > > struct xfs_attri_log_format { > > uint16_t alfi_type; /* attri log item type */ > > uint16_t alfi_size; /* size of this item */ > > - uint32_t __pad; /* pad to 64 bit aligned */ > > + uint8_t alfi_extra_names;/* count of name/val pairs > > */ > > + uint8_t __pad1; /* pad to 64 bit aligned */ > > + uint16_t __pad2; /* pad to 64 bit aligned */ > > uint64_t alfi_id; /* attri identifier */ > > uint64_t alfi_ino; /* the inode for this attr > > operation */ > > - uint32_t alfi_op_flags; /* marks the op as a set or > > remove */ > > - uint32_t alfi_name_len; /* attr name length */ > > - uint32_t alfi_value_len; /* attr value length */ > > - uint32_t alfi_attr_filter;/* attr filter flags */ > > + struct xfs_attri_log_name alfi_attr[]; /* attrs to operate > > on > > What's the length of this VLA? 1 for a normal SET or REPLACE > operation, and 2 for the "rename and replace value" operation? > > If so, why do we need two xfs_attri_log_name structures? The old > value > is unimportant, so we only need one alfi_value_len per > operation. Each > xfs_attri_log_format only describes one change, so it only needs one > alfi_op_flags per op. > > For now I also don't think attributes should be able to jump > namespaces, > so we'd only need one alfi_attr_filter per op as well. > > *lightbulb comes on* Oops, I think I led you astray with my > unfortunate > comment. :( > > IOWs, the only change to struct xfs_attri_log_format is: > > - uint32_t __pad; /* pad to 64 bit aligned */ > + uint32_t alfi_new_namelen;/* new attr name length */ > > and the rest of the changes in "[PATCH] xfs: Add new name to attri/d" > are more or less fine as is. > > I'll go reply to that before I get back to Dave's log accounting > stuff. > > --D Alrighty, I think thats the simplest solution for now. Will switch to that thread.... > > > */ > > }; > > > > struct xfs_attrd_log_format { > > diff --git a/fs/xfs/xfs_ondisk.h b/fs/xfs/xfs_ondisk.h > > index 3e7f7eaa5b96..c040eeb88def 100644 > > --- a/fs/xfs/xfs_ondisk.h > > +++ b/fs/xfs/xfs_ondisk.h > > @@ -132,7 +132,7 @@ xfs_check_ondisk_structs(void) > > XFS_CHECK_STRUCT_SIZE(struct > > xfs_inode_log_format, 56); > > XFS_CHECK_STRUCT_SIZE(struct > > xfs_qoff_logformat, 20); > > XFS_CHECK_STRUCT_SIZE(struct > > xfs_trans_header, 16); > > - XFS_CHECK_STRUCT_SIZE(struct > > xfs_attri_log_format, 48); > > + XFS_CHECK_STRUCT_SIZE(struct > > xfs_attri_log_format, 24); > > XFS_CHECK_STRUCT_SIZE(struct > > xfs_attrd_log_format, 16); > > > > /* parent pointer ioctls */ > > root@garnet:/home/achender/work_area/xfs-linux# > > > > > > > > If the on disk size check thinks the format is 24 bytes, and then > > we > > surprise pack an array of structs after it, isnt that going to run > > over > > the next item? I think anything dynamic like this has to be an > > nvec. > > Maybe we leave the existing alfi_* as they are so the size doesnt > > change, and then if we have a value in alfi_extra_names, then we > > have > > an extra nvec that has the array in it. I think that would work. > > > > FWIW, an alternate solution would be to use the pad for a second > > name > > length, and then we get a patch that's very similar to the one I > > sent > > out last Tues, but backward compatible. Though it does eat the > > remaining pad and wouldn't be as flexible, I cant think of an attr > > op > > that would need more than two names either? > > > > Let me know what people think. Thanks! > > Allison > > > > > > > > > Cheers, > > > > > > > > > > Dave. > > > > > -- > > > > > Dave Chinner > > > > > david@fromorbit.com
diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c index 5077a7ad5646..c13d724a3e13 100644 --- a/fs/xfs/xfs_attr_item.c +++ b/fs/xfs/xfs_attr_item.c @@ -635,52 +635,33 @@ xfs_attri_item_recover( break; case XFS_ATTRI_OP_FLAGS_REMOVE: if (!xfs_inode_hasattr(args->dp)) - goto out; + return 0; attr->xattri_dela_state = xfs_attr_init_remove_state(args); break; default: ASSERT(0); - error = -EFSCORRUPTED; - goto out; + return -EFSCORRUPTED; } xfs_init_attr_trans(args, &tres, &total); error = xfs_trans_alloc(mp, &tres, total, 0, XFS_TRANS_RESERVE, &tp); if (error) - goto out; + return error; args->trans = tp; done_item = xfs_trans_get_attrd(tp, attrip); + args->trans->t_flags |= XFS_TRANS_HAS_INTENT_DONE; + set_bit(XFS_LI_DIRTY, &done_item->attrd_item.li_flags); xfs_ilock(ip, XFS_ILOCK_EXCL); xfs_trans_ijoin(tp, ip, 0); - error = xfs_xattri_finish_update(attr, done_item); - if (error == -EAGAIN) { - /* - * There's more work to do, so add the intent item to this - * transaction so that we can continue it later. - */ - xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list); - error = xfs_defer_ops_capture_and_commit(tp, capture_list); - if (error) - goto out_unlock; - - xfs_iunlock(ip, XFS_ILOCK_EXCL); - xfs_irele(ip); - return 0; - } - if (error) { - xfs_trans_cancel(tp); - goto out_unlock; - } - + xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list); error = xfs_defer_ops_capture_and_commit(tp, capture_list); -out_unlock: + xfs_iunlock(ip, XFS_ILOCK_EXCL); xfs_irele(ip); -out: - xfs_attr_free_item(attr); + return error; }
Recent parent pointer testing has exposed a bug in the underlying attr replay. A multi transaction replay currently performs a single step of the replay, then deferrs the rest if there is more to do. This causes race conditions with other attr replays that might be recovered before the remaining deferred work has had a chance to finish. This can lead to interleaved set and remove operations that may clobber the attribute fork. Fix this by deferring all work for any attribute operation. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> --- fs/xfs/xfs_attr_item.c | 35 ++++++++--------------------------- 1 file changed, 8 insertions(+), 27 deletions(-)