diff mbox

[04/71] xfs: defer should allow ->finish_item to request a new transaction

Message ID 147216794133.867.4063030531885190227.stgit@birch.djwong.org (mailing list archive)
State Superseded, archived
Headers show

Commit Message

Darrick J. Wong Aug. 25, 2016, 11:32 p.m. UTC
When xfs_defer_finish calls ->finish_item, it's possible that
(refcount) won't be able to finish all the work in a single
transaction.  When this happens, the ->finish_item handler should
shorten the log done item's list count, update the work item to
reflect where work should continue, and return -EAGAIN so that
defer_finish knows to retain the pending item on the pending list,
roll the transaction, and restart processing where we left off.

Plumb in the code and document how this mechanism is supposed to work.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_defer.c |   78 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 71 insertions(+), 7 deletions(-)

Comments

Christoph Hellwig Sept. 6, 2016, 6:38 a.m. UTC | #1
On Thu, Aug 25, 2016 at 04:32:21PM -0700, Darrick J. Wong wrote:
> When xfs_defer_finish calls ->finish_item, it's possible that
> (refcount) won't be able to finish all the work in a single
> transaction.  When this happens, the ->finish_item handler should
> shorten the log done item's list count, update the work item to
> reflect where work should continue, and return -EAGAIN so that
> defer_finish knows to retain the pending item on the pending list,
> roll the transaction, and restart processing where we left off.
> 
> Plumb in the code and document how this mechanism is supposed to work.

I've voiced this before, and I will again:  I think the whole xfs_defer
mechanism is a major, major mistake.  All three users look somewhat
similar, but actually are different underneath.  Especially the extent
free case is a lot simpler than all others, and we now place the burden
of a complex abstraction onto code that would otherwise be fairly
easily understandable.

Also it adds a memory allocation to the extent free code, and gets in
the way of merging the freed extent and extent busy structures,
something I prototyped a while ago.
Darrick J. Wong Sept. 6, 2016, 11:57 p.m. UTC | #2
On Mon, Sep 05, 2016 at 11:38:15PM -0700, Christoph Hellwig wrote:
> On Thu, Aug 25, 2016 at 04:32:21PM -0700, Darrick J. Wong wrote:
> > When xfs_defer_finish calls ->finish_item, it's possible that
> > (refcount) won't be able to finish all the work in a single
> > transaction.  When this happens, the ->finish_item handler should
> > shorten the log done item's list count, update the work item to
> > reflect where work should continue, and return -EAGAIN so that
> > defer_finish knows to retain the pending item on the pending list,
> > roll the transaction, and restart processing where we left off.
> > 
> > Plumb in the code and document how this mechanism is supposed to work.
> 
> I've voiced this before, and I will again:  I think the whole xfs_defer
> mechanism is a major, major mistake.  All three users look somewhat
> similar, but actually are different underneath

I agree with you that the four users are different underneath, and that
shoehorning all four through the same xfs_defer has its risks.  I'm not
sure specifically what's making you nervous, so I'll write a little
about how I arrived at the xfs_defer_ops design, and hopefully that'll
be enough of a jumping off point to discuss from there?

Reworking the existing xfs_bmap_free_t code was unsettling in the usual
sense of "am I really grokking this thing?"  The refcount item code
requiring the ability to roll a transaction midway through finishing the
item is not something the other three items care about.  There's an
additional question about the interaction between extent free and busy,
which I'll get to further down.

That said, prior to writing the xfs_defer code (that I submitted for
4.8), I *did* iterate through various designs, all of which eventually
failed.  The problem I'm dealing with here is this -- in the new world
of rmap and refcount, a data block mapping transaction can cause an
intense storm of metadata updates.  In the old days we only had to deal
with (in the free case):

1) update extent map,
2) alloc/free bmbt block,
3) free data extent

Now that process expands (explodes?) to:

1) update extent map,
2) bmbt block alloc/free,
3) bmbt rmap update,
4) rmapbt shape changes,
5) freeing(?) rmapbt blocks,
6) data block rmapbt update,
7) rmapbt shape change for bmbt update,
8) freeing(?) rmapbt blocks,
9) refcountbt update, 
A) rmapbt shape change for refcountbt update,
B) freeing(?) rmapbt blocks,
C) data block freeing,

In the bad first implementation I simply tried to cram all of these
updates into a single big transaction, which lead to reservation
blowouts and deadlocks when we have to perform multiple allocations for
data blocks and bmbt blocks.

The next design left the old xfs_bmap_free code mostly intact and
provided separate handlers for the rmap/refcount/bmap updates as you
suggested during the rmap review, with just enough of a defer_ops to
point to the four types of deferred ops that could go with each
transaction.  Each metadata update type was logged in order of type;
this proved insufficient because the various metadata updates are
interdependent -- for example, if adding an rmapbt record causes an
rmapbt block to be freed, we have to ensure that the rmap update and RUD
commit before we start working on the extent freeing.  Were this order
not maintained, we could crash after the EFD gets logged but before the
rmap update commits, and replay would retry the rmap update and free the
block again.  So, separately maintained lists of deferred operations
didn't work.

This resulted in the third (and current) deferred ops design.  Central
coordination between deferred ops is stronger than it was before.
Deferred operations are tracked in the order that they're added to the
defer_ops, and the intent items are logged in that order so that
recovery happens in (more or less) the same way that they would during a
successful transaction commit.  That required defer_ops to shoulder a
larger burden of state tracking, at which point it made more sense to
make extent freeing another defer_ops client.

However, there's also the question of how does this new defer_ops thing
interact with the extent busy code?  AFAIK, the extent busy code
prevents XFS from reusing a freed block for data until the transaction
that frees it has committed so that log recover cannot replay updates to
a block that has already been rewritten with file data (i.e.
xlog_cil_committed() has run to completion).  The deferred ops system
will roll the transaction repeatedly until all the ops are finished, but
there's nothing about it that changes that guarantee.  Metadata blocks
being freed are still put in the busy-extent rbtree where they remain
until the CIL commits and clears them (or they get reused for other
metadata) which prevents premature reuse.  In other words, extent
freeing should behave the same as before, even in the presence of the
new pieces.

> Especially the extent free case is a lot simpler than all others, and
> we now place the burden of a complex abstraction onto code that would
> otherwise be fairly easily understandable.

Right, it would be nice to be able to keep using the old bmap_free code
when we don't need the complex operations.  I guess you could make the
public xfs_defer.c functions short-circuit to the bmap_free code if
!rmap && !reflink.

> Also it adds a memory allocation to the extent free code,

The second memory allocation could be eliminated by kmem_alloc'ing more
bytes in xfs_defer_add and changing xfs_defer_add to take a pointer to a
work item buffer that would be copied into the extra space.  I'm not
sure how much we'd gain since there are already plenty of allocations
in that path anyway.

> and gets in the way of merging the freed extent and extent busy
> structures, something I prototyped a while ago.

I searched my email spool back to 9/2012 but didn't find any patches,
so I don't really know how to respond to this.

--D
diff mbox

Patch

diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index 054a203..faba739 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -81,6 +81,10 @@ 
  *   - For each work item attached to the log intent item,
  *     * Perform the described action.
  *     * Attach the work item to the log done item.
+ *     * If the result of doing the work was -EAGAIN, ->finish work
+ *       wants a new transaction.  See the "Requesting a Fresh
+ *       Transaction while Finishing Deferred Work" section below for
+ *       details.
  *
  * The key here is that we must log an intent item for all pending
  * work items every time we roll the transaction, and that we must log
@@ -88,6 +92,34 @@ 
  * we can perform complex remapping operations, chaining intent items
  * as needed.
  *
+ * Requesting a Fresh Transaction while Finishing Deferred Work
+ *
+ * If ->finish_item decides that it needs a fresh transaction to
+ * finish the work, it must ask its caller (xfs_defer_finish) for a
+ * continuation.  The most likely cause of this circumstance are the
+ * refcount adjust functions deciding that they've logged enough items
+ * to be at risk of exceeding the transaction reservation.
+ *
+ * To get a fresh transaction, we want to log the existing log done
+ * item to prevent the log intent item from replaying, immediately log
+ * a new log intent item with the unfinished work items, roll the
+ * transaction, and re-call ->finish_item wherever it left off.  The
+ * log done item and the new log intent item must be in the same
+ * transaction or atomicity cannot be guaranteed; defer_finish ensures
+ * that this happens.
+ *
+ * This requires some coordination between ->finish_item and
+ * defer_finish.  Upon deciding to request a new transaction,
+ * ->finish_item should update the current work item to reflect the
+ * unfinished work.  Next, it should reset the log done item's list
+ * count to the number of items finished, and return -EAGAIN.
+ * defer_finish sees the -EAGAIN, logs the new log intent item
+ * with the remaining work items, and leaves the xfs_defer_pending
+ * item at the head of the dop_work queue.  Then it rolls the
+ * transaction and picks up processing where it left off.  It is
+ * required that ->finish_item must be careful to leave enough
+ * transaction reservation to fit the new log intent item.
+ *
  * This is an example of remapping the extent (E, E+B) into file X at
  * offset A and dealing with the extent (C, C+B) already being mapped
  * there:
@@ -104,21 +136,26 @@ 
  * | Intent to add rmap (X, E, A, B)                 |
  * +-------------------------------------------------+
  * | Reduce refcount for extent (C, B)               | t2
- * | Done reducing refcount for extent (C, B)        |
+ * | Done reducing refcount for extent (C, 9)        |
+ * | Intent to reduce refcount for extent (C+9, B-9) |
+ * | (ran out of space after 9 refcount updates)     |
+ * +-------------------------------------------------+
+ * | Reduce refcount for extent (C+9, B+9)           | t3
+ * | Done reducing refcount for extent (C+9, B-9)    |
  * | Increase refcount for extent (E, B)             |
  * | Done increasing refcount for extent (E, B)      |
  * | Intent to free extent (C, B)                    |
  * | Intent to free extent (F, 1) (refcountbt block) |
  * | Intent to remove rmap (F, 1, REFC)              |
  * +-------------------------------------------------+
- * | Remove rmap (X, C, A, B)                        | t3
+ * | Remove rmap (X, C, A, B)                        | t4
  * | Done removing rmap (X, C, A, B)                 |
  * | Add rmap (X, E, A, B)                           |
  * | Done adding rmap (X, E, A, B)                   |
  * | Remove rmap (F, 1, REFC)                        |
  * | Done removing rmap (F, 1, REFC)                 |
  * +-------------------------------------------------+
- * | Free extent (C, B)                              | t4
+ * | Free extent (C, B)                              | t5
  * | Done freeing extent (C, B)                      |
  * | Free extent (D, 1)                              |
  * | Done freeing extent (D, 1)                      |
@@ -141,6 +178,9 @@ 
  * - Intent to free extent (C, B)
  * - Intent to free extent (F, 1) (refcountbt block)
  * - Intent to remove rmap (F, 1, REFC)
+ *
+ * Note that the continuation requested between t2 and t3 is likely to
+ * reoccur.
  */
 
 static const struct xfs_defer_op_type *defer_op_types[XFS_DEFER_OPS_TYPE_MAX];
@@ -332,7 +372,16 @@  xfs_defer_finish(
 			dfp->dfp_count--;
 			error = dfp->dfp_type->finish_item(*tp, dop, li,
 					done_item, &state);
-			if (error) {
+			if (error == -EAGAIN) {
+				/*
+				 * Caller wants a fresh transaction;
+				 * put the work item back on the list
+				 * and jump out.
+				 */
+				list_add(li, &dfp->dfp_work);
+				dfp->dfp_count++;
+				break;
+			} else if (error) {
 				/*
 				 * Clean up after ourselves and jump out.
 				 * xfs_defer_cancel will take care of freeing
@@ -344,9 +393,24 @@  xfs_defer_finish(
 				goto out;
 			}
 		}
-		/* Done with the dfp, free it. */
-		list_del(&dfp->dfp_list);
-		kmem_free(dfp);
+		if (error == -EAGAIN) {
+			/*
+			 * Caller wants a fresh transaction, so log a
+			 * new log intent item to replace the old one
+			 * and roll the transaction.  See "Requesting
+			 * a Fresh Transaction while Finishing
+			 * Deferred Work" above.
+			 */
+			dfp->dfp_intent = dfp->dfp_type->create_intent(*tp,
+					dfp->dfp_count);
+			list_for_each(li, &dfp->dfp_work)
+				dfp->dfp_type->log_item(*tp, dfp->dfp_intent,
+						li);
+		} else {
+			/* Done with the dfp, free it. */
+			list_del(&dfp->dfp_list);
+			kmem_free(dfp);
+		}
 
 		if (cleanup_fn)
 			cleanup_fn(*tp, state, error);