[RFC] xfs: automatic log item relogging experiment
diff mbox series

Message ID 20191024172850.7698-1-bfoster@redhat.com
State New
Headers show
Series
  • [RFC] xfs: automatic log item relogging experiment
Related show

Commit Message

Brian Foster Oct. 24, 2019, 5:28 p.m. UTC
An experimental mechanism to automatically relog specified log
items.  This is useful for long running operations, like quotaoff,
that otherwise risk deadlocking on log space usage.

Not-signed-off-by: Brian Foster <bfoster@redhat.com>
---

Hi all,

This is an experiment that came out of a discussion with Darrick[1] on
some design bits of the latest online btree repair work. Specifically,
it logs free intents in the same transaction as block allocation to
guard against inconsistency in the event of a crash during the repair
sequence. These intents happen pin the log tail for an indeterminate
amount of time. Darrick was looking for some form of auto relog
mechanism to facilitate this general approach. It occurred to us that
this is the same problem we've had with quotaoff for some time, so I
figured it might be worth prototyping something against that to try and
prove the concept.

Note that this is RFC because the code and interfaces are a complete
mess and this is broken in certain ways. This occasionally triggers log
reservation overrun shutdowns because transaction reservation checking
has not yet been added, the cancellation path is overkill, etc. IOW, the
purpose of this patch is purely to test a concept.

The concept is essentially to flag a log item for relogging on first
transaction commit such that once it commits to the AIL, the next
transaction that happens to commit with sufficient unused reservation
opportunistically relogs the item to the current CIL context. For the
log intent case, the transaction that commits the done item is required
to cancel the relog state of the original intent to prevent further
relogging.

In practice, this allows a log item to automatically roll through CIL
checkpoints and not pin the tail of the log while something like a
quotaoff is running for a potentially long period of time. This is
applied to quotaoff and focused testing shows that it avoids the
associated deadlock.

Thoughts, reviews, flames appreciated.

[1] https://lore.kernel.org/linux-xfs/20191018143856.GA25763@bfoster/

 fs/xfs/xfs_log_cil.c     | 69 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_log_priv.h    |  6 ++++
 fs/xfs/xfs_qm_syscalls.c | 13 ++++++++
 fs/xfs/xfs_trace.h       |  2 ++
 fs/xfs/xfs_trans.c       |  4 +++
 fs/xfs/xfs_trans.h       |  4 ++-
 6 files changed, 97 insertions(+), 1 deletion(-)

Comments

Dave Chinner Oct. 24, 2019, 10:43 p.m. UTC | #1
On Thu, Oct 24, 2019 at 01:28:50PM -0400, Brian Foster wrote:
> An experimental mechanism to automatically relog specified log
> items.  This is useful for long running operations, like quotaoff,
> that otherwise risk deadlocking on log space usage.
> 
> Not-signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
> 
> Hi all,
> 
> This is an experiment that came out of a discussion with Darrick[1] on
> some design bits of the latest online btree repair work. Specifically,
> it logs free intents in the same transaction as block allocation to
> guard against inconsistency in the event of a crash during the repair
> sequence. These intents happen pin the log tail for an indeterminate
> amount of time. Darrick was looking for some form of auto relog
> mechanism to facilitate this general approach. It occurred to us that
> this is the same problem we've had with quotaoff for some time, so I
> figured it might be worth prototyping something against that to try and
> prove the concept.

Interesting idea. :)

> 
> Note that this is RFC because the code and interfaces are a complete
> mess and this is broken in certain ways. This occasionally triggers log
> reservation overrun shutdowns because transaction reservation checking
> has not yet been added, the cancellation path is overkill, etc. IOW, the
> purpose of this patch is purely to test a concept.

*nod*

> The concept is essentially to flag a log item for relogging on first
> transaction commit such that once it commits to the AIL, the next
> transaction that happens to commit with sufficient unused reservation
> opportunistically relogs the item to the current CIL context. For the
> log intent case, the transaction that commits the done item is required
> to cancel the relog state of the original intent to prevent further
> relogging.

Makes sense, but it seems like we removed the hook that would be
used by transactions to implement their own relogging on CIL commit
some time ago because nothign had used it for 15+ years....

> In practice, this allows a log item to automatically roll through CIL
> checkpoints and not pin the tail of the log while something like a
> quotaoff is running for a potentially long period of time. This is
> applied to quotaoff and focused testing shows that it avoids the
> associated deadlock.

Hmmm. How do we deal with multiple identical intents being found in
checkpoints with different LSNs in log recovery?

> Thoughts, reviews, flames appreciated.
> 
> [1] https://lore.kernel.org/linux-xfs/20191018143856.GA25763@bfoster/
> 
>  fs/xfs/xfs_log_cil.c     | 69 ++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_log_priv.h    |  6 ++++
>  fs/xfs/xfs_qm_syscalls.c | 13 ++++++++
>  fs/xfs/xfs_trace.h       |  2 ++
>  fs/xfs/xfs_trans.c       |  4 +++
>  fs/xfs/xfs_trans.h       |  4 ++-
>  6 files changed, 97 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index a1204424a938..b2d8b2d54df6 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -75,6 +75,33 @@ xlog_cil_iovec_space(
>  			sizeof(uint64_t));
>  }
>  
> +static void
> +xlog_cil_relog_items(
> +	struct xlog		*log,
> +	struct xfs_trans	*tp)
> +{
> +	struct xfs_cil		*cil = log->l_cilp;
> +	struct xfs_log_item	*lip;
> +
> +	ASSERT(tp->t_flags & XFS_TRANS_DIRTY);
> +
> +	if (list_empty(&cil->xc_relog))
> +		return;
> +
> +	/* XXX: need to check trans reservation, process multiple items, etc. */
> +	spin_lock(&cil->xc_relog_lock);
> +	lip = list_first_entry_or_null(&cil->xc_relog, struct xfs_log_item, li_cil);
> +	if (lip)
> +		list_del_init(&lip->li_cil);
> +	spin_unlock(&cil->xc_relog_lock);
> +
> +	if (lip) {
> +		xfs_trans_add_item(tp, lip);
> +		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> +		trace_xfs_cil_relog(lip);
> +	}

I don't think this is safe - the log item needs to be locked to be
joined to a transaction. Hence this can race with whatever is
committing the done intent on the object and removing it from the
relog list and so the item could be clean (and potentially even
freed) by the time we go to add it to this transaction and mark it
dirty again...

> +}
> +
>  /*
>   * Allocate or pin log vector buffers for CIL insertion.
>   *
> @@ -1001,6 +1028,8 @@ xfs_log_commit_cil(
>  	struct xfs_log_item	*lip, *next;
>  	xfs_lsn_t		xc_commit_lsn;
>  
> +	xlog_cil_relog_items(log, tp);

Hmmm. This means that when there are relog items on the list, all
the transactional concurrency is going to be put onto the
xc_relog_lock spin lock. THis is potentially a major lock contention
point, especially if there are lots of items that need relogging.

> +
>  	/*
>  	 * Do all necessary memory allocation before we lock the CIL.
>  	 * This ensures the allocation does not deadlock with a CIL
> @@ -1196,6 +1225,8 @@ xlog_cil_init(
>  	spin_lock_init(&cil->xc_push_lock);
>  	init_rwsem(&cil->xc_ctx_lock);
>  	init_waitqueue_head(&cil->xc_commit_wait);
> +	INIT_LIST_HEAD(&cil->xc_relog);
> +	spin_lock_init(&cil->xc_relog_lock);
>  
>  	INIT_LIST_HEAD(&ctx->committing);
>  	INIT_LIST_HEAD(&ctx->busy_extents);
> @@ -1223,3 +1254,41 @@ xlog_cil_destroy(
>  	kmem_free(log->l_cilp);
>  }
>  
> +void
> +xfs_cil_relog_item(
> +	struct xlog		*log,
> +	struct xfs_log_item	*lip)
> +{
> +	struct xfs_cil		*cil = log->l_cilp;
> +
> +	ASSERT(test_bit(XFS_LI_RELOG, &lip->li_flags));
> +	ASSERT(list_empty(&lip->li_cil));

So this can't be used for things like inodes and buffers?

> +	spin_lock(&cil->xc_relog_lock);
> +	list_add_tail(&lip->li_cil, &cil->xc_relog);
> +	spin_unlock(&cil->xc_relog_lock);
> +
> +	trace_xfs_cil_relog_queue(lip);
> +}
> +
> +bool
> +xfs_cil_relog_steal(
> +	struct xlog		*log,
> +	struct xfs_log_item	*lip)
> +{
> +	struct xfs_cil		*cil = log->l_cilp;
> +	struct xfs_log_item	*pos, *ppos;
> +	bool			ret = false;
> +
> +	spin_lock(&cil->xc_relog_lock);
> +	list_for_each_entry_safe(pos, ppos, &cil->xc_relog, li_cil) {
> +		if (pos == lip) {
> +			list_del_init(&pos->li_cil);
> +			ret = true;
> +			break;
> +		}
> +	}
> +	spin_unlock(&cil->xc_relog_lock);

This is a remove operation, not a "steal" operation. But why are
we walking the relog list to find it? It would be much better to use
a flag to indicate what list the item is on, and then this just
becomes

	spin_lock(&cil->xc_relog_lock);
	if (test_and_clear_bit(XFS_LI_RELOG_LIST, &lip->li_flags))
		list_del_init(&pos->li_cil);
	spin_unlock(&cil->xc_relog_lock);



> @@ -556,6 +558,16 @@ xfs_qm_log_quotaoff_end(
>  					flags & XFS_ALL_QUOTA_ACCT);
>  	xfs_trans_log_quotaoff_item(tp, qoffi);
>  
> +	/*
> +	 * XXX: partly open coded relog of the start item to ensure no relogging
> +	 * after this point.
> +	 */
> +	clear_bit(XFS_LI_RELOG, &startqoff->qql_item.li_flags);
> +	if (xfs_cil_relog_steal(mp->m_log, &startqoff->qql_item)) {
> +		xfs_trans_add_item(tp, &startqoff->qql_item);
> +		xfs_trans_log_quotaoff_item(tp, startqoff);
> +	}

Urk. :)

> @@ -863,6 +864,9 @@ xfs_trans_committed_bulk(
>  		if (XFS_LSN_CMP(item_lsn, (xfs_lsn_t)-1) == 0)
>  			continue;
>  
> +		if (test_bit(XFS_LI_RELOG, &lip->li_flags))
> +			xfs_cil_relog_item(lip->li_mountp->m_log, lip);

Ok, so this is moving the item from commit back onto the relog list.
This is going to hammer the relog lock on workloads where there is a
lot of transaction concurrency and a substantial number of items on
the relog list....

----

Ok, so I mentioned that we removed the hooks that could have been
used for this some time ago.

What we actually want here is a notification that an object needs
relogging. I can see how appealing the concept of automatically
relogging is, but I'm unconvinced that we can make it work,
especially when there aren't sufficient reservations to relog
the items that need relogging.

commit d420e5c810bce5debce0238021b410d0ef99cf08
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Oct 15 09:17:53 2013 +1100

    xfs: remove unused transaction callback variables
    
    We don't do callbacks at transaction commit time, no do we have any
    infrastructure to set up or run such callbacks, so remove the
    variables and typedefs for these operations. If we ever need to add
    callbacks, we can reintroduce the variables at that time.
    
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Ben Myers <bpm@sgi.com>
    Signed-off-by: Ben Myers <bpm@sgi.com>

diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 09cf40b89e8c..71c835e9e810 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -85,18 +85,11 @@ struct xfs_item_ops {
 #define XFS_ITEM_LOCKED                2
 #define XFS_ITEM_FLUSHING      3
 
-/*
- * This is the type of function which can be given to xfs_trans_callback()
- * to be called upon the transaction's commit to disk.
- */
-typedef void (*xfs_trans_callback_t)(struct xfs_trans *, void *);
-
 /*
  * This is the structure maintained for every active transaction.
  */
 typedef struct xfs_trans {
        unsigned int            t_magic;        /* magic number */
-       xfs_log_callback_t      t_logcb;        /* log callback struct */
        unsigned int            t_type;         /* transaction type */
        unsigned int            t_log_res;      /* amt of log space resvd */
        unsigned int            t_log_count;    /* count for perm log res */

That's basically the functionality we want here - when the log item
hits the journal, we want a callback to tell us so we can relog it
ourselves if deemed necessary. i.e. it's time to reintroduce the
transaction/log commit callback infrastructure...

This would get used in conjunction with a permanent transaction
reservation, allowing the owner of the object to keep it locked over
transaction commit (while whatever work is running between intent
and done), and the transaction commit turns into a trans_roll. Then
we reserve space for the next relogging commit, and go about our
business.

On notification of the intent being logged, the relogging work
(which already has a transaction and a reservation) can be dumped to
a workqueue (to get it out of iclog completion context) and the item
relogged and the transaction rolled and re-reserved again.

This would work with any type of log item, not just intents, and it
doesn't have any reservation "stealing" requirements. ANd because
we've pre-reserved the log space for the relogging transaction, the
relogging callback will never get hung up on it's own items
preventing it from getting more log space.

i.e. we essentially make use of the existing relogging mechanism,
just with callbacks to trigger periodic relogging of the items for
long running transactions...

Thoughts?

Cheers,

Dave.
Brian Foster Oct. 25, 2019, 12:41 p.m. UTC | #2
On Fri, Oct 25, 2019 at 09:43:08AM +1100, Dave Chinner wrote:
> On Thu, Oct 24, 2019 at 01:28:50PM -0400, Brian Foster wrote:
> > An experimental mechanism to automatically relog specified log
> > items.  This is useful for long running operations, like quotaoff,
> > that otherwise risk deadlocking on log space usage.
> > 
> > Not-signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> > 
> > Hi all,
> > 
> > This is an experiment that came out of a discussion with Darrick[1] on
> > some design bits of the latest online btree repair work. Specifically,
> > it logs free intents in the same transaction as block allocation to
> > guard against inconsistency in the event of a crash during the repair
> > sequence. These intents happen pin the log tail for an indeterminate
> > amount of time. Darrick was looking for some form of auto relog
> > mechanism to facilitate this general approach. It occurred to us that
> > this is the same problem we've had with quotaoff for some time, so I
> > figured it might be worth prototyping something against that to try and
> > prove the concept.
> 
> Interesting idea. :)
> 
> > 
> > Note that this is RFC because the code and interfaces are a complete
> > mess and this is broken in certain ways. This occasionally triggers log
> > reservation overrun shutdowns because transaction reservation checking
> > has not yet been added, the cancellation path is overkill, etc. IOW, the
> > purpose of this patch is purely to test a concept.
> 
> *nod*
> 
> > The concept is essentially to flag a log item for relogging on first
> > transaction commit such that once it commits to the AIL, the next
> > transaction that happens to commit with sufficient unused reservation
> > opportunistically relogs the item to the current CIL context. For the
> > log intent case, the transaction that commits the done item is required
> > to cancel the relog state of the original intent to prevent further
> > relogging.
> 
> Makes sense, but it seems like we removed the hook that would be
> used by transactions to implement their own relogging on CIL commit
> some time ago because nothign had used it for 15+ years....
> 

Interesting, I'm not familiar with this...

> > In practice, this allows a log item to automatically roll through CIL
> > checkpoints and not pin the tail of the log while something like a
> > quotaoff is running for a potentially long period of time. This is
> > applied to quotaoff and focused testing shows that it avoids the
> > associated deadlock.
> 
> Hmmm. How do we deal with multiple identical intents being found in
> checkpoints with different LSNs in log recovery?
> 

Good question. I was kind of thinking they would be handled like
normally relogged items, but I hadn't got to recovery testing yet. For
quotaoff, no special handling is required because we simply turn off
quota flags (i.e. no filesystem changes are required) when the intent is
seen and don't do anything else with the log item. FWIW, we don't even
seem to check for an associated quotaoff_end item.

For something more involved like an EFI, it looks like we'd create a
duplicate log item for the intent and I suspect that would lead to a
double processing attempt. So this would require some further changes to
handle generic intent relogging properly. I don't think it's that
difficult of a problem; we do already allow for relogs of other things
obviously, we just don't currently do any tracking of already seen
intents. That said, this could probably be considered an ABI change if
older kernels don't know how to process the condition correctly, so we'd
have to guard against that with a feature bit or some such when a
filesystem isn't unmounted cleanly.

More broadly, this implies that we don't ever currently relog intents so
this is something that will come along with any implementation that
intends to do so unless we adopt an alternative along the lines of what
dfops processing does. Dfops completes the original intent and creates a
new one on each internal transaction roll. My early thought was that
would be overkill for relogs that are more intended to facilitate
unrelated progress (i.e. not pin the log tail) as opposed to track
progress in a particular operation (i.e. dfops freed 1 extent out of a 2
extent EFI, roll, complete the original EFI and create a new EFI with 1
extent left). The item never really changes in the first case where it
does in the second.

> > Thoughts, reviews, flames appreciated.
> > 
> > [1] https://lore.kernel.org/linux-xfs/20191018143856.GA25763@bfoster/
> > 
> >  fs/xfs/xfs_log_cil.c     | 69 ++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_log_priv.h    |  6 ++++
> >  fs/xfs/xfs_qm_syscalls.c | 13 ++++++++
> >  fs/xfs/xfs_trace.h       |  2 ++
> >  fs/xfs/xfs_trans.c       |  4 +++
> >  fs/xfs/xfs_trans.h       |  4 ++-
> >  6 files changed, 97 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index a1204424a938..b2d8b2d54df6 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -75,6 +75,33 @@ xlog_cil_iovec_space(
> >  			sizeof(uint64_t));
> >  }
> >  
> > +static void
> > +xlog_cil_relog_items(
> > +	struct xlog		*log,
> > +	struct xfs_trans	*tp)
> > +{
> > +	struct xfs_cil		*cil = log->l_cilp;
> > +	struct xfs_log_item	*lip;
> > +
> > +	ASSERT(tp->t_flags & XFS_TRANS_DIRTY);
> > +
> > +	if (list_empty(&cil->xc_relog))
> > +		return;
> > +
> > +	/* XXX: need to check trans reservation, process multiple items, etc. */
> > +	spin_lock(&cil->xc_relog_lock);
> > +	lip = list_first_entry_or_null(&cil->xc_relog, struct xfs_log_item, li_cil);
> > +	if (lip)
> > +		list_del_init(&lip->li_cil);
> > +	spin_unlock(&cil->xc_relog_lock);
> > +
> > +	if (lip) {
> > +		xfs_trans_add_item(tp, lip);
> > +		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> > +		trace_xfs_cil_relog(lip);
> > +	}
> 
> I don't think this is safe - the log item needs to be locked to be
> joined to a transaction. Hence this can race with whatever is
> committing the done intent on the object and removing it from the
> relog list and so the item could be clean (and potentially even
> freed) by the time we go to add it to this transaction and mark it
> dirty again...
> 

This was targeted to intents for the time being, which AFAIA are
unshared structures in that there is no associated metadata object lock.
It's possible I've missed something by narrowly focusing on the quotaoff
case, however. I probably should have been more clear on that in the
commit log. For such intents, the item can't be dropped from the AIL (or
freed) until the done item commits, otherwise we'd have fs consistency
issues.

I was planning to eventually look into more generic log item relogging
if this basic experiment panned out. That would certainly be more
complicated in terms of having to transfer and manage lock/reference
state of the associated metadata objects, etc., as you point out.

> > +}
> > +
> >  /*
> >   * Allocate or pin log vector buffers for CIL insertion.
> >   *
> > @@ -1001,6 +1028,8 @@ xfs_log_commit_cil(
> >  	struct xfs_log_item	*lip, *next;
> >  	xfs_lsn_t		xc_commit_lsn;
> >  
> > +	xlog_cil_relog_items(log, tp);
> 
> Hmmm. This means that when there are relog items on the list, all
> the transactional concurrency is going to be put onto the
> xc_relog_lock spin lock. THis is potentially a major lock contention
> point, especially if there are lots of items that need relogging.
> 

Potentially, yes. The use cases I was considering were basically limited
to quotaoff and scrub, so I'm not sure performance is a huge concern for
now. The approach is opportunistic in general, so that also means this
could be implemented to avoid this contention without reducing the
effectiveness (once some actual reservation checking and whatnot is in
place).

> > +
> >  	/*
> >  	 * Do all necessary memory allocation before we lock the CIL.
> >  	 * This ensures the allocation does not deadlock with a CIL
> > @@ -1196,6 +1225,8 @@ xlog_cil_init(
> >  	spin_lock_init(&cil->xc_push_lock);
> >  	init_rwsem(&cil->xc_ctx_lock);
> >  	init_waitqueue_head(&cil->xc_commit_wait);
> > +	INIT_LIST_HEAD(&cil->xc_relog);
> > +	spin_lock_init(&cil->xc_relog_lock);
> >  
> >  	INIT_LIST_HEAD(&ctx->committing);
> >  	INIT_LIST_HEAD(&ctx->busy_extents);
> > @@ -1223,3 +1254,41 @@ xlog_cil_destroy(
> >  	kmem_free(log->l_cilp);
> >  }
> >  
> > +void
> > +xfs_cil_relog_item(
> > +	struct xlog		*log,
> > +	struct xfs_log_item	*lip)
> > +{
> > +	struct xfs_cil		*cil = log->l_cilp;
> > +
> > +	ASSERT(test_bit(XFS_LI_RELOG, &lip->li_flags));
> > +	ASSERT(list_empty(&lip->li_cil));
> 
> So this can't be used for things like inodes and buffers?
> 

Not at this stage, but I think it's worth considering from a design
standpoint because I'd expect to be able to cover that case eventually.
If there are big picture issues/concerns, it's best to think about them
now. ;)

> > +	spin_lock(&cil->xc_relog_lock);
> > +	list_add_tail(&lip->li_cil, &cil->xc_relog);
> > +	spin_unlock(&cil->xc_relog_lock);
> > +
> > +	trace_xfs_cil_relog_queue(lip);
> > +}
> > +
> > +bool
> > +xfs_cil_relog_steal(
> > +	struct xlog		*log,
> > +	struct xfs_log_item	*lip)
> > +{
> > +	struct xfs_cil		*cil = log->l_cilp;
> > +	struct xfs_log_item	*pos, *ppos;
> > +	bool			ret = false;
> > +
> > +	spin_lock(&cil->xc_relog_lock);
> > +	list_for_each_entry_safe(pos, ppos, &cil->xc_relog, li_cil) {
> > +		if (pos == lip) {
> > +			list_del_init(&pos->li_cil);
> > +			ret = true;
> > +			break;
> > +		}
> > +	}
> > +	spin_unlock(&cil->xc_relog_lock);
> 
> This is a remove operation, not a "steal" operation. But why are
> we walking the relog list to find it? It would be much better to use
> a flag to indicate what list the item is on, and then this just
> becomes
> 

Er, yeah this is a mess. I was actually planning to do exactly this,
only with the dirty bit instead of a custom bit. The reason for the
current code is basically due to a combination of some bugs I had in the
code and things getting implemented out of order such that I already had
this function by the time I realized I could just use the dirty bit. I
didn't feel the need to rework it for the purpose of the POC. Of course
that depends on some of the same intent-only assumptions as discussed
above, so a relog specific bit is probably a more appropriate long term
solution.

> 	spin_lock(&cil->xc_relog_lock);
> 	if (test_and_clear_bit(XFS_LI_RELOG_LIST, &lip->li_flags))
> 		list_del_init(&pos->li_cil);
> 	spin_unlock(&cil->xc_relog_lock);
> 
> 
> 
> > @@ -556,6 +558,16 @@ xfs_qm_log_quotaoff_end(
> >  					flags & XFS_ALL_QUOTA_ACCT);
> >  	xfs_trans_log_quotaoff_item(tp, qoffi);
> >  
> > +	/*
> > +	 * XXX: partly open coded relog of the start item to ensure no relogging
> > +	 * after this point.
> > +	 */
> > +	clear_bit(XFS_LI_RELOG, &startqoff->qql_item.li_flags);
> > +	if (xfs_cil_relog_steal(mp->m_log, &startqoff->qql_item)) {
> > +		xfs_trans_add_item(tp, &startqoff->qql_item);
> > +		xfs_trans_log_quotaoff_item(tp, startqoff);
> > +	}
> 
> Urk. :)
> 
> > @@ -863,6 +864,9 @@ xfs_trans_committed_bulk(
> >  		if (XFS_LSN_CMP(item_lsn, (xfs_lsn_t)-1) == 0)
> >  			continue;
> >  
> > +		if (test_bit(XFS_LI_RELOG, &lip->li_flags))
> > +			xfs_cil_relog_item(lip->li_mountp->m_log, lip);
> 
> Ok, so this is moving the item from commit back onto the relog list.
> This is going to hammer the relog lock on workloads where there is a
> lot of transaction concurrency and a substantial number of items on
> the relog list....
> 

Hmm.. so concurrency is somewhat limited on this end of things by this
being in log completion context. I suspect you are referring to the
broader concurrency between log completion and incoming transaction
commits, which I think is a fair point. I'd have to think more about
that when I got to polishing the relog processing stuff. This most
likely would require some balancing between reducing contention and
maintaining effectiveness.

> ----
> 
> Ok, so I mentioned that we removed the hooks that could have been
> used for this some time ago.
> 
> What we actually want here is a notification that an object needs
> relogging. I can see how appealing the concept of automatically
> relogging is, but I'm unconvinced that we can make it work,
> especially when there aren't sufficient reservations to relog
> the items that need relogging.
> 

I'm not unconvinced it can be made to work, but I agree that more work
is required to demonstrate that. I'm open to ideas for cleaner
solutions. That's why I posted this in its current form. :)

> commit d420e5c810bce5debce0238021b410d0ef99cf08
> Author: Dave Chinner <dchinner@redhat.com>
> Date:   Tue Oct 15 09:17:53 2013 +1100
> 
>     xfs: remove unused transaction callback variables
>     
>     We don't do callbacks at transaction commit time, no do we have any
>     infrastructure to set up or run such callbacks, so remove the
>     variables and typedefs for these operations. If we ever need to add
>     callbacks, we can reintroduce the variables at that time.
>     
>     Signed-off-by: Dave Chinner <dchinner@redhat.com>
>     Reviewed-by: Ben Myers <bpm@sgi.com>
>     Signed-off-by: Ben Myers <bpm@sgi.com>
> 
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index 09cf40b89e8c..71c835e9e810 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -85,18 +85,11 @@ struct xfs_item_ops {
>  #define XFS_ITEM_LOCKED                2
>  #define XFS_ITEM_FLUSHING      3
>  
> -/*
> - * This is the type of function which can be given to xfs_trans_callback()
> - * to be called upon the transaction's commit to disk.
> - */
> -typedef void (*xfs_trans_callback_t)(struct xfs_trans *, void *);
> -
>  /*
>   * This is the structure maintained for every active transaction.
>   */
>  typedef struct xfs_trans {
>         unsigned int            t_magic;        /* magic number */
> -       xfs_log_callback_t      t_logcb;        /* log callback struct */
>         unsigned int            t_type;         /* transaction type */
>         unsigned int            t_log_res;      /* amt of log space resvd */
>         unsigned int            t_log_count;    /* count for perm log res */
> 
> That's basically the functionality we want here - when the log item
> hits the journal, we want a callback to tell us so we can relog it
> ourselves if deemed necessary. i.e. it's time to reintroduce the
> transaction/log commit callback infrastructure...
> 

As noted earlier, this old mechanism is new to me.. This commit just
appears to remove an unused hook and the surrounding commits don't look
related. Was this previously used for something in particular?

> This would get used in conjunction with a permanent transaction
> reservation, allowing the owner of the object to keep it locked over
> transaction commit (while whatever work is running between intent
> and done), and the transaction commit turns into a trans_roll. Then
> we reserve space for the next relogging commit, and go about our
> business.
> 

Hmm.. IIUC the original intent committer rolls instead of commits to
reacquire relogging reservation, but what if the submitter needs to
allocate and commit unrelated transactions before it gets to the point
of completing the intent? I was originally thinking about "pre-reserving
and donating" relogging reservation along these lines, but I thought
reserving transactions like this (first for the intent+relog, then for
whatever random transaction is next) was a potential deadlock vector.
Perhaps it's not if the associated items have all been committed to the
log subsystem. It also seemed unnecessary given our current design of
worst case reservation sizes, but that's a separate topic and may only
apply to intents (which are small compared to metadata associated with
generic log items).

So are you suggesting ownership of the committed transaction transfers
to the log subsystem somehow or another? For example, we internally roll
and queue the _transaction_ (not the log item) for relogging while from
the caller's perspective the transaction has committed? Or the
additional reservation is pulled off and the transaction commits (i.e.
frees)? Or something else?

FWIW, I'm also wondering if this lends itself to some form of batching
for if we get to the point of relogging a large number of small items.
For example, consider a single dedicated/relogging transaction for many
small (or related) intents vs. N independent transactions processing in
the background. This is something that could came later once the
fundamentals are worked out, however.

> On notification of the intent being logged, the relogging work
> (which already has a transaction and a reservation) can be dumped to
> a workqueue (to get it out of iclog completion context) and the item
> relogged and the transaction rolled and re-reserved again.
> 
> This would work with any type of log item, not just intents, and it
> doesn't have any reservation "stealing" requirements. ANd because
> we've pre-reserved the log space for the relogging transaction, the
> relogging callback will never get hung up on it's own items
> preventing it from getting more log space.
> 
> i.e. we essentially make use of the existing relogging mechanism,
> just with callbacks to trigger periodic relogging of the items for
> long running transactions...
> 
> Thoughts?
> 

All in all this sounds interesting. I still need to grok the transaction
reservation/ownership semantics you propose re: the questions above, but
I do like the prospect of reusing existing mechanisms and thus being
able to more easily handle generic log items. Thanks for the
feedback...

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Oct. 25, 2019, 11:16 p.m. UTC | #3
On Fri, Oct 25, 2019 at 08:41:17AM -0400, Brian Foster wrote:
> On Fri, Oct 25, 2019 at 09:43:08AM +1100, Dave Chinner wrote:
> > On Thu, Oct 24, 2019 at 01:28:50PM -0400, Brian Foster wrote:
> > > An experimental mechanism to automatically relog specified log
> > > items.  This is useful for long running operations, like quotaoff,
> > > that otherwise risk deadlocking on log space usage.
> > > 
> > > Not-signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > > 
> > > Hi all,
> > > 
> > > This is an experiment that came out of a discussion with Darrick[1] on
> > > some design bits of the latest online btree repair work. Specifically,
> > > it logs free intents in the same transaction as block allocation to
> > > guard against inconsistency in the event of a crash during the repair
> > > sequence. These intents happen pin the log tail for an indeterminate
> > > amount of time. Darrick was looking for some form of auto relog
> > > mechanism to facilitate this general approach. It occurred to us that
> > > this is the same problem we've had with quotaoff for some time, so I
> > > figured it might be worth prototyping something against that to try and
> > > prove the concept.
> > 
> > Interesting idea. :)
> > 
> > > 
> > > Note that this is RFC because the code and interfaces are a complete
> > > mess and this is broken in certain ways. This occasionally triggers log
> > > reservation overrun shutdowns because transaction reservation checking
> > > has not yet been added, the cancellation path is overkill, etc. IOW, the
> > > purpose of this patch is purely to test a concept.
> > 
> > *nod*
> > 
> > > The concept is essentially to flag a log item for relogging on first
> > > transaction commit such that once it commits to the AIL, the next
> > > transaction that happens to commit with sufficient unused reservation
> > > opportunistically relogs the item to the current CIL context. For the
> > > log intent case, the transaction that commits the done item is required
> > > to cancel the relog state of the original intent to prevent further
> > > relogging.
> > 
> > Makes sense, but it seems like we removed the hook that would be
> > used by transactions to implement their own relogging on CIL commit
> > some time ago because nothign had used it for 15+ years....
> > 
> 
> Interesting, I'm not familiar with this...
> 
> > > In practice, this allows a log item to automatically roll through CIL
> > > checkpoints and not pin the tail of the log while something like a
> > > quotaoff is running for a potentially long period of time. This is
> > > applied to quotaoff and focused testing shows that it avoids the
> > > associated deadlock.
> > 
> > Hmmm. How do we deal with multiple identical intents being found in
> > checkpoints with different LSNs in log recovery?
> > 
> 
> Good question. I was kind of thinking they would be handled like
> normally relogged items, but I hadn't got to recovery testing yet. For
> quotaoff, no special handling is required because we simply turn off
> quota flags (i.e. no filesystem changes are required) when the intent is
> seen and don't do anything else with the log item. FWIW, we don't even
> seem to check for an associated quotaoff_end item.
> 
> For something more involved like an EFI, it looks like we'd create a
> duplicate log item for the intent and I suspect that would lead to a
> double processing attempt. So this would require some further changes to
> handle generic intent relogging properly.

Right, that's kinda what I was getting at, but most intents as
currently implemented are somewhat special from a "relogging"
perspective in that they don't track ongoing modifications that have
been made since the intent was originally logged.

> I don't think it's that
> difficult of a problem; we do already allow for relogs of other things
> obviously, we just don't currently do any tracking of already seen
> intents. That said, this could probably be considered an ABI change if
> older kernels don't know how to process the condition correctly, so we'd
> have to guard against that with a feature bit or some such when a
> filesystem isn't unmounted cleanly.

I don't think tracking duplicate intents is a difficult problem,
either. The problem I see is when the original intent drops a full
log cycle behind (i.e. out of the active recovery window) before
the log recovery terminates and fails to find the intent-done item,
hence we are left in a situation where -some- of the work has been
done, but we don't know exactly what...

Intents were designed to work within a permanent (rolling)
transaction sequence - the intent and the done intent are directly
related and the work that is done is done within the context of that
rolling transaction. Expanding an intent to span an unbound amount
of work should be ok if this work is done in the context of a single
rolling transaction and we can relog the intent safely. However,
Allowing intents to cover multiple independent transactions is a big
can of worms I really don't us want to open...

> More broadly, this implies that we don't ever currently relog intents so
> this is something that will come along with any implementation that
> intends to do so unless we adopt an alternative along the lines of what
> dfops processing does. Dfops completes the original intent and creates a
> new one on each internal transaction roll.

Right - chaining intents gives us atomic transactions from a
recovery perspective. This works because intents cover only a small
operation, so individual intents will never pin the tail of the log
over the scope of the larger modification being made. This avoids
the can of worms that open-ended, transaction indepedent intents
expose....

> My early thought was that
> would be overkill for relogs that are more intended to facilitate
> unrelated progress (i.e. not pin the log tail) as opposed to track
> progress in a particular operation (i.e. dfops freed 1 extent out of a 2
> extent EFI, roll, complete the original EFI and create a new EFI with 1
> extent left). The item never really changes in the first case where it
> does in the second.

The EFI is probably a bad example here - they were designed to relog
the EFD on partial completion, not the EFI. i.e. the EFI can log an
aribtrary number of extents to free, the EFD gets relogged on each
extent free completion until all extents in the EFI are freed. The
EFI itself was never intended to be relogged in the situation you
describe above.(*)

So, essentially, the EFD is the reloggable object that tracks
completion, and so to relog an EFI, we could just  we could just
relog the original EFI with the current EFD state indicating how
much of the EFI had been completed. We already have to roll the
inode in each of the EFD committing transactions to prevent it from
pinning the tail of the log, so the EFI/EFD pair could easily be
relogged like this, too. (**)

But for intent/intent done pairs that aren't designed for a rolling
completion like the EFD are more problematic, especially if we
aren't actually logging a completion state but just relogging the
intent. If the work being done is not in the same transaction
context as the intent (i.e. independent transactions rather than a
rolling context), then the intent cannot be sanely relogged without
modification as there is work that is in the journal between the two
intents that cannot be tracked by the intent relogging.

i.e. if the intent falls off the tail of the log, then we have the
situation of having a partial operation already on stable storage
but no way of tracking how much work has been done. Hence I think we
must confine relogging of intents to a single permanent transaction
context, and intents that can get relogged must have an intent done
logged with them to indicate progress that has been made. And, of
course, log recovery will need mods to handle this, too.

(*) That iterative partial EFD completion was never implemented in
the original design - I don't know the reason as it's not in the
commit history - but I can guess that it was because the EFI pinned
the tail of the log and caused deadlocks when the log was small.
Accounting for this in reservations made the reservation size blow
out and so defeated the purpose of rolling transactions to keep the
reservation size down.

(**) We only ever log single extent EFIs even with defer ops because
of XFS_BUI_MAX_FAST_EXTENTS = 1.

> > commit d420e5c810bce5debce0238021b410d0ef99cf08
> > Author: Dave Chinner <dchinner@redhat.com>
> > Date:   Tue Oct 15 09:17:53 2013 +1100
> > 
> >     xfs: remove unused transaction callback variables
> >     
> >     We don't do callbacks at transaction commit time, no do we have any
> >     infrastructure to set up or run such callbacks, so remove the
> >     variables and typedefs for these operations. If we ever need to add
> >     callbacks, we can reintroduce the variables at that time.
> >     
> >     Signed-off-by: Dave Chinner <dchinner@redhat.com>
> >     Reviewed-by: Ben Myers <bpm@sgi.com>
> >     Signed-off-by: Ben Myers <bpm@sgi.com>
> > 
> > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > index 09cf40b89e8c..71c835e9e810 100644
> > --- a/fs/xfs/xfs_trans.h
> > +++ b/fs/xfs/xfs_trans.h
> > @@ -85,18 +85,11 @@ struct xfs_item_ops {
> >  #define XFS_ITEM_LOCKED                2
> >  #define XFS_ITEM_FLUSHING      3
> >  
> > -/*
> > - * This is the type of function which can be given to xfs_trans_callback()
> > - * to be called upon the transaction's commit to disk.
> > - */
> > -typedef void (*xfs_trans_callback_t)(struct xfs_trans *, void *);
> > -
> >  /*
> >   * This is the structure maintained for every active transaction.
> >   */
> >  typedef struct xfs_trans {
> >         unsigned int            t_magic;        /* magic number */
> > -       xfs_log_callback_t      t_logcb;        /* log callback struct */
> >         unsigned int            t_type;         /* transaction type */
> >         unsigned int            t_log_res;      /* amt of log space resvd */
> >         unsigned int            t_log_count;    /* count for perm log res */
> > 
> > That's basically the functionality we want here - when the log item
> > hits the journal, we want a callback to tell us so we can relog it
> > ourselves if deemed necessary. i.e. it's time to reintroduce the
> > transaction/log commit callback infrastructure...
> > 
> 
> As noted earlier, this old mechanism is new to me.. This commit just
> appears to remove an unused hook and the surrounding commits don't look
> related. Was this previously used for something in particular?

This is how xfs_trans_committed() used to be attached to the
iclogbuf and called on completion. It was added back in 1994, and
only ever used to call xfs_trans_committed() in the XFS code. IIRC,
there was also an intent to use it in the CXFS metadata server to
co-ordinate internal server state with the on-disk filesystem state,
but I don't think it ever happened as cluster-wide state recovery
after a system/server crash was already huge problem for CXFS
without adding custom logging interactions to it...

The code, prior to it's removal as delayed logging didn't use it
anymore, was:

       /*
        * Tell the LM to call the transaction completion routine
        * when the log write with LSN commit_lsn completes (e.g.
        * when the transaction commit really hits the on-disk log).
        * After this call we cannot reference tp, because the call
        * can happen at any time and the call will free the transaction
        * structure pointed to by tp.  The only case where we call
        * the completion routine (xfs_trans_committed) directly is
        * if the log is turned off on a debug kernel or we're
        * running in simulation mode (the log is explicitly turned
        * off).
        */
       tp->t_logcb.cb_func = xfs_trans_committed;
       tp->t_logcb.cb_arg = tp;

       /*
        * We need to pass the iclog buffer which was used for the
        * transaction commit record into this function, and attach
        * the callback to it. The callback must be attached before
        * the items are unlocked to avoid racing with other threads
        * waiting for an item to unlock.
        */
       shutdown = xfs_log_notify(mp, commit_iclog, &(tp->t_logcb));

IOWs, we used to attach individual transactions directly to the
iclog to be called back from iclog completion and processed
directly. Essentially, it is a "commit is now stable" notification
mechanism.

> > This would get used in conjunction with a permanent transaction
> > reservation, allowing the owner of the object to keep it locked over
> > transaction commit (while whatever work is running between intent
> > and done), and the transaction commit turns into a trans_roll. Then
> > we reserve space for the next relogging commit, and go about our
> > business.
> > 
> 
> Hmm.. IIUC the original intent committer rolls instead of commits to
> reacquire relogging reservation, but what if the submitter needs to
> allocate and commit unrelated transactions before it gets to the point
> of completing the intent?

Not allowed. The intent + intent done must be committed as part of
the same rolling transaction sequence. Commiting an intent without a
reservation for the intent-done is a deadlock vector. i.e. intent
pins the tail of the log, can't reserve space for the intent-done to
remove it from the log.

Fundamentally, it is the responsibility of the running transaction
to ensure objects it logs do not pin the tail of the log and prevent
forwards progress of the transaction. The whole subsystem is
designed around this premise. Hence while I can see how automatic
relogging of intents like you've proposed is appealing, but it
doesn't follow the "forwards progress guarantee" rules that the
transactional model is built around.

Hence I think that if we have a long running intent that requires
relogging, it needs to be done under a single rolling transaction
context, not indepednent, unrelated transactions. We can roll
permanent transactions an unbound number of times (just think about
freeing several million extents in truncate) without penalty if they
follow the forwards progress rules, so the only thing we need to do
here is consider how to relog an intent pair for a specific
operation safely.

> I was originally thinking about "pre-reserving
> and donating" relogging reservation along these lines, but I thought
> reserving transactions like this (first for the intent+relog, then for
> whatever random transaction is next) was a potential deadlock vector.

This is what rolling transactions do. The reservation for any
specific transaction in a sequence is guaranteed to be enough to
for the next transaction in the sequence to run to completion.

> Perhaps it's not if the associated items have all been committed to the
> log subsystem. It also seemed unnecessary given our current design of
> worst case reservation sizes, but that's a separate topic and may only
> apply to intents (which are small compared to metadata associated with
> generic log items).
> 
> So are you suggesting ownership of the committed transaction transfers
> to the log subsystem somehow or another? For example, we internally roll
> and queue the _transaction_ (not the log item) for relogging while from
> the caller's perspective the transaction has committed? Or the
> additional reservation is pulled off and the transaction commits (i.e.
> frees)? Or something else?

Not internal. Subsystem adds a callback to the transaction it is
about to commit, that gets moved to the CIL context, when
the CIL commits and the iclog completion processing is done then
the CIL commit callbacks are run with whatever context the caller
attached to it.

> FWIW, I'm also wondering if this lends itself to some form of batching
> for if we get to the point of relogging a large number of small items.
> For example, consider a single dedicated/relogging transaction for many
> small (or related) intents vs. N independent transactions processing in
> the background. This is something that could came later once the
> fundamentals are worked out, however.

That's exactly the problem I see needing to be solved for the
"rebuild btrees in free space" type of long running operation that
repair will need to run.

i.e. an out-of-line btree rebuild that involves allocating space
that should be freed again if the rebuild fails or we crash during
the rebuild. Hence an EFI needs to be held for each allocation we
do until we get the tree rebuilt and do the atomic swap of the root
block. That atomic swap transaction also needs to commit all the
EFDs, as the space is now in use and we need to cancel the EFIs.(+)

Hence the rolling transaction will need to relog the primary "btree
rebuild in progress" intent/intent-done pair as well as all the all
the EFI/EFD pairs it holds for the extents it has allocated so far.
The subsystem will have to co-ordinate the intent commit callback
notification with it's ongoing transactional work that is done under
it's rolling transaction context.

IOWs, there's a whole lot more work needed than just updating a
single intent pair in "btree rebuild" situation like this. I also
don't really see how a "log internal" relogging mechanism based on
stealing reservations will be able to handle the complexity of
atomic multiple intent state updates. That requires
subsystem/application specific knowledge to understand the
relationship between all the intents that need to be relogged....

(+) The deferops and/or EFI code probably needs modification to
support non-freeing, delayed EFD processing like this - it needs to
log and track the intents it holds and relog them, but not free them
or run EFDs until a later point in time. i.e. it becomes 2-part
deferred op mechanism, with separate control of the intent-done
processing phase. I'd like to use this mechanism for DIO and
buffered writeback allocation (so we don't need to use unwritten
extents), but I haven't had time to dig into it yet...

> All in all this sounds interesting. I still need to grok the transaction
> reservation/ownership semantics you propose re: the questions above, but
> I do like the prospect of reusing existing mechanisms and thus being
> able to more easily handle generic log items. Thanks for the
> feedback...

Yeah, i'd much prefer to implement something that fits within the
existing model, too :)

Cheers,

Dave.
Brian Foster Oct. 28, 2019, 2:26 p.m. UTC | #4
On Sat, Oct 26, 2019 at 10:16:28AM +1100, Dave Chinner wrote:
> On Fri, Oct 25, 2019 at 08:41:17AM -0400, Brian Foster wrote:
> > On Fri, Oct 25, 2019 at 09:43:08AM +1100, Dave Chinner wrote:
> > > On Thu, Oct 24, 2019 at 01:28:50PM -0400, Brian Foster wrote:
> > > > An experimental mechanism to automatically relog specified log
> > > > items.  This is useful for long running operations, like quotaoff,
> > > > that otherwise risk deadlocking on log space usage.
> > > > 
> > > > Not-signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > ---
> > > > 
> > > > Hi all,
> > > > 
> > > > This is an experiment that came out of a discussion with Darrick[1] on
> > > > some design bits of the latest online btree repair work. Specifically,
> > > > it logs free intents in the same transaction as block allocation to
> > > > guard against inconsistency in the event of a crash during the repair
> > > > sequence. These intents happen pin the log tail for an indeterminate
> > > > amount of time. Darrick was looking for some form of auto relog
> > > > mechanism to facilitate this general approach. It occurred to us that
> > > > this is the same problem we've had with quotaoff for some time, so I
> > > > figured it might be worth prototyping something against that to try and
> > > > prove the concept.
> > > 
> > > Interesting idea. :)
> > > 
> > > > 
> > > > Note that this is RFC because the code and interfaces are a complete
> > > > mess and this is broken in certain ways. This occasionally triggers log
> > > > reservation overrun shutdowns because transaction reservation checking
> > > > has not yet been added, the cancellation path is overkill, etc. IOW, the
> > > > purpose of this patch is purely to test a concept.
> > > 
> > > *nod*
> > > 
> > > > The concept is essentially to flag a log item for relogging on first
> > > > transaction commit such that once it commits to the AIL, the next
> > > > transaction that happens to commit with sufficient unused reservation
> > > > opportunistically relogs the item to the current CIL context. For the
> > > > log intent case, the transaction that commits the done item is required
> > > > to cancel the relog state of the original intent to prevent further
> > > > relogging.
> > > 
> > > Makes sense, but it seems like we removed the hook that would be
> > > used by transactions to implement their own relogging on CIL commit
> > > some time ago because nothign had used it for 15+ years....
> > > 
> > 
> > Interesting, I'm not familiar with this...
> > 
> > > > In practice, this allows a log item to automatically roll through CIL
> > > > checkpoints and not pin the tail of the log while something like a
> > > > quotaoff is running for a potentially long period of time. This is
> > > > applied to quotaoff and focused testing shows that it avoids the
> > > > associated deadlock.
> > > 
> > > Hmmm. How do we deal with multiple identical intents being found in
> > > checkpoints with different LSNs in log recovery?
> > > 
> > 
> > Good question. I was kind of thinking they would be handled like
> > normally relogged items, but I hadn't got to recovery testing yet. For
> > quotaoff, no special handling is required because we simply turn off
> > quota flags (i.e. no filesystem changes are required) when the intent is
> > seen and don't do anything else with the log item. FWIW, we don't even
> > seem to check for an associated quotaoff_end item.
> > 
> > For something more involved like an EFI, it looks like we'd create a
> > duplicate log item for the intent and I suspect that would lead to a
> > double processing attempt. So this would require some further changes to
> > handle generic intent relogging properly.
> 
> Right, that's kinda what I was getting at, but most intents as
> currently implemented are somewhat special from a "relogging"
> perspective in that they don't track ongoing modifications that have
> been made since the intent was originally logged.
> 
> > I don't think it's that
> > difficult of a problem; we do already allow for relogs of other things
> > obviously, we just don't currently do any tracking of already seen
> > intents. That said, this could probably be considered an ABI change if
> > older kernels don't know how to process the condition correctly, so we'd
> > have to guard against that with a feature bit or some such when a
> > filesystem isn't unmounted cleanly.
> 
> I don't think tracking duplicate intents is a difficult problem,
> either. The problem I see is when the original intent drops a full
> log cycle behind (i.e. out of the active recovery window) before
> the log recovery terminates and fails to find the intent-done item,
> hence we are left in a situation where -some- of the work has been
> done, but we don't know exactly what...
> 
> Intents were designed to work within a permanent (rolling)
> transaction sequence - the intent and the done intent are directly
> related and the work that is done is done within the context of that
> rolling transaction. Expanding an intent to span an unbound amount
> of work should be ok if this work is done in the context of a single
> rolling transaction and we can relog the intent safely. However,
> Allowing intents to cover multiple independent transactions is a big
> can of worms I really don't us want to open...
> 

Oh, agreed... What I'm saying above is that automatic relogging as
defined here is not intended for tracking progress as such. It simply
moves along an intent that isn't ever modified. If it is modified, it's
the responsibility of the caller to relog/roll/whatever.

> > More broadly, this implies that we don't ever currently relog intents so
> > this is something that will come along with any implementation that
> > intends to do so unless we adopt an alternative along the lines of what
> > dfops processing does. Dfops completes the original intent and creates a
> > new one on each internal transaction roll.
> 
> Right - chaining intents gives us atomic transactions from a
> recovery perspective. This works because intents cover only a small
> operation, so individual intents will never pin the tail of the log
> over the scope of the larger modification being made. This avoids
> the can of worms that open-ended, transaction indepedent intents
> expose....
> 
> > My early thought was that
> > would be overkill for relogs that are more intended to facilitate
> > unrelated progress (i.e. not pin the log tail) as opposed to track
> > progress in a particular operation (i.e. dfops freed 1 extent out of a 2
> > extent EFI, roll, complete the original EFI and create a new EFI with 1
> > extent left). The item never really changes in the first case where it
> > does in the second.
> 
> The EFI is probably a bad example here - they were designed to relog
> the EFD on partial completion, not the EFI. i.e. the EFI can log an
> aribtrary number of extents to free, the EFD gets relogged on each
> extent free completion until all extents in the EFI are freed. The
> EFI itself was never intended to be relogged in the situation you
> describe above.(*)
> 
> So, essentially, the EFD is the reloggable object that tracks
> completion, and so to relog an EFI, we could just  we could just
> relog the original EFI with the current EFD state indicating how
> much of the EFI had been completed. We already have to roll the
> inode in each of the EFD committing transactions to prevent it from
> pinning the tail of the log, so the EFI/EFD pair could easily be
> relogged like this, too. (**)
> 
> But for intent/intent done pairs that aren't designed for a rolling
> completion like the EFD are more problematic, especially if we
> aren't actually logging a completion state but just relogging the
> intent. If the work being done is not in the same transaction
> context as the intent (i.e. independent transactions rather than a
> rolling context), then the intent cannot be sanely relogged without
> modification as there is work that is in the journal between the two
> intents that cannot be tracked by the intent relogging.
> 

Right.

> i.e. if the intent falls off the tail of the log, then we have the
> situation of having a partial operation already on stable storage
> but no way of tracking how much work has been done. Hence I think we
> must confine relogging of intents to a single permanent transaction
> context, and intents that can get relogged must have an intent done
> logged with them to indicate progress that has been made. And, of
> course, log recovery will need mods to handle this, too.
> 

I'm not quite following what you mean by an intent falling off the tail
of the log. This patch modifies quotaoff because it presents a simple
and preexisting example of the problem it attempts to solve. Does this
scenario apply to that use case, or are you considering a more involved
use case here (such as btree reconstruction)?

> (*) That iterative partial EFD completion was never implemented in
> the original design - I don't know the reason as it's not in the
> commit history - but I can guess that it was because the EFI pinned
> the tail of the log and caused deadlocks when the log was small.
> Accounting for this in reservations made the reservation size blow
> out and so defeated the purpose of rolling transactions to keep the
> reservation size down.
> 

Heh, Ok.. I was wondering why I hadn't seen code that actually does
anything like this. :P

> (**) We only ever log single extent EFIs even with defer ops because
> of XFS_BUI_MAX_FAST_EXTENTS = 1.
> 
> > > commit d420e5c810bce5debce0238021b410d0ef99cf08
> > > Author: Dave Chinner <dchinner@redhat.com>
> > > Date:   Tue Oct 15 09:17:53 2013 +1100
> > > 
> > >     xfs: remove unused transaction callback variables
> > >     
> > >     We don't do callbacks at transaction commit time, no do we have any
> > >     infrastructure to set up or run such callbacks, so remove the
> > >     variables and typedefs for these operations. If we ever need to add
> > >     callbacks, we can reintroduce the variables at that time.
> > >     
> > >     Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > >     Reviewed-by: Ben Myers <bpm@sgi.com>
> > >     Signed-off-by: Ben Myers <bpm@sgi.com>
> > > 
> > > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > > index 09cf40b89e8c..71c835e9e810 100644
> > > --- a/fs/xfs/xfs_trans.h
> > > +++ b/fs/xfs/xfs_trans.h
> > > @@ -85,18 +85,11 @@ struct xfs_item_ops {
> > >  #define XFS_ITEM_LOCKED                2
> > >  #define XFS_ITEM_FLUSHING      3
> > >  
> > > -/*
> > > - * This is the type of function which can be given to xfs_trans_callback()
> > > - * to be called upon the transaction's commit to disk.
> > > - */
> > > -typedef void (*xfs_trans_callback_t)(struct xfs_trans *, void *);
> > > -
> > >  /*
> > >   * This is the structure maintained for every active transaction.
> > >   */
> > >  typedef struct xfs_trans {
> > >         unsigned int            t_magic;        /* magic number */
> > > -       xfs_log_callback_t      t_logcb;        /* log callback struct */
> > >         unsigned int            t_type;         /* transaction type */
> > >         unsigned int            t_log_res;      /* amt of log space resvd */
> > >         unsigned int            t_log_count;    /* count for perm log res */
> > > 
> > > That's basically the functionality we want here - when the log item
> > > hits the journal, we want a callback to tell us so we can relog it
> > > ourselves if deemed necessary. i.e. it's time to reintroduce the
> > > transaction/log commit callback infrastructure...
> > > 
> > 
> > As noted earlier, this old mechanism is new to me.. This commit just
> > appears to remove an unused hook and the surrounding commits don't look
> > related. Was this previously used for something in particular?
> 
> This is how xfs_trans_committed() used to be attached to the
> iclogbuf and called on completion. It was added back in 1994, and
> only ever used to call xfs_trans_committed() in the XFS code. IIRC,
> there was also an intent to use it in the CXFS metadata server to
> co-ordinate internal server state with the on-disk filesystem state,
> but I don't think it ever happened as cluster-wide state recovery
> after a system/server crash was already huge problem for CXFS
> without adding custom logging interactions to it...
> 
> The code, prior to it's removal as delayed logging didn't use it
> anymore, was:
> 
>        /*
>         * Tell the LM to call the transaction completion routine
>         * when the log write with LSN commit_lsn completes (e.g.
>         * when the transaction commit really hits the on-disk log).
>         * After this call we cannot reference tp, because the call
>         * can happen at any time and the call will free the transaction
>         * structure pointed to by tp.  The only case where we call
>         * the completion routine (xfs_trans_committed) directly is
>         * if the log is turned off on a debug kernel or we're
>         * running in simulation mode (the log is explicitly turned
>         * off).
>         */
>        tp->t_logcb.cb_func = xfs_trans_committed;
>        tp->t_logcb.cb_arg = tp;
> 
>        /*
>         * We need to pass the iclog buffer which was used for the
>         * transaction commit record into this function, and attach
>         * the callback to it. The callback must be attached before
>         * the items are unlocked to avoid racing with other threads
>         * waiting for an item to unlock.
>         */
>        shutdown = xfs_log_notify(mp, commit_iclog, &(tp->t_logcb));
> 
> IOWs, we used to attach individual transactions directly to the
> iclog to be called back from iclog completion and processed
> directly. Essentially, it is a "commit is now stable" notification
> mechanism.
> 

Ok, so the transaction memory lifecycle was slightly longer compared to
the current model. I take it this was ultimately replaced with the CIL
context.

> > > This would get used in conjunction with a permanent transaction
> > > reservation, allowing the owner of the object to keep it locked over
> > > transaction commit (while whatever work is running between intent
> > > and done), and the transaction commit turns into a trans_roll. Then
> > > we reserve space for the next relogging commit, and go about our
> > > business.
> > > 
> > 
> > Hmm.. IIUC the original intent committer rolls instead of commits to
> > reacquire relogging reservation, but what if the submitter needs to
> > allocate and commit unrelated transactions before it gets to the point
> > of completing the intent?
> 
> Not allowed. The intent + intent done must be committed as part of
> the same rolling transaction sequence. Commiting an intent without a
> reservation for the intent-done is a deadlock vector. i.e. intent
> pins the tail of the log, can't reserve space for the intent-done to
> remove it from the log.
> 

Right, but this is exactly why quotaoff doesn't use such a rolling
transaction. The quotaoff sequence itself can produce transaction
allocations from inaccessible contexts (i.e. inode inactivation), so it
can't hold an open transaction during the dquot scanning and whatnot.
Instead, it commits the start intent, performs the necessary work, and
allocates a new transaction for quotaoff_end once it is safe again to
reserve log space. This of course is a deadlock vector because the tail
was pinned from the start, the indirect transaction activity generated
by quotaoff is unknown and the filesystem is still open to unrelated
workloads.

The problem this patch attempts to resolve is simply to keep the start
intent moving in the log, explicitly because it is not modified in any
way that tracks progress of quotaoff.

> Fundamentally, it is the responsibility of the running transaction
> to ensure objects it logs do not pin the tail of the log and prevent
> forwards progress of the transaction. The whole subsystem is
> designed around this premise. Hence while I can see how automatic
> relogging of intents like you've proposed is appealing, but it
> doesn't follow the "forwards progress guarantee" rules that the
> transactional model is built around.
> 

Yeah, that makes sense. This is how dfops is essentially designed, but
the purpose of this is to deal with the context issue described above.
It is explicitly intended only for items that have not made progress
updates, but otherwise might be the only object pinning the tail of the
log.

Based on your feedback to this point, it seems to me that you're
expecting that this kind of indirect transaction scenario shouldn't
apply to btree reconstruction. That's fine, it just means that perhaps
these two use cases (from a logging perspective) aren't quite as similar
as I anticipated.

> Hence I think that if we have a long running intent that requires
> relogging, it needs to be done under a single rolling transaction
> context, not indepednent, unrelated transactions. We can roll
> permanent transactions an unbound number of times (just think about
> freeing several million extents in truncate) without penalty if they
> follow the forwards progress rules, so the only thing we need to do
> here is consider how to relog an intent pair for a specific
> operation safely.
> 

Yeah, I can see how rolling to maintain the forward progress guarantee
prevents log deadlocks. I could also see how that still leaves practical
issues for long running operations. For example, if the btree
reconstruction phase takes a relative long amount of time without any
transaction commits (and with outstanding intents), then we risk pinning
the log tail and blocking (though not deadlocking) any other unrelated
operations until the repair completes. I suspect this is where the
transaction notification thing comes in, because that provides us the
hook to notify the transaction that "it's time to move this thing
along," whether that be on CIL context checkpoint as I've done here or
something more intelligent down the line.

> > I was originally thinking about "pre-reserving
> > and donating" relogging reservation along these lines, but I thought
> > reserving transactions like this (first for the intent+relog, then for
> > whatever random transaction is next) was a potential deadlock vector.
> 
> This is what rolling transactions do. The reservation for any
> specific transaction in a sequence is guaranteed to be enough to
> for the next transaction in the sequence to run to completion.
> 
> > Perhaps it's not if the associated items have all been committed to the
> > log subsystem. It also seemed unnecessary given our current design of
> > worst case reservation sizes, but that's a separate topic and may only
> > apply to intents (which are small compared to metadata associated with
> > generic log items).
> > 
> > So are you suggesting ownership of the committed transaction transfers
> > to the log subsystem somehow or another? For example, we internally roll
> > and queue the _transaction_ (not the log item) for relogging while from
> > the caller's perspective the transaction has committed? Or the
> > additional reservation is pulled off and the transaction commits (i.e.
> > frees)? Or something else?
> 
> Not internal. Subsystem adds a callback to the transaction it is
> about to commit, that gets moved to the CIL context, when
> the CIL commits and the iclog completion processing is done then
> the CIL commit callbacks are run with whatever context the caller
> attached to it.
> 

Ok. The context provided above helps and I like the idea in general.
I'm still missing some pieces around how this would be used in something
like the btree reconstruction case, however.

Suppose the intents that need relogging are committed with associated
callbacks and context, the btree reconstruction is in progress and the
callback is invoked. Would the callback handler simply just roll the
transaction in this case and relog the attached items? (If the former,
isn't it eventually a deadlock vector to roll from log commit context
once the rolling transaction runs out of ticket counts?). Or are you
anticipating something more complex in terms of the callback notifying
the repair sequence (somehow) that it needs to relog its rolling context
transaction at the next opportunity?

I also think it's worth distinguishing between quotaoff and the repair
use case at this point. As noted above, this doesn't really address the
former and if we do take this callback approach, I'd like to find a way
to apply it there if at all possible. Admittedly, quotaoff is a bit of a
hacky use case as it is, so I'd even be fine with something along the
lines of running it it two separate (but coordinated) tasks[1] (i.e., one
dedicated to rolling and relogging the start intent and another to do
the actual quotaoff work), as long as the approach is ultimately safe
and resolves the problem. Thoughts?

[1] That could be anything from a specific quotaoff task to a more
generic background relog worker that could be shared across users and
batch to fewer transactions.

> > FWIW, I'm also wondering if this lends itself to some form of batching
> > for if we get to the point of relogging a large number of small items.
> > For example, consider a single dedicated/relogging transaction for many
> > small (or related) intents vs. N independent transactions processing in
> > the background. This is something that could came later once the
> > fundamentals are worked out, however.
> 
> That's exactly the problem I see needing to be solved for the
> "rebuild btrees in free space" type of long running operation that
> repair will need to run.
> 
> i.e. an out-of-line btree rebuild that involves allocating space
> that should be freed again if the rebuild fails or we crash during
> the rebuild. Hence an EFI needs to be held for each allocation we
> do until we get the tree rebuilt and do the atomic swap of the root
> block. That atomic swap transaction also needs to commit all the
> EFDs, as the space is now in use and we need to cancel the EFIs.(+)
> 
> Hence the rolling transaction will need to relog the primary "btree
> rebuild in progress" intent/intent-done pair as well as all the all
> the EFI/EFD pairs it holds for the extents it has allocated so far.
> The subsystem will have to co-ordinate the intent commit callback
> notification with it's ongoing transactional work that is done under
> it's rolling transaction context.
> 
> IOWs, there's a whole lot more work needed than just updating a
> single intent pair in "btree rebuild" situation like this. I also
> don't really see how a "log internal" relogging mechanism based on
> stealing reservations will be able to handle the complexity of
> atomic multiple intent state updates. That requires
> subsystem/application specific knowledge to understand the
> relationship between all the intents that need to be relogged....
> 

Yeah, though to be fair you're attributing more responsibility and thus
more complexity than I initially intended for this mechanism. This was
intended to be a "don't deadlock on this unchanged item" mechanism and
thus fairly isolated/limited in use. I'd compare it to something like
ordered buffers in terms of complexity level. I.e., a low level
mechanism for specific use cases and to be managed/understood properly
by associated users.

I've no issue with moving in the direction discussed here to facilitate
more complex higher level mechanisms (like tracking operational progress
of btree reconstruction, etc.), I'm just saying that the complexity
argument you make here changes as the requirements do.

> (+) The deferops and/or EFI code probably needs modification to
> support non-freeing, delayed EFD processing like this - it needs to
> log and track the intents it holds and relog them, but not free them
> or run EFDs until a later point in time. i.e. it becomes 2-part
> deferred op mechanism, with separate control of the intent-done
> processing phase. I'd like to use this mechanism for DIO and
> buffered writeback allocation (so we don't need to use unwritten
> extents), but I haven't had time to dig into it yet...
> 

I can definitely see opening up the dfops interface to drive/control
intent relogging vs. progress updates vs. completion. This is somewhat
complicated by tricks like reusing certain intents in non-traditional
ways, such as completing an EFI with an EFD without actually freeing
blocks. Regardless, my questions to this point are more around
usage/semantics of the log commit callback and using it to manage
transaction rolls. Once those building blocks are settled, I'm sure we
can work out a reasonable interface.

Brian

> > All in all this sounds interesting. I still need to grok the transaction
> > reservation/ownership semantics you propose re: the questions above, but
> > I do like the prospect of reusing existing mechanisms and thus being
> > able to more easily handle generic log items. Thanks for the
> > feedback...
> 
> Yeah, i'd much prefer to implement something that fits within the
> existing model, too :)
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Darrick J. Wong Oct. 28, 2019, 6:20 p.m. UTC | #5
[/me finally jumps in with his perspectives]

On Mon, Oct 28, 2019 at 10:26:12AM -0400, Brian Foster wrote:
> On Sat, Oct 26, 2019 at 10:16:28AM +1100, Dave Chinner wrote:
> > On Fri, Oct 25, 2019 at 08:41:17AM -0400, Brian Foster wrote:
> > > On Fri, Oct 25, 2019 at 09:43:08AM +1100, Dave Chinner wrote:
> > > > On Thu, Oct 24, 2019 at 01:28:50PM -0400, Brian Foster wrote:
> > > > > An experimental mechanism to automatically relog specified log
> > > > > items.  This is useful for long running operations, like quotaoff,
> > > > > that otherwise risk deadlocking on log space usage.
> > > > > 
> > > > > Not-signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > > ---
> > > > > 
> > > > > Hi all,
> > > > > 
> > > > > This is an experiment that came out of a discussion with Darrick[1] on
> > > > > some design bits of the latest online btree repair work. Specifically,
> > > > > it logs free intents in the same transaction as block allocation to
> > > > > guard against inconsistency in the event of a crash during the repair
> > > > > sequence. These intents happen pin the log tail for an indeterminate
> > > > > amount of time. Darrick was looking for some form of auto relog
> > > > > mechanism to facilitate this general approach. It occurred to us that
> > > > > this is the same problem we've had with quotaoff for some time, so I
> > > > > figured it might be worth prototyping something against that to try and
> > > > > prove the concept.
> > > > 
> > > > Interesting idea. :)
> > > > 
> > > > > 
> > > > > Note that this is RFC because the code and interfaces are a complete
> > > > > mess and this is broken in certain ways. This occasionally triggers log
> > > > > reservation overrun shutdowns because transaction reservation checking
> > > > > has not yet been added, the cancellation path is overkill, etc. IOW, the
> > > > > purpose of this patch is purely to test a concept.
> > > > 
> > > > *nod*
> > > > 
> > > > > The concept is essentially to flag a log item for relogging on first
> > > > > transaction commit such that once it commits to the AIL, the next
> > > > > transaction that happens to commit with sufficient unused reservation
> > > > > opportunistically relogs the item to the current CIL context. For the
> > > > > log intent case, the transaction that commits the done item is required
> > > > > to cancel the relog state of the original intent to prevent further
> > > > > relogging.
> > > > 
> > > > Makes sense, but it seems like we removed the hook that would be
> > > > used by transactions to implement their own relogging on CIL commit
> > > > some time ago because nothign had used it for 15+ years....
> > > > 
> > > 
> > > Interesting, I'm not familiar with this...
> > > 
> > > > > In practice, this allows a log item to automatically roll through CIL
> > > > > checkpoints and not pin the tail of the log while something like a
> > > > > quotaoff is running for a potentially long period of time. This is
> > > > > applied to quotaoff and focused testing shows that it avoids the
> > > > > associated deadlock.
> > > > 
> > > > Hmmm. How do we deal with multiple identical intents being found in
> > > > checkpoints with different LSNs in log recovery?
> > > > 
> > > 
> > > Good question. I was kind of thinking they would be handled like
> > > normally relogged items, but I hadn't got to recovery testing yet. For
> > > quotaoff, no special handling is required because we simply turn off
> > > quota flags (i.e. no filesystem changes are required) when the intent is
> > > seen and don't do anything else with the log item. FWIW, we don't even
> > > seem to check for an associated quotaoff_end item.
> > > 
> > > For something more involved like an EFI, it looks like we'd create a
> > > duplicate log item for the intent and I suspect that would lead to a
> > > double processing attempt. So this would require some further changes to
> > > handle generic intent relogging properly.
> > 
> > Right, that's kinda what I was getting at, but most intents as
> > currently implemented are somewhat special from a "relogging"
> > perspective in that they don't track ongoing modifications that have
> > been made since the intent was originally logged.
> > 
> > > I don't think it's that
> > > difficult of a problem; we do already allow for relogs of other things
> > > obviously, we just don't currently do any tracking of already seen
> > > intents. That said, this could probably be considered an ABI change if
> > > older kernels don't know how to process the condition correctly, so we'd
> > > have to guard against that with a feature bit or some such when a
> > > filesystem isn't unmounted cleanly.
> > 
> > I don't think tracking duplicate intents is a difficult problem,
> > either. The problem I see is when the original intent drops a full
> > log cycle behind (i.e. out of the active recovery window) before
> > the log recovery terminates and fails to find the intent-done item,
> > hence we are left in a situation where -some- of the work has been
> > done, but we don't know exactly what...
> > 
> > Intents were designed to work within a permanent (rolling)
> > transaction sequence - the intent and the done intent are directly
> > related and the work that is done is done within the context of that
> > rolling transaction. Expanding an intent to span an unbound amount
> > of work should be ok if this work is done in the context of a single
> > rolling transaction and we can relog the intent safely. However,
> > Allowing intents to cover multiple independent transactions is a big
> > can of worms I really don't us want to open...
> > 
> 
> Oh, agreed... What I'm saying above is that automatic relogging as
> defined here is not intended for tracking progress as such. It simply
> moves along an intent that isn't ever modified. If it is modified, it's
> the responsibility of the caller to relog/roll/whatever.

(FWIW that's what I was thinking too.)

> > > More broadly, this implies that we don't ever currently relog intents so
> > > this is something that will come along with any implementation that
> > > intends to do so unless we adopt an alternative along the lines of what
> > > dfops processing does. Dfops completes the original intent and creates a
> > > new one on each internal transaction roll.
> > 
> > Right - chaining intents gives us atomic transactions from a
> > recovery perspective. This works because intents cover only a small
> > operation, so individual intents will never pin the tail of the log
> > over the scope of the larger modification being made. This avoids
> > the can of worms that open-ended, transaction indepedent intents
> > expose....
> > 
> > > My early thought was that
> > > would be overkill for relogs that are more intended to facilitate
> > > unrelated progress (i.e. not pin the log tail) as opposed to track
> > > progress in a particular operation (i.e. dfops freed 1 extent out of a 2
> > > extent EFI, roll, complete the original EFI and create a new EFI with 1
> > > extent left). The item never really changes in the first case where it
> > > does in the second.
> > 
> > The EFI is probably a bad example here - they were designed to relog
> > the EFD on partial completion, not the EFI. i.e. the EFI can log an
> > aribtrary number of extents to free, the EFD gets relogged on each
> > extent free completion until all extents in the EFI are freed. The
> > EFI itself was never intended to be relogged in the situation you
> > describe above.(*)
> > 
> > So, essentially, the EFD is the reloggable object that tracks
> > completion, and so to relog an EFI, we could just  we could just
> > relog the original EFI with the current EFD state indicating how
> > much of the EFI had been completed. We already have to roll the
> > inode in each of the EFD committing transactions to prevent it from
> > pinning the tail of the log, so the EFI/EFD pair could easily be
> > relogged like this, too. (**)
> > 
> > But for intent/intent done pairs that aren't designed for a rolling
> > completion like the EFD are more problematic, especially if we
> > aren't actually logging a completion state but just relogging the
> > intent. If the work being done is not in the same transaction
> > context as the intent (i.e. independent transactions rather than a
> > rolling context), then the intent cannot be sanely relogged without
> > modification as there is work that is in the journal between the two
> > intents that cannot be tracked by the intent relogging.
> > 
> 
> Right.
> 
> > i.e. if the intent falls off the tail of the log, then we have the
> > situation of having a partial operation already on stable storage
> > but no way of tracking how much work has been done. Hence I think we
> > must confine relogging of intents to a single permanent transaction
> > context, and intents that can get relogged must have an intent done
> > logged with them to indicate progress that has been made. And, of
> > course, log recovery will need mods to handle this, too.
> > 
> 
> I'm not quite following what you mean by an intent falling off the tail
> of the log. This patch modifies quotaoff because it presents a simple
> and preexisting example of the problem it attempts to solve. Does this
> scenario apply to that use case, or are you considering a more involved
> use case here (such as btree reconstruction)?

/me wonders, for quotaoff we end up with a transaction sequence:

[start quotaoff] <potentially large pile of transactions> [end quotaoff]

And AFAICT the quotaoff item makes it so that the log recovery code
doesn't bother recovering dquot items, right?  So I'm not sure what
intermediate progress information we need to know about?  Either we've
finished quotaoff and won't be logging more dquots, or we haven't.

I might just be confused, but I think your worry here is that something
like this will happen?

Log an intent, start doing the work:

[log intent][start doing work]..........................

Then other threads log some other work and the head get close to the end:

............[start doing work][other work]..............

So we relog the intent and the other threads continue logging more work.
Then the log wraps, zapping the first intent:

[new work]..[start doing work][other work][relog intent]

Now we crash.  Log recovery recovers "start doing work", then it
recovers "other work", then it notes "relog intent", and finally it
recovers "new work".  Next, it decides to restart "relog intent", which
then trips over the filesystem because the item recovery code is too
dumb to realize that we already did some, but not all of, the intended
work?

(Did I get that right?)

At least for repair the transaction sequence looks roughly like this:

[allocate, EFI0][allocate, EFI1] <long wait to write new btree> \
	[commit root, EFD0, EFD1, EFI2, EFI3][free, EFD2][free, EFD3]

EFI[01] are for the new btree blocks, EFI[23] are to kill the old ones.

So there's not really any intermediate progress that's going through the
log -- either we're ready to commit the new root and mess around with
our EFI set, or we're not.  There's no intermediate progress to trip
over.

(Also it occurs to me just now that since btree reconstruction uses
delwri_submit before committing the new root, it really ought to force
the blocks to media via an explicit flush at the end or something to
make sure that we've really committed the new btree blocks to stable
storage, right?)

> > (*) That iterative partial EFD completion was never implemented in
> > the original design - I don't know the reason as it's not in the
> > commit history - but I can guess that it was because the EFI pinned
> > the tail of the log and caused deadlocks when the log was small.
> > Accounting for this in reservations made the reservation size blow
> > out and so defeated the purpose of rolling transactions to keep the
> > reservation size down.
> > 
> 
> Heh, Ok.. I was wondering why I hadn't seen code that actually does
> anything like this. :P
> 
> > (**) We only ever log single extent EFIs even with defer ops because
> > of XFS_BUI_MAX_FAST_EXTENTS = 1.
> > 
> > > > commit d420e5c810bce5debce0238021b410d0ef99cf08
> > > > Author: Dave Chinner <dchinner@redhat.com>
> > > > Date:   Tue Oct 15 09:17:53 2013 +1100
> > > > 
> > > >     xfs: remove unused transaction callback variables
> > > >     
> > > >     We don't do callbacks at transaction commit time, no do we have any
> > > >     infrastructure to set up or run such callbacks, so remove the
> > > >     variables and typedefs for these operations. If we ever need to add
> > > >     callbacks, we can reintroduce the variables at that time.
> > > >     
> > > >     Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > >     Reviewed-by: Ben Myers <bpm@sgi.com>
> > > >     Signed-off-by: Ben Myers <bpm@sgi.com>
> > > > 
> > > > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > > > index 09cf40b89e8c..71c835e9e810 100644
> > > > --- a/fs/xfs/xfs_trans.h
> > > > +++ b/fs/xfs/xfs_trans.h
> > > > @@ -85,18 +85,11 @@ struct xfs_item_ops {
> > > >  #define XFS_ITEM_LOCKED                2
> > > >  #define XFS_ITEM_FLUSHING      3
> > > >  
> > > > -/*
> > > > - * This is the type of function which can be given to xfs_trans_callback()
> > > > - * to be called upon the transaction's commit to disk.
> > > > - */
> > > > -typedef void (*xfs_trans_callback_t)(struct xfs_trans *, void *);
> > > > -
> > > >  /*
> > > >   * This is the structure maintained for every active transaction.
> > > >   */
> > > >  typedef struct xfs_trans {
> > > >         unsigned int            t_magic;        /* magic number */
> > > > -       xfs_log_callback_t      t_logcb;        /* log callback struct */
> > > >         unsigned int            t_type;         /* transaction type */
> > > >         unsigned int            t_log_res;      /* amt of log space resvd */
> > > >         unsigned int            t_log_count;    /* count for perm log res */
> > > > 
> > > > That's basically the functionality we want here - when the log item
> > > > hits the journal, we want a callback to tell us so we can relog it
> > > > ourselves if deemed necessary. i.e. it's time to reintroduce the
> > > > transaction/log commit callback infrastructure...
> > > > 
> > > 
> > > As noted earlier, this old mechanism is new to me.. This commit just
> > > appears to remove an unused hook and the surrounding commits don't look
> > > related. Was this previously used for something in particular?
> > 
> > This is how xfs_trans_committed() used to be attached to the
> > iclogbuf and called on completion. It was added back in 1994, and
> > only ever used to call xfs_trans_committed() in the XFS code. IIRC,
> > there was also an intent to use it in the CXFS metadata server to
> > co-ordinate internal server state with the on-disk filesystem state,
> > but I don't think it ever happened as cluster-wide state recovery
> > after a system/server crash was already huge problem for CXFS
> > without adding custom logging interactions to it...
> > 
> > The code, prior to it's removal as delayed logging didn't use it
> > anymore, was:
> > 
> >        /*
> >         * Tell the LM to call the transaction completion routine
> >         * when the log write with LSN commit_lsn completes (e.g.
> >         * when the transaction commit really hits the on-disk log).
> >         * After this call we cannot reference tp, because the call
> >         * can happen at any time and the call will free the transaction
> >         * structure pointed to by tp.  The only case where we call
> >         * the completion routine (xfs_trans_committed) directly is
> >         * if the log is turned off on a debug kernel or we're
> >         * running in simulation mode (the log is explicitly turned
> >         * off).
> >         */
> >        tp->t_logcb.cb_func = xfs_trans_committed;
> >        tp->t_logcb.cb_arg = tp;
> > 
> >        /*
> >         * We need to pass the iclog buffer which was used for the
> >         * transaction commit record into this function, and attach
> >         * the callback to it. The callback must be attached before
> >         * the items are unlocked to avoid racing with other threads
> >         * waiting for an item to unlock.
> >         */
> >        shutdown = xfs_log_notify(mp, commit_iclog, &(tp->t_logcb));
> > 
> > IOWs, we used to attach individual transactions directly to the
> > iclog to be called back from iclog completion and processed
> > directly. Essentially, it is a "commit is now stable" notification
> > mechanism.
> > 
> 
> Ok, so the transaction memory lifecycle was slightly longer compared to
> the current model. I take it this was ultimately replaced with the CIL
> context.
> 
> > > > This would get used in conjunction with a permanent transaction
> > > > reservation, allowing the owner of the object to keep it locked over
> > > > transaction commit (while whatever work is running between intent
> > > > and done), and the transaction commit turns into a trans_roll. Then
> > > > we reserve space for the next relogging commit, and go about our
> > > > business.
> > > > 
> > > 
> > > Hmm.. IIUC the original intent committer rolls instead of commits to
> > > reacquire relogging reservation, but what if the submitter needs to
> > > allocate and commit unrelated transactions before it gets to the point
> > > of completing the intent?
> > 
> > Not allowed. The intent + intent done must be committed as part of
> > the same rolling transaction sequence. Commiting an intent without a
> > reservation for the intent-done is a deadlock vector. i.e. intent
> > pins the tail of the log, can't reserve space for the intent-done to
> > remove it from the log.
> > 
> 
> Right, but this is exactly why quotaoff doesn't use such a rolling
> transaction. The quotaoff sequence itself can produce transaction
> allocations from inaccessible contexts (i.e. inode inactivation), so it
> can't hold an open transaction during the dquot scanning and whatnot.
> Instead, it commits the start intent, performs the necessary work, and
> allocates a new transaction for quotaoff_end once it is safe again to
> reserve log space. This of course is a deadlock vector because the tail
> was pinned from the start, the indirect transaction activity generated
> by quotaoff is unknown and the filesystem is still open to unrelated
> workloads.
> 
> The problem this patch attempts to resolve is simply to keep the start
> intent moving in the log, explicitly because it is not modified in any
> way that tracks progress of quotaoff.
> 
> > Fundamentally, it is the responsibility of the running transaction
> > to ensure objects it logs do not pin the tail of the log and prevent
> > forwards progress of the transaction. The whole subsystem is
> > designed around this premise. Hence while I can see how automatic
> > relogging of intents like you've proposed is appealing, but it
> > doesn't follow the "forwards progress guarantee" rules that the
> > transactional model is built around.
> > 
> 
> Yeah, that makes sense. This is how dfops is essentially designed, but
> the purpose of this is to deal with the context issue described above.
> It is explicitly intended only for items that have not made progress
> updates, but otherwise might be the only object pinning the tail of the
> log.
> 
> Based on your feedback to this point, it seems to me that you're
> expecting that this kind of indirect transaction scenario shouldn't
> apply to btree reconstruction. That's fine, it just means that perhaps
> these two use cases (from a logging perspective) aren't quite as similar
> as I anticipated.
> 
> > Hence I think that if we have a long running intent that requires
> > relogging, it needs to be done under a single rolling transaction
> > context, not indepednent, unrelated transactions. We can roll
> > permanent transactions an unbound number of times (just think about
> > freeing several million extents in truncate) without penalty if they
> > follow the forwards progress rules, so the only thing we need to do
> > here is consider how to relog an intent pair for a specific
> > operation safely.
> > 
> 
> Yeah, I can see how rolling to maintain the forward progress guarantee
> prevents log deadlocks. I could also see how that still leaves practical
> issues for long running operations. For example, if the btree
> reconstruction phase takes a relative long amount of time without any
> transaction commits (and with outstanding intents), then we risk pinning
> the log tail and blocking (though not deadlocking) any other unrelated
> operations until the repair completes. I suspect this is where the
> transaction notification thing comes in, because that provides us the
> hook to notify the transaction that "it's time to move this thing
> along," whether that be on CIL context checkpoint as I've done here or
> something more intelligent down the line.
> 
> > > I was originally thinking about "pre-reserving
> > > and donating" relogging reservation along these lines, but I thought
> > > reserving transactions like this (first for the intent+relog, then for
> > > whatever random transaction is next) was a potential deadlock vector.
> > 
> > This is what rolling transactions do. The reservation for any
> > specific transaction in a sequence is guaranteed to be enough to
> > for the next transaction in the sequence to run to completion.
> > 
> > > Perhaps it's not if the associated items have all been committed to the
> > > log subsystem. It also seemed unnecessary given our current design of
> > > worst case reservation sizes, but that's a separate topic and may only
> > > apply to intents (which are small compared to metadata associated with
> > > generic log items).
> > > 
> > > So are you suggesting ownership of the committed transaction transfers
> > > to the log subsystem somehow or another? For example, we internally roll
> > > and queue the _transaction_ (not the log item) for relogging while from
> > > the caller's perspective the transaction has committed? Or the
> > > additional reservation is pulled off and the transaction commits (i.e.
> > > frees)? Or something else?
> > 
> > Not internal. Subsystem adds a callback to the transaction it is
> > about to commit, that gets moved to the CIL context, when
> > the CIL commits and the iclog completion processing is done then
> > the CIL commit callbacks are run with whatever context the caller
> > attached to it.
> > 
> 
> Ok. The context provided above helps and I like the idea in general.
> I'm still missing some pieces around how this would be used in something
> like the btree reconstruction case, however.
> 
> Suppose the intents that need relogging are committed with associated
> callbacks and context, the btree reconstruction is in progress and the
> callback is invoked. Would the callback handler simply just roll the
> transaction in this case and relog the attached items? (If the former,
> isn't it eventually a deadlock vector to roll from log commit context
> once the rolling transaction runs out of ticket counts?). Or are you
> anticipating something more complex in terms of the callback notifying
> the repair sequence (somehow) that it needs to relog its rolling context
> transaction at the next opportunity?

I was thinking (this morning, on IRC :P) that log items could have a
"hey we're getting full, please relog" handler.  When the head gets more
than (say) 75% of the log size past the tail, we call the handler, if
one was supplied.

Repair of course supplies a handle, so it just kicks off a workqueue
item to allocate a new transaction (hah!) that logs an intent done for
the existing intent, logs a new identical intent, stuffs that back into
repair's bookkeeping, and commits the transaction...

> I also think it's worth distinguishing between quotaoff and the repair
> use case at this point. As noted above, this doesn't really address the
> former and if we do take this callback approach, I'd like to find a way
> to apply it there if at all possible. Admittedly, quotaoff is a bit of a
> hacky use case as it is, so I'd even be fine with something along the
> lines of running it it two separate (but coordinated) tasks[1] (i.e., one
> dedicated to rolling and relogging the start intent and another to do
> the actual quotaoff work), as long as the approach is ultimately safe
> and resolves the problem. Thoughts?
> 
> [1] That could be anything from a specific quotaoff task to a more
> generic background relog worker that could be shared across users and
> batch to fewer transactions.

...like this, perhaps.

> > > FWIW, I'm also wondering if this lends itself to some form of batching
> > > for if we get to the point of relogging a large number of small items.
> > > For example, consider a single dedicated/relogging transaction for many
> > > small (or related) intents vs. N independent transactions processing in
> > > the background. This is something that could came later once the
> > > fundamentals are worked out, however.
> > 
> > That's exactly the problem I see needing to be solved for the
> > "rebuild btrees in free space" type of long running operation that
> > repair will need to run.
> > 
> > i.e. an out-of-line btree rebuild that involves allocating space
> > that should be freed again if the rebuild fails or we crash during
> > the rebuild. Hence an EFI needs to be held for each allocation we
> > do until we get the tree rebuilt and do the atomic swap of the root
> > block. That atomic swap transaction also needs to commit all the
> > EFDs, as the space is now in use and we need to cancel the EFIs.(+)
> > 
> > Hence the rolling transaction will need to relog the primary "btree
> > rebuild in progress" intent/intent-done pair as well as all the all
> > the EFI/EFD pairs it holds for the extents it has allocated so far.
> > The subsystem will have to co-ordinate the intent commit callback
> > notification with it's ongoing transactional work that is done under
> > it's rolling transaction context.
> > 
> > IOWs, there's a whole lot more work needed than just updating a
> > single intent pair in "btree rebuild" situation like this. I also
> > don't really see how a "log internal" relogging mechanism based on
> > stealing reservations will be able to handle the complexity of
> > atomic multiple intent state updates. That requires
> > subsystem/application specific knowledge to understand the
> > relationship between all the intents that need to be relogged....
> > 
> 
> Yeah, though to be fair you're attributing more responsibility and thus
> more complexity than I initially intended for this mechanism. This was
> intended to be a "don't deadlock on this unchanged item" mechanism and
> thus fairly isolated/limited in use. I'd compare it to something like
> ordered buffers in terms of complexity level. I.e., a low level
> mechanism for specific use cases and to be managed/understood properly
> by associated users.
> 
> I've no issue with moving in the direction discussed here to facilitate
> more complex higher level mechanisms (like tracking operational progress
> of btree reconstruction, etc.), I'm just saying that the complexity
> argument you make here changes as the requirements do.
> 
> > (+) The deferops and/or EFI code probably needs modification to
> > support non-freeing, delayed EFD processing like this - it needs to
> > log and track the intents it holds and relog them, but not free them
> > or run EFDs until a later point in time. i.e. it becomes 2-part
> > deferred op mechanism, with separate control of the intent-done
> > processing phase. I'd like to use this mechanism for DIO and
> > buffered writeback allocation (so we don't need to use unwritten
> > extents), but I haven't had time to dig into it yet...

Heh, I suspected this was going to come up in this discussion. :)

> I can definitely see opening up the dfops interface to drive/control
> intent relogging vs. progress updates vs. completion. This is somewhat
> complicated by tricks like reusing certain intents in non-traditional
> ways, such as completing an EFI with an EFD without actually freeing
> blocks. Regardless, my questions to this point are more around
> usage/semantics of the log commit callback and using it to manage
> transaction rolls. Once those building blocks are settled, I'm sure we
> can work out a reasonable interface.

<nod>

--D

> 
> Brian
> 
> > > All in all this sounds interesting. I still need to grok the transaction
> > > reservation/ownership semantics you propose re: the questions above, but
> > > I do like the prospect of reusing existing mechanisms and thus being
> > > able to more easily handle generic log items. Thanks for the
> > > feedback...
> > 
> > Yeah, i'd much prefer to implement something that fits within the
> > existing model, too :)
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
>
Brian Foster Oct. 29, 2019, 1:57 p.m. UTC | #6
On Mon, Oct 28, 2019 at 11:20:38AM -0700, Darrick J. Wong wrote:
> [/me finally jumps in with his perspectives]
> 
> On Mon, Oct 28, 2019 at 10:26:12AM -0400, Brian Foster wrote:
> > On Sat, Oct 26, 2019 at 10:16:28AM +1100, Dave Chinner wrote:
> > > On Fri, Oct 25, 2019 at 08:41:17AM -0400, Brian Foster wrote:
> > > > On Fri, Oct 25, 2019 at 09:43:08AM +1100, Dave Chinner wrote:
> > > > > On Thu, Oct 24, 2019 at 01:28:50PM -0400, Brian Foster wrote:
> > > > > > An experimental mechanism to automatically relog specified log
> > > > > > items.  This is useful for long running operations, like quotaoff,
> > > > > > that otherwise risk deadlocking on log space usage.
> > > > > > 
> > > > > > Not-signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > > > ---
> > > > > > 
...
> > 
> > > i.e. if the intent falls off the tail of the log, then we have the
> > > situation of having a partial operation already on stable storage
> > > but no way of tracking how much work has been done. Hence I think we
> > > must confine relogging of intents to a single permanent transaction
> > > context, and intents that can get relogged must have an intent done
> > > logged with them to indicate progress that has been made. And, of
> > > course, log recovery will need mods to handle this, too.
> > > 
> > 
> > I'm not quite following what you mean by an intent falling off the tail
> > of the log. This patch modifies quotaoff because it presents a simple
> > and preexisting example of the problem it attempts to solve. Does this
> > scenario apply to that use case, or are you considering a more involved
> > use case here (such as btree reconstruction)?
> 
> /me wonders, for quotaoff we end up with a transaction sequence:
> 
> [start quotaoff] <potentially large pile of transactions> [end quotaoff]
> 
> And AFAICT the quotaoff item makes it so that the log recovery code
> doesn't bother recovering dquot items, right?  So I'm not sure what
> intermediate progress information we need to know about?  Either we've
> finished quotaoff and won't be logging more dquots, or we haven't.
> 
> I might just be confused, but I think your worry here is that something
> like this will happen?
> 
> Log an intent, start doing the work:
> 
> [log intent][start doing work]..........................
> 
> Then other threads log some other work and the head get close to the end:
> 
> ............[start doing work][other work]..............
> 
> So we relog the intent and the other threads continue logging more work.
> Then the log wraps, zapping the first intent:
> 
> [new work]..[start doing work][other work][relog intent]
> 
> Now we crash.  Log recovery recovers "start doing work", then it
> recovers "other work", then it notes "relog intent", and finally it
> recovers "new work".  Next, it decides to restart "relog intent", which
> then trips over the filesystem because the item recovery code is too
> dumb to realize that we already did some, but not all of, the intended
> work?
> 

Hm, Ok.. that sort of makes sense in terms of a problem statement. I
don't see it how it necessarily applies to the mechanism in this patch
because the use case covers extents that don't carry partial progress as
such. For example, from a quotaoff recovery standpoint, there's no real
difference whether it sees the original intent or a relogged variant
(the issue of seeing both notwithstanding). My understanding was the
same thing sort of applies for the repair use case because even though
an EFI does support partial progress relogging (via EFD, as Dave points
out), the free operations would only ever occur in error scenarios.

> (Did I get that right?)
> 
> At least for repair the transaction sequence looks roughly like this:
> 
> [allocate, EFI0][allocate, EFI1] <long wait to write new btree> \
> 	[commit root, EFD0, EFD1, EFI2, EFI3][free, EFD2][free, EFD3]
> 
> EFI[01] are for the new btree blocks, EFI[23] are to kill the old ones.
> 

Yep, that's my understanding from your previous descriptions. So
relogging (at least as defined by this version of the rfc) only really
applies to EFI[01] and only for the duration of the btree
reconstruction. It obviously needs to cancel to commit the associated
EFDs on repair completion (success), but the same would probably hold
true if the repair failed and we actually wanted to return the blocks to
the free space trees instead of shutdown.

> So there's not really any intermediate progress that's going through the
> log -- either we're ready to commit the new root and mess around with
> our EFI set, or we're not.  There's no intermediate progress to trip
> over.
> 

Indeed.

> (Also it occurs to me just now that since btree reconstruction uses
> delwri_submit before committing the new root, it really ought to force
> the blocks to media via an explicit flush at the end or something to
> make sure that we've really committed the new btree blocks to stable
> storage, right?)
> 

IIRC btree block population are all unlogged changes (hence the delwri
queue), so a flush probably makes sense. Making the final root commit
transaction synchronous might be appropriate, since that eliminates risk
of losing the repair via a crash after xfs_scrub returns back from
kernel space but before the transaction actually commits to the physical
log. If that sync trans commit occurs after delwri queue I/O completion,
the REQ_PREFLUSH associated with log buffer I/O might be sufficient to
guarantee the new btree is persistent.

...
> > > 
> > 
> > Ok. The context provided above helps and I like the idea in general.
> > I'm still missing some pieces around how this would be used in something
> > like the btree reconstruction case, however.
> > 
> > Suppose the intents that need relogging are committed with associated
> > callbacks and context, the btree reconstruction is in progress and the
> > callback is invoked. Would the callback handler simply just roll the
> > transaction in this case and relog the attached items? (If the former,
> > isn't it eventually a deadlock vector to roll from log commit context
> > once the rolling transaction runs out of ticket counts?). Or are you
> > anticipating something more complex in terms of the callback notifying
> > the repair sequence (somehow) that it needs to relog its rolling context
> > transaction at the next opportunity?
> 
> I was thinking (this morning, on IRC :P) that log items could have a
> "hey we're getting full, please relog" handler.  When the head gets more
> than (say) 75% of the log size past the tail, we call the handler, if
> one was supplied.
> 
> Repair of course supplies a handle, so it just kicks off a workqueue
> item to allocate a new transaction (hah!) that logs an intent done for
> the existing intent, logs a new identical intent, stuffs that back into
> repair's bookkeeping, and commits the transaction...
> 

Yep, I think that's what Dave was getting at with regard to resurrecting
the old transaction callback. That would invoke at roughly the same time
I hardcoded adding to the relog list in this patch (i.e. once per CIL
checkpoint) and would generally trigger the relog/roll event for tracked
items.

The question at that point is more of how to manage the transaction (and
associated reservation) and general communication between the original
intent committer context, the ongoing relog context (i.e. the relog
task/workqueue) and then the original context again in order to cancel
any further relogging of a particular intent that is about to complete.

In thinking more about it, I am starting to wonder whether the initial
transaction roll thing makes sense. For one, the initial context can't
hold a memory reference to a transaction that some other context is
going to roll in the first place because that trans just becomes freed
memory on the first roll event. It might make more sense for the initial
transaction context to roll internally and transfer the new tp to
another context, but that is slightly shady too in that rolled
transactions share a log ticket and the log ticket is explicitly tied to
the allocating task for reservation scheduling purposes (i.e. see
->t_ticket->t_task).

The more I think about that, the more it _seems_ a bit more logical to
do something analogous to the CIL itself where relog context holds a
reservation ticket and the transaction processing is abstracted slightly
from the reservation management. That might facilitate several
independent tasks to acquire and transfer relog reservation to a central
relog context/ticket that could potentially relog many (unrelated) items
in its own transaction(s). That gets a little hairy itself because we
essentially have a dynamic reservation ticket where various relogged
items might be coming and going after each roll, etc., so I'd definitely
need to think about that some more. Perhaps it's good enough for a POC
to just let the relog task allocate its own fixed size transaction for
the time being (as you suggest above) and worry out the proper
reservation management as a next step..

Brian

> > I also think it's worth distinguishing between quotaoff and the repair
> > use case at this point. As noted above, this doesn't really address the
> > former and if we do take this callback approach, I'd like to find a way
> > to apply it there if at all possible. Admittedly, quotaoff is a bit of a
> > hacky use case as it is, so I'd even be fine with something along the
> > lines of running it it two separate (but coordinated) tasks[1] (i.e., one
> > dedicated to rolling and relogging the start intent and another to do
> > the actual quotaoff work), as long as the approach is ultimately safe
> > and resolves the problem. Thoughts?
> > 
> > [1] That could be anything from a specific quotaoff task to a more
> > generic background relog worker that could be shared across users and
> > batch to fewer transactions.
> 
> ...like this, perhaps.
> 
> > > > FWIW, I'm also wondering if this lends itself to some form of batching
> > > > for if we get to the point of relogging a large number of small items.
> > > > For example, consider a single dedicated/relogging transaction for many
> > > > small (or related) intents vs. N independent transactions processing in
> > > > the background. This is something that could came later once the
> > > > fundamentals are worked out, however.
> > > 
> > > That's exactly the problem I see needing to be solved for the
> > > "rebuild btrees in free space" type of long running operation that
> > > repair will need to run.
> > > 
> > > i.e. an out-of-line btree rebuild that involves allocating space
> > > that should be freed again if the rebuild fails or we crash during
> > > the rebuild. Hence an EFI needs to be held for each allocation we
> > > do until we get the tree rebuilt and do the atomic swap of the root
> > > block. That atomic swap transaction also needs to commit all the
> > > EFDs, as the space is now in use and we need to cancel the EFIs.(+)
> > > 
> > > Hence the rolling transaction will need to relog the primary "btree
> > > rebuild in progress" intent/intent-done pair as well as all the all
> > > the EFI/EFD pairs it holds for the extents it has allocated so far.
> > > The subsystem will have to co-ordinate the intent commit callback
> > > notification with it's ongoing transactional work that is done under
> > > it's rolling transaction context.
> > > 
> > > IOWs, there's a whole lot more work needed than just updating a
> > > single intent pair in "btree rebuild" situation like this. I also
> > > don't really see how a "log internal" relogging mechanism based on
> > > stealing reservations will be able to handle the complexity of
> > > atomic multiple intent state updates. That requires
> > > subsystem/application specific knowledge to understand the
> > > relationship between all the intents that need to be relogged....
> > > 
> > 
> > Yeah, though to be fair you're attributing more responsibility and thus
> > more complexity than I initially intended for this mechanism. This was
> > intended to be a "don't deadlock on this unchanged item" mechanism and
> > thus fairly isolated/limited in use. I'd compare it to something like
> > ordered buffers in terms of complexity level. I.e., a low level
> > mechanism for specific use cases and to be managed/understood properly
> > by associated users.
> > 
> > I've no issue with moving in the direction discussed here to facilitate
> > more complex higher level mechanisms (like tracking operational progress
> > of btree reconstruction, etc.), I'm just saying that the complexity
> > argument you make here changes as the requirements do.
> > 
> > > (+) The deferops and/or EFI code probably needs modification to
> > > support non-freeing, delayed EFD processing like this - it needs to
> > > log and track the intents it holds and relog them, but not free them
> > > or run EFDs until a later point in time. i.e. it becomes 2-part
> > > deferred op mechanism, with separate control of the intent-done
> > > processing phase. I'd like to use this mechanism for DIO and
> > > buffered writeback allocation (so we don't need to use unwritten
> > > extents), but I haven't had time to dig into it yet...
> 
> Heh, I suspected this was going to come up in this discussion. :)
> 
> > I can definitely see opening up the dfops interface to drive/control
> > intent relogging vs. progress updates vs. completion. This is somewhat
> > complicated by tricks like reusing certain intents in non-traditional
> > ways, such as completing an EFI with an EFD without actually freeing
> > blocks. Regardless, my questions to this point are more around
> > usage/semantics of the log commit callback and using it to manage
> > transaction rolls. Once those building blocks are settled, I'm sure we
> > can work out a reasonable interface.
> 
> <nod>
> 
> --D
> 
> > 
> > Brian
> > 
> > > > All in all this sounds interesting. I still need to grok the transaction
> > > > reservation/ownership semantics you propose re: the questions above, but
> > > > I do like the prospect of reusing existing mechanisms and thus being
> > > > able to more easily handle generic log items. Thanks for the
> > > > feedback...
> > > 
> > > Yeah, i'd much prefer to implement something that fits within the
> > > existing model, too :)
> > > 
> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com
> >

Patch
diff mbox series

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index a1204424a938..b2d8b2d54df6 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -75,6 +75,33 @@  xlog_cil_iovec_space(
 			sizeof(uint64_t));
 }
 
+static void
+xlog_cil_relog_items(
+	struct xlog		*log,
+	struct xfs_trans	*tp)
+{
+	struct xfs_cil		*cil = log->l_cilp;
+	struct xfs_log_item	*lip;
+
+	ASSERT(tp->t_flags & XFS_TRANS_DIRTY);
+
+	if (list_empty(&cil->xc_relog))
+		return;
+
+	/* XXX: need to check trans reservation, process multiple items, etc. */
+	spin_lock(&cil->xc_relog_lock);
+	lip = list_first_entry_or_null(&cil->xc_relog, struct xfs_log_item, li_cil);
+	if (lip)
+		list_del_init(&lip->li_cil);
+	spin_unlock(&cil->xc_relog_lock);
+
+	if (lip) {
+		xfs_trans_add_item(tp, lip);
+		set_bit(XFS_LI_DIRTY, &lip->li_flags);
+		trace_xfs_cil_relog(lip);
+	}
+}
+
 /*
  * Allocate or pin log vector buffers for CIL insertion.
  *
@@ -1001,6 +1028,8 @@  xfs_log_commit_cil(
 	struct xfs_log_item	*lip, *next;
 	xfs_lsn_t		xc_commit_lsn;
 
+	xlog_cil_relog_items(log, tp);
+
 	/*
 	 * Do all necessary memory allocation before we lock the CIL.
 	 * This ensures the allocation does not deadlock with a CIL
@@ -1196,6 +1225,8 @@  xlog_cil_init(
 	spin_lock_init(&cil->xc_push_lock);
 	init_rwsem(&cil->xc_ctx_lock);
 	init_waitqueue_head(&cil->xc_commit_wait);
+	INIT_LIST_HEAD(&cil->xc_relog);
+	spin_lock_init(&cil->xc_relog_lock);
 
 	INIT_LIST_HEAD(&ctx->committing);
 	INIT_LIST_HEAD(&ctx->busy_extents);
@@ -1223,3 +1254,41 @@  xlog_cil_destroy(
 	kmem_free(log->l_cilp);
 }
 
+void
+xfs_cil_relog_item(
+	struct xlog		*log,
+	struct xfs_log_item	*lip)
+{
+	struct xfs_cil		*cil = log->l_cilp;
+
+	ASSERT(test_bit(XFS_LI_RELOG, &lip->li_flags));
+	ASSERT(list_empty(&lip->li_cil));
+
+	spin_lock(&cil->xc_relog_lock);
+	list_add_tail(&lip->li_cil, &cil->xc_relog);
+	spin_unlock(&cil->xc_relog_lock);
+
+	trace_xfs_cil_relog_queue(lip);
+}
+
+bool
+xfs_cil_relog_steal(
+	struct xlog		*log,
+	struct xfs_log_item	*lip)
+{
+	struct xfs_cil		*cil = log->l_cilp;
+	struct xfs_log_item	*pos, *ppos;
+	bool			ret = false;
+
+	spin_lock(&cil->xc_relog_lock);
+	list_for_each_entry_safe(pos, ppos, &cil->xc_relog, li_cil) {
+		if (pos == lip) {
+			list_del_init(&pos->li_cil);
+			ret = true;
+			break;
+		}
+	}
+	spin_unlock(&cil->xc_relog_lock);
+
+	return ret;
+}
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 4f19375f6592..f75a0a9f6984 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -10,6 +10,7 @@  struct xfs_buf;
 struct xlog;
 struct xlog_ticket;
 struct xfs_mount;
+struct xfs_log_item;
 
 /*
  * Flags for log structure
@@ -275,6 +276,9 @@  struct xfs_cil {
 	wait_queue_head_t	xc_commit_wait;
 	xfs_lsn_t		xc_current_sequence;
 	struct work_struct	xc_push_work;
+
+	struct list_head	xc_relog;
+	spinlock_t		xc_relog_lock;
 } ____cacheline_aligned_in_smp;
 
 /*
@@ -511,6 +515,8 @@  int	xlog_cil_init(struct xlog *log);
 void	xlog_cil_init_post_recovery(struct xlog *log);
 void	xlog_cil_destroy(struct xlog *log);
 bool	xlog_cil_empty(struct xlog *log);
+void	xfs_cil_relog_item(struct xlog *log, struct xfs_log_item *lip);
+bool	xfs_cil_relog_steal(struct xlog *log, struct xfs_log_item *lip);
 
 /*
  * CIL force routines
diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
index da7ad0383037..5e529190029d 100644
--- a/fs/xfs/xfs_qm_syscalls.c
+++ b/fs/xfs/xfs_qm_syscalls.c
@@ -18,6 +18,8 @@ 
 #include "xfs_quota.h"
 #include "xfs_qm.h"
 #include "xfs_icache.h"
+#include "xfs_log_priv.h"
+#include "xfs_trans_priv.h"
 
 STATIC int	xfs_qm_log_quotaoff(xfs_mount_t *, xfs_qoff_logitem_t **, uint);
 STATIC int	xfs_qm_log_quotaoff_end(xfs_mount_t *, xfs_qoff_logitem_t *,
@@ -556,6 +558,16 @@  xfs_qm_log_quotaoff_end(
 					flags & XFS_ALL_QUOTA_ACCT);
 	xfs_trans_log_quotaoff_item(tp, qoffi);
 
+	/*
+	 * XXX: partly open coded relog of the start item to ensure no relogging
+	 * after this point.
+	 */
+	clear_bit(XFS_LI_RELOG, &startqoff->qql_item.li_flags);
+	if (xfs_cil_relog_steal(mp->m_log, &startqoff->qql_item)) {
+		xfs_trans_add_item(tp, &startqoff->qql_item);
+		xfs_trans_log_quotaoff_item(tp, startqoff);
+	}
+
 	/*
 	 * We have to make sure that the transaction is secure on disk before we
 	 * return and actually stop quota accounting. So, make it synchronous.
@@ -583,6 +595,7 @@  xfs_qm_log_quotaoff(
 		goto out;
 
 	qoffi = xfs_trans_get_qoff_item(tp, NULL, flags & XFS_ALL_QUOTA_ACCT);
+	set_bit(XFS_LI_RELOG, &qoffi->qql_item.li_flags);
 	xfs_trans_log_quotaoff_item(tp, qoffi);
 
 	spin_lock(&mp->m_sb_lock);
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 926f4d10dc02..6fe31a00e362 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1063,6 +1063,8 @@  DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
+DEFINE_LOG_ITEM_EVENT(xfs_cil_relog);
+DEFINE_LOG_ITEM_EVENT(xfs_cil_relog_queue);
 
 DECLARE_EVENT_CLASS(xfs_ail_class,
 	TP_PROTO(struct xfs_log_item *lip, xfs_lsn_t old_lsn, xfs_lsn_t new_lsn),
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index f4795fdb7389..95c74c6de4e2 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -19,6 +19,7 @@ 
 #include "xfs_trace.h"
 #include "xfs_error.h"
 #include "xfs_defer.h"
+#include "xfs_log_priv.h"
 
 kmem_zone_t	*xfs_trans_zone;
 
@@ -863,6 +864,9 @@  xfs_trans_committed_bulk(
 		if (XFS_LSN_CMP(item_lsn, (xfs_lsn_t)-1) == 0)
 			continue;
 
+		if (test_bit(XFS_LI_RELOG, &lip->li_flags))
+			xfs_cil_relog_item(lip->li_mountp->m_log, lip);
+
 		/*
 		 * if we are aborting the operation, no point in inserting the
 		 * object into the AIL as we are in a shutdown situation.
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 64d7f171ebd3..596102405acf 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -59,12 +59,14 @@  struct xfs_log_item {
 #define	XFS_LI_ABORTED	1
 #define	XFS_LI_FAILED	2
 #define	XFS_LI_DIRTY	3	/* log item dirty in transaction */
+#define XFS_LI_RELOG	4	/* relog item on checkpoint commit */
 
 #define XFS_LI_FLAGS \
 	{ (1 << XFS_LI_IN_AIL),		"IN_AIL" }, \
 	{ (1 << XFS_LI_ABORTED),	"ABORTED" }, \
 	{ (1 << XFS_LI_FAILED),		"FAILED" }, \
-	{ (1 << XFS_LI_DIRTY),		"DIRTY" }
+	{ (1 << XFS_LI_DIRTY),		"DIRTY" }, \
+	{ (1 << XFS_LI_RELOG),		"RELOG" }
 
 struct xfs_item_ops {
 	unsigned flags;