diff mbox series

[1/3] xfs: EFI recovery needs it's own transaction reservation

Message ID 20200909081912.1185392-2-david@fromorbit.com (mailing list archive)
State New
Headers show
Series xfs: intent recovery reservation changes | expand

Commit Message

Dave Chinner Sept. 9, 2020, 8:19 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

Recovering an EFI currently uses a itruncate reservation, which is
designed for a rolling transaction that modifies the BMBT and
logs the EFI in one commit, then frees the space and logs the EFD in
the second commit.

Recovering the EFI only requires the second transaction in this
pair, and hence has a smaller log space requirement than a truncate
operation. Hence when the extent free is being processed at runtime,
the log reservation that is held by the filesystem is only enough to
complete the extent free, not the entire truncate operation.

Hence if the EFI pins the tail of the log and the log fills up while
the extent is being freed, the amount of reserved free space in the
log is not enough to start another entire truncate operation. Hence
if we crash at this point, log recovery will deadlock with the EFI
pinning the tail of the log and the log not having enough free space
to reserve an itruncate transaction.

As such, EFI recovery needs it's own log space reservation separate
to the itruncate reservation. We only need what is required free the
extent, and this matches the space we have reserved at runtime for
this operation and hence should prevent the recovery deadlock from
occurring.

This patch adds the new reservation in a way that minimises the
change such that it should be back-portable to older kernels easily.
Follow up patches will factor and rework the reservations to be more
correct and more tightly defined.

Note: this would appear to be a generic problem with intent
recovery; we use the entire operation reservation for recovery,
not the reservation that was held at runtime after the intent was
logged. I suspect all intents are going to require their own
reservation as a result.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_trans_resv.c | 10 ++++++++++
 fs/xfs/libxfs/xfs_trans_resv.h |  2 ++
 fs/xfs/xfs_extfree_item.c      |  2 +-
 3 files changed, 13 insertions(+), 1 deletion(-)

Comments

Brian Foster Sept. 9, 2020, 1:31 p.m. UTC | #1
On Wed, Sep 09, 2020 at 06:19:10PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Recovering an EFI currently uses a itruncate reservation, which is
> designed for a rolling transaction that modifies the BMBT and
> logs the EFI in one commit, then frees the space and logs the EFD in
> the second commit.
> 
> Recovering the EFI only requires the second transaction in this
> pair, and hence has a smaller log space requirement than a truncate
> operation. Hence when the extent free is being processed at runtime,
> the log reservation that is held by the filesystem is only enough to
> complete the extent free, not the entire truncate operation.
> 
> Hence if the EFI pins the tail of the log and the log fills up while
> the extent is being freed, the amount of reserved free space in the
> log is not enough to start another entire truncate operation. Hence
> if we crash at this point, log recovery will deadlock with the EFI
> pinning the tail of the log and the log not having enough free space
> to reserve an itruncate transaction.
> 
> As such, EFI recovery needs it's own log space reservation separate
> to the itruncate reservation. We only need what is required free the
> extent, and this matches the space we have reserved at runtime for
> this operation and hence should prevent the recovery deadlock from
> occurring.
> 
> This patch adds the new reservation in a way that minimises the
> change such that it should be back-portable to older kernels easily.
> Follow up patches will factor and rework the reservations to be more
> correct and more tightly defined.
> 
> Note: this would appear to be a generic problem with intent
> recovery; we use the entire operation reservation for recovery,
> not the reservation that was held at runtime after the intent was
> logged. I suspect all intents are going to require their own
> reservation as a result.
> 

It might be worth explicitly pointing out that support for EFI/EFD
intents goes farther back than the various intents associated with newer
features, hence the value of a targeted fix. Otherwise the problem
description makes sense and approach seems reasonable to me. Thanks for
the writeup. Some questions...

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_trans_resv.c | 10 ++++++++++
>  fs/xfs/libxfs/xfs_trans_resv.h |  2 ++
>  fs/xfs/xfs_extfree_item.c      |  2 +-
>  3 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> index d1a0848cb52e..da2ec052ac0a 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.c
> +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> @@ -916,6 +916,16 @@ xfs_trans_resv_calc(
>  		resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT;
>  	resp->tr_qm_dqalloc.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
>  
> +	/*
> +	 * Log recovery reservations for intent replay
> +	 *
> +	 * EFI recovery is itruncate minus the initial transaction that logs
> +	 * logs the EFI.
> +	 */
> +	resp->tr_efi.tr_logres = resp->tr_itruncate.tr_logres;
> +	resp->tr_efi.tr_logcount = resp->tr_itruncate.tr_logcount - 1;

tr_itruncate.tr_logcount looks like it's either 2 or 8 depending on
whether reflink is enabled. On one hand this seems conservative enough,
but do we know exactly what those extra unit counts are accounted for in
the reflink case? It looks like extents are only freed when the last
reference is dropped (otherwise we log a refcount intent), which makes
me wonder whether we really need 7 log count units if recovery
encounters an EFI.

Also, while looking through the code I noticed that truncate does the
following:

		...
                error = xfs_defer_finish(&tp);
                if (error)
                        goto out;

                error = xfs_trans_roll_inode(&tp, ip);
                if (error)
                        goto out;
		...

... which looks like it rolls the transaction an extra time per-extent.
I don't think that contributes to this problem vs just being a runtime
inefficiency, so maybe I'll fling a patch up for that separately.

> +	resp->tr_efi.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
> +
>  	/*
>  	 * The following transactions are logged in logical format with
>  	 * a default log count.
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
> index 7241ab28cf84..13173b3eaac9 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.h
> +++ b/fs/xfs/libxfs/xfs_trans_resv.h
> @@ -50,6 +50,8 @@ struct xfs_trans_resv {
>  	struct xfs_trans_res	tr_qm_equotaoff;/* end of turn quota off */
>  	struct xfs_trans_res	tr_sb;		/* modify superblock */
>  	struct xfs_trans_res	tr_fsyncts;	/* update timestamps on fsync */
> +	struct xfs_trans_res	tr_efi;		/* EFI log item recovery */
> +

Extra whitespace line.

Brian

>  };
>  
>  /* shorthand way of accessing reservation structure */
> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> index 6cb8cd11072a..1ea9ab4cd44e 100644
> --- a/fs/xfs/xfs_extfree_item.c
> +++ b/fs/xfs/xfs_extfree_item.c
> @@ -618,7 +618,7 @@ xfs_efi_item_recover(
>  		}
>  	}
>  
> -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_efi, 0, 0, 0, &tp);
>  	if (error)
>  		return error;
>  	efdp = xfs_trans_get_efd(tp, efip, efip->efi_format.efi_nextents);
> -- 
> 2.28.0
>
Dave Chinner Sept. 9, 2020, 9:44 p.m. UTC | #2
On Wed, Sep 09, 2020 at 09:31:11AM -0400, Brian Foster wrote:
> On Wed, Sep 09, 2020 at 06:19:10PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Recovering an EFI currently uses a itruncate reservation, which is
> > designed for a rolling transaction that modifies the BMBT and
> > logs the EFI in one commit, then frees the space and logs the EFD in
> > the second commit.
> > 
> > Recovering the EFI only requires the second transaction in this
> > pair, and hence has a smaller log space requirement than a truncate
> > operation. Hence when the extent free is being processed at runtime,
> > the log reservation that is held by the filesystem is only enough to
> > complete the extent free, not the entire truncate operation.
> > 
> > Hence if the EFI pins the tail of the log and the log fills up while
> > the extent is being freed, the amount of reserved free space in the
> > log is not enough to start another entire truncate operation. Hence
> > if we crash at this point, log recovery will deadlock with the EFI
> > pinning the tail of the log and the log not having enough free space
> > to reserve an itruncate transaction.
> > 
> > As such, EFI recovery needs it's own log space reservation separate
> > to the itruncate reservation. We only need what is required free the
> > extent, and this matches the space we have reserved at runtime for
> > this operation and hence should prevent the recovery deadlock from
> > occurring.
> > 
> > This patch adds the new reservation in a way that minimises the
> > change such that it should be back-portable to older kernels easily.
> > Follow up patches will factor and rework the reservations to be more
> > correct and more tightly defined.
> > 
> > Note: this would appear to be a generic problem with intent
> > recovery; we use the entire operation reservation for recovery,
> > not the reservation that was held at runtime after the intent was
> > logged. I suspect all intents are going to require their own
> > reservation as a result.
> > 
> 
> It might be worth explicitly pointing out that support for EFI/EFD
> intents goes farther back than the various intents associated with newer
> features, hence the value of a targeted fix.

Ok.

> > @@ -916,6 +916,16 @@ xfs_trans_resv_calc(
> >  		resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT;
> >  	resp->tr_qm_dqalloc.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
> >  
> > +	/*
> > +	 * Log recovery reservations for intent replay
> > +	 *
> > +	 * EFI recovery is itruncate minus the initial transaction that logs
> > +	 * logs the EFI.
> > +	 */
> > +	resp->tr_efi.tr_logres = resp->tr_itruncate.tr_logres;
> > +	resp->tr_efi.tr_logcount = resp->tr_itruncate.tr_logcount - 1;
> 
> tr_itruncate.tr_logcount looks like it's either 2 or 8 depending on
> whether reflink is enabled. On one hand this seems conservative enough,
> but do we know exactly what those extra unit counts are accounted for in
> the reflink case?

Right, in the reflink case we may have to roll the transaction many more
times to do the refcount btree and reverse map btree modifications.
Those are done under separate intents and so we have to roll and
commit the defered ops more times on a reflink/rmap based
filesystem. Hence the logcount is higher so that the initial
reservation can roll more times before regrant during a roll has to
go and physically reserve more write space in the log to continue
rolling the transaction.

> It looks like extents are only freed when the last
> reference is dropped (otherwise we log a refcount intent), which makes
> me wonder whether we really need 7 log count units if recovery
> encounters an EFI.

I don't know if the numbers are correct, and it really is out of
scope for this patch to audit/fix that. I really think we need to
map this whole thing out in a diagram at this point because I now
suspect that the allocfree log count calculation is not correct,
either...

> Also, while looking through the code I noticed that truncate does the
> following:
> 
> 		...
>                 error = xfs_defer_finish(&tp);
>                 if (error)
>                         goto out;
> 
>                 error = xfs_trans_roll_inode(&tp, ip);
>                 if (error)
>                         goto out;
> 		...
> 
> ... which looks like it rolls the transaction an extra time per-extent.
> I don't think that contributes to this problem vs just being a runtime
> inefficiency, so maybe I'll fling a patch up for that separately.

Yeah, I'm not sure if this is correct/needed or not. 

> >  	 * The following transactions are logged in logical format with
> >  	 * a default log count.
> > diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
> > index 7241ab28cf84..13173b3eaac9 100644
> > --- a/fs/xfs/libxfs/xfs_trans_resv.h
> > +++ b/fs/xfs/libxfs/xfs_trans_resv.h
> > @@ -50,6 +50,8 @@ struct xfs_trans_resv {
> >  	struct xfs_trans_res	tr_qm_equotaoff;/* end of turn quota off */
> >  	struct xfs_trans_res	tr_sb;		/* modify superblock */
> >  	struct xfs_trans_res	tr_fsyncts;	/* update timestamps on fsync */
> > +	struct xfs_trans_res	tr_efi;		/* EFI log item recovery */
> > +
> 
> Extra whitespace line.

Will fix.

Thanks!

-Dave.
Brian Foster Sept. 10, 2020, 1:18 p.m. UTC | #3
On Thu, Sep 10, 2020 at 07:44:55AM +1000, Dave Chinner wrote:
> On Wed, Sep 09, 2020 at 09:31:11AM -0400, Brian Foster wrote:
> > On Wed, Sep 09, 2020 at 06:19:10PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Recovering an EFI currently uses a itruncate reservation, which is
> > > designed for a rolling transaction that modifies the BMBT and
> > > logs the EFI in one commit, then frees the space and logs the EFD in
> > > the second commit.
> > > 
> > > Recovering the EFI only requires the second transaction in this
> > > pair, and hence has a smaller log space requirement than a truncate
> > > operation. Hence when the extent free is being processed at runtime,
> > > the log reservation that is held by the filesystem is only enough to
> > > complete the extent free, not the entire truncate operation.
> > > 
> > > Hence if the EFI pins the tail of the log and the log fills up while
> > > the extent is being freed, the amount of reserved free space in the
> > > log is not enough to start another entire truncate operation. Hence
> > > if we crash at this point, log recovery will deadlock with the EFI
> > > pinning the tail of the log and the log not having enough free space
> > > to reserve an itruncate transaction.
> > > 
> > > As such, EFI recovery needs it's own log space reservation separate
> > > to the itruncate reservation. We only need what is required free the
> > > extent, and this matches the space we have reserved at runtime for
> > > this operation and hence should prevent the recovery deadlock from
> > > occurring.
> > > 
> > > This patch adds the new reservation in a way that minimises the
> > > change such that it should be back-portable to older kernels easily.
> > > Follow up patches will factor and rework the reservations to be more
> > > correct and more tightly defined.
> > > 
> > > Note: this would appear to be a generic problem with intent
> > > recovery; we use the entire operation reservation for recovery,
> > > not the reservation that was held at runtime after the intent was
> > > logged. I suspect all intents are going to require their own
> > > reservation as a result.
> > > 
> > 
> > It might be worth explicitly pointing out that support for EFI/EFD
> > intents goes farther back than the various intents associated with newer
> > features, hence the value of a targeted fix.
> 
> Ok.
> 
> > > @@ -916,6 +916,16 @@ xfs_trans_resv_calc(
> > >  		resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT;
> > >  	resp->tr_qm_dqalloc.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
> > >  
> > > +	/*
> > > +	 * Log recovery reservations for intent replay
> > > +	 *
> > > +	 * EFI recovery is itruncate minus the initial transaction that logs
> > > +	 * logs the EFI.
> > > +	 */
> > > +	resp->tr_efi.tr_logres = resp->tr_itruncate.tr_logres;
> > > +	resp->tr_efi.tr_logcount = resp->tr_itruncate.tr_logcount - 1;
> > 
> > tr_itruncate.tr_logcount looks like it's either 2 or 8 depending on
> > whether reflink is enabled. On one hand this seems conservative enough,
> > but do we know exactly what those extra unit counts are accounted for in
> > the reflink case?
> 
> Right, in the reflink case we may have to roll the transaction many more
> times to do the refcount btree and reverse map btree modifications.
> Those are done under separate intents and so we have to roll and
> commit the defered ops more times on a reflink/rmap based
> filesystem. Hence the logcount is higher so that the initial
> reservation can roll more times before regrant during a roll has to
> go and physically reserve more write space in the log to continue
> rolling the transaction.
> 
> > It looks like extents are only freed when the last
> > reference is dropped (otherwise we log a refcount intent), which makes
> > me wonder whether we really need 7 log count units if recovery
> > encounters an EFI.
> 
> I don't know if the numbers are correct, and it really is out of
> scope for this patch to audit/fix that. I really think we need to
> map this whole thing out in a diagram at this point because I now
> suspect that the allocfree log count calculation is not correct,
> either...
> 

I agree up to the point where it relates to this specific EFI recovery
issue. reflink is enabled by default, which means the default EFI
recovery reservation is going to have 7 logcount units. Is that actually
enough of a reduction to prevent this same recovery problem on newer
fs'? I'm wondering if the tr_efi logcount should just be set to 1, for
example, at least for the short term fix.

Brian

> > Also, while looking through the code I noticed that truncate does the
> > following:
> > 
> > 		...
> >                 error = xfs_defer_finish(&tp);
> >                 if (error)
> >                         goto out;
> > 
> >                 error = xfs_trans_roll_inode(&tp, ip);
> >                 if (error)
> >                         goto out;
> > 		...
> > 
> > ... which looks like it rolls the transaction an extra time per-extent.
> > I don't think that contributes to this problem vs just being a runtime
> > inefficiency, so maybe I'll fling a patch up for that separately.
> 
> Yeah, I'm not sure if this is correct/needed or not. 
> 
> > >  	 * The following transactions are logged in logical format with
> > >  	 * a default log count.
> > > diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
> > > index 7241ab28cf84..13173b3eaac9 100644
> > > --- a/fs/xfs/libxfs/xfs_trans_resv.h
> > > +++ b/fs/xfs/libxfs/xfs_trans_resv.h
> > > @@ -50,6 +50,8 @@ struct xfs_trans_resv {
> > >  	struct xfs_trans_res	tr_qm_equotaoff;/* end of turn quota off */
> > >  	struct xfs_trans_res	tr_sb;		/* modify superblock */
> > >  	struct xfs_trans_res	tr_fsyncts;	/* update timestamps on fsync */
> > > +	struct xfs_trans_res	tr_efi;		/* EFI log item recovery */
> > > +
> > 
> > Extra whitespace line.
> 
> Will fix.
> 
> Thanks!
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner Sept. 10, 2020, 9:29 p.m. UTC | #4
On Thu, Sep 10, 2020 at 09:18:10AM -0400, Brian Foster wrote:
> On Thu, Sep 10, 2020 at 07:44:55AM +1000, Dave Chinner wrote:
> > On Wed, Sep 09, 2020 at 09:31:11AM -0400, Brian Foster wrote:
> > > It looks like extents are only freed when the last
> > > reference is dropped (otherwise we log a refcount intent), which makes
> > > me wonder whether we really need 7 log count units if recovery
> > > encounters an EFI.
> > 
> > I don't know if the numbers are correct, and it really is out of
> > scope for this patch to audit/fix that. I really think we need to
> > map this whole thing out in a diagram at this point because I now
> > suspect that the allocfree log count calculation is not correct,
> > either...
> 
> I agree up to the point where it relates to this specific EFI recovery
> issue. reflink is enabled by default, which means the default EFI
> recovery reservation is going to have 7 logcount units. Is that actually
> enough of a reduction to prevent this same recovery problem on newer
> fs'? I'm wondering if the tr_efi logcount should just be set to 1, for
> example, at least for the short term fix.

I spent yesterday mapping out the whole "free extent" chain to
understand exactly what was necessary, and in discussing this with
Darrick on #xfs I came to the conclusion that we cannot -ever- have
a logcount of more than 1 for an intent recovery of any sort.

That's because the transaction may have rolled enough times to
exhaust the initial grant of unit * logcount, so the only
reservation that the runtime kernel has when it crashs is a single
transaction unit reservation (which gets re-reserved on every roll).

Hence recovery cannot assume that intent that was being processed has more
than a single unit reservation of log space available to be used,
and hence all intent recovery reservations must start with a log
count of 1.

THere are other restrictions we need to deal with, too. multiple
intents may pin the tail of the log, so we can't just process a
single intent chain at a time as that will result in using all the
log space for a single intent chain and reservation deadlocking on
one of the intents we haven't yet started.

Hence the first thing we have to do in recovering intents is take an
active reservation for -all- intents in the log to match the
reservation state at runtime. Only then can we guarantee that
intents that pin the tail of the log will have the reservation space
needed to be able to unpin the log tail.

Further, because we now may have consumed all the available log
reservation space to guarantee forwards progress, the first commit
of the first intent may block trying to regrant space if another
intent also pins the tail of the log. This means we cannot replay
intents in a serial manner as we currently do. Each intent chain and it's
transaction context needs to run in it's own process context so it
can block waiting for other intents to make progress, just like
happens at runtime when the system crashed. IOWs, intent replay
needs to be done with a task context per intent chain (probably via
a workqueue).

AFAICT, this is the only way we can make intent recovery deadlock
free - we have to recreate the complete log reservation state in
memory before we start recovery of a single intent, we can only
assume an intent had a single unit reservation of log space assigned
to it, and intent chain execution needs to run concurrently so
commits can block waiting for log space to become available as other
intents commit and unpin the tail of the log...

Cheers,

Dave.
Brian Foster Sept. 11, 2020, 12:37 p.m. UTC | #5
On Fri, Sep 11, 2020 at 07:29:27AM +1000, Dave Chinner wrote:
> On Thu, Sep 10, 2020 at 09:18:10AM -0400, Brian Foster wrote:
> > On Thu, Sep 10, 2020 at 07:44:55AM +1000, Dave Chinner wrote:
> > > On Wed, Sep 09, 2020 at 09:31:11AM -0400, Brian Foster wrote:
> > > > It looks like extents are only freed when the last
> > > > reference is dropped (otherwise we log a refcount intent), which makes
> > > > me wonder whether we really need 7 log count units if recovery
> > > > encounters an EFI.
> > > 
> > > I don't know if the numbers are correct, and it really is out of
> > > scope for this patch to audit/fix that. I really think we need to
> > > map this whole thing out in a diagram at this point because I now
> > > suspect that the allocfree log count calculation is not correct,
> > > either...
> > 
> > I agree up to the point where it relates to this specific EFI recovery
> > issue. reflink is enabled by default, which means the default EFI
> > recovery reservation is going to have 7 logcount units. Is that actually
> > enough of a reduction to prevent this same recovery problem on newer
> > fs'? I'm wondering if the tr_efi logcount should just be set to 1, for
> > example, at least for the short term fix.
> 
> I spent yesterday mapping out the whole "free extent" chain to
> understand exactly what was necessary, and in discussing this with
> Darrick on #xfs I came to the conclusion that we cannot -ever- have
> a logcount of more than 1 for an intent recovery of any sort.
> 
> That's because the transaction may have rolled enough times to
> exhaust the initial grant of unit * logcount, so the only
> reservation that the runtime kernel has when it crashs is a single
> transaction unit reservation (which gets re-reserved on every roll).
> 
> Hence recovery cannot assume that intent that was being processed has more
> than a single unit reservation of log space available to be used,
> and hence all intent recovery reservations must start with a log
> count of 1.
> 

Makes sense.

> THere are other restrictions we need to deal with, too. multiple
> intents may pin the tail of the log, so we can't just process a
> single intent chain at a time as that will result in using all the
> log space for a single intent chain and reservation deadlocking on
> one of the intents we haven't yet started.
> 
> Hence the first thing we have to do in recovering intents is take an
> active reservation for -all- intents in the log to match the
> reservation state at runtime. Only then can we guarantee that
> intents that pin the tail of the log will have the reservation space
> needed to be able to unpin the log tail.
> 

Which brings up the ordering question when multiple intents pin the
tail...

> Further, because we now may have consumed all the available log
> reservation space to guarantee forwards progress, the first commit
> of the first intent may block trying to regrant space if another
> intent also pins the tail of the log. This means we cannot replay
> intents in a serial manner as we currently do. Each intent chain and it's
> transaction context needs to run in it's own process context so it
> can block waiting for other intents to make progress, just like
> happens at runtime when the system crashed. IOWs, intent replay
> needs to be done with a task context per intent chain (probably via
> a workqueue).
> 

Indeed.

> AFAICT, this is the only way we can make intent recovery deadlock
> free - we have to recreate the complete log reservation state in
> memory before we start recovery of a single intent, we can only
> assume an intent had a single unit reservation of log space assigned
> to it, and intent chain execution needs to run concurrently so
> commits can block waiting for log space to become available as other
> intents commit and unpin the tail of the log...
> 

Ok, so the primary recovery task would either need to acquire and
distribute the initial reservation or otherwise implement some kind of
execution barrier for the intent recovery wq tasks to ensure that every
currently log resident intent acquires a transaction before any other is
allowed to roll. That doesn't seem _too_ limiting a restriction given
that it technically should allow many intents (that don't result in
intent chains) to execute to completion before waiting on others at all,
though it does sound like a complete rework of the current
post-recovery, single threaded intent recovery approach.

I'm kind of wondering if we had some mechanism to freeze/unfreeze all
transaction rolls, if we could then rework recovery to issue intent
recovery wq tasks during pass2 as log records are written back to the
fs. For example, with transaction rolls frozen, pass2 of recovery
completes a particular log record and on I/O completion of all
associated buffers, kicks off workqueue tasks for each unfinished intent
in that particular record. The workqueue tasks will either complete and
exit or unconditionally block on the next transaction roll (regrant).
The main recovery task carries on and repeats for all subsequent records
in the log, unfreezes regrants and waits for all outstanding
tasks/transactions to (hopefully) complete. Hm?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
diff mbox series

Patch

diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index d1a0848cb52e..da2ec052ac0a 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -916,6 +916,16 @@  xfs_trans_resv_calc(
 		resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT;
 	resp->tr_qm_dqalloc.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
 
+	/*
+	 * Log recovery reservations for intent replay
+	 *
+	 * EFI recovery is itruncate minus the initial transaction that logs
+	 * logs the EFI.
+	 */
+	resp->tr_efi.tr_logres = resp->tr_itruncate.tr_logres;
+	resp->tr_efi.tr_logcount = resp->tr_itruncate.tr_logcount - 1;
+	resp->tr_efi.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
+
 	/*
 	 * The following transactions are logged in logical format with
 	 * a default log count.
diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
index 7241ab28cf84..13173b3eaac9 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.h
+++ b/fs/xfs/libxfs/xfs_trans_resv.h
@@ -50,6 +50,8 @@  struct xfs_trans_resv {
 	struct xfs_trans_res	tr_qm_equotaoff;/* end of turn quota off */
 	struct xfs_trans_res	tr_sb;		/* modify superblock */
 	struct xfs_trans_res	tr_fsyncts;	/* update timestamps on fsync */
+	struct xfs_trans_res	tr_efi;		/* EFI log item recovery */
+
 };
 
 /* shorthand way of accessing reservation structure */
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index 6cb8cd11072a..1ea9ab4cd44e 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -618,7 +618,7 @@  xfs_efi_item_recover(
 		}
 	}
 
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_efi, 0, 0, 0, &tp);
 	if (error)
 		return error;
 	efdp = xfs_trans_get_efd(tp, efip, efip->efi_format.efi_nextents);