diff mbox

[4/4] xfs: account only rmapbt-used blocks against rmapbt perag res

Message ID 20180205174601.51574-5-bfoster@redhat.com (mailing list archive)
State Accepted
Headers show

Commit Message

Brian Foster Feb. 5, 2018, 5:46 p.m. UTC
The rmapbt perag metadata reservation reserves blocks for the
reverse mapping btree (rmapbt). Since the rmapbt uses blocks from
the agfl and perag accounting is updated as blocks are allocated
from the allocation btrees, the reservation actually accounts blocks
as they are allocated to (or freed from) the agfl rather than the
rmapbt itself.

While this works for blocks that are eventually used for the rmapbt,
not all agfl blocks are destined for the rmapbt. Blocks that are
allocated to the agfl (and thus "reserved" for the rmapbt) but then
used by another structure lead to a growing inconsistency over time
between the runtime tracking of rmapbt usage and actual rmapbt
usage. Since the runtime tracking treats all agfl blocks as rmapbt
blocks, it believes that less future reservation is required to
satisfy the rmapbt than is actually necessary.

The inconsistency is rectified across mount cycles because the perag
reservation is initialized based on the actual rmapbt usage at mount
time. The problem, however, is that the excessive drain of the
reservation at runtime opens a window to allocate blocks for other
purposes that might be required for the rmapbt on a subsequent
mount. This problem can be demonstrated by a simple test that runs
an allocation workload to consume agfl blocks over time and then
observes the difference in the agfl reservation requirement across an
unmount/mount cycle:

  mount ...: xfs_ag_resv_init: ... resv 3193 ask 3194 len 3194
  ...
  ...      : xfs_ag_resv_alloc_extent: ... resv 2957 ask 3194 len 1
  umount...: xfs_ag_resv_free: ... resv 2956 ask 3194 len 0
  mount ...: xfs_ag_resv_init: ... resv 3052 ask 3194 len 3194

As the above tracepoints show, the reservation requirement reduces
from 3194 blocks to 2956 blocks as the workload runs.  Without any
other changes in the filesystem, the same reservation requirement
jumps from 2956 to 3052 blocks over a umount/mount cycle.

To address this divergence, update the RMAPBT reservation to account
only blocks used for the rmapbt rather than all blocks filled into
the agfl. This patch makes two high-level changes toward that
end:

1.) Reintroduce an AGFL reservation type to serve as an accounting
    no-op for blocks allocated to (or freed from) the AGFL.
2.) Invoke RMAPBT usage accounting from the actual rmapbt block
    allocation path rather than the AGFL allocation path.

The first change is required because agfl blocks are considered free
blocks throughout their lifetime. The perag reservation subsystem is
invoked unconditionally by the allocation subsystem, so we need a
way to tell the perag subsystem (via the allocation subsystem) to
not make any accounting changes for blocks filled into the AGFL.

The second change causes the in-core RMAPBT reservation usage
accounting to remain consistent with the on-disk state at all times
and eliminates the risk of leaving the rmapbt reservation
underfilled.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_ag_resv.c    |  4 ++++
 fs/xfs/libxfs/xfs_ag_resv.h    | 31 +++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_alloc.c      | 18 ++++++------------
 fs/xfs/libxfs/xfs_rmap_btree.c |  4 ++++
 fs/xfs/xfs_mount.h             |  1 +
 5 files changed, 46 insertions(+), 12 deletions(-)

Comments

Darrick J. Wong Feb. 7, 2018, 12:03 a.m. UTC | #1
On Mon, Feb 05, 2018 at 12:46:01PM -0500, Brian Foster wrote:
> [...]

Sorry I haven't gotten to this in a couple of days.  This looks like a
reasonable enough fix to the complex perag machinery, though Dave and I
were batting around an idea last week -- what if instead of all these
reservation types and tracking we simply maintained a per-ag free space
reservation of a couple megabytes, handed out blocks if we got
desperate, and replenished that res when we had a good opportunity?

I /think/ the answer to that question is that we need to try to reserve
enough blocks to satisfy the maximum size of the btree (if there is
one), not just a fixed-size delayed free space.  We already hand out
blocks if we get desperate, and we replenish the reservation when we
can.  Lastly, we need the reservation type tracking so that we avoid
things like the refcount btree stealing blocks from the agfl in
desperation.  I've also been wondering if each reservation should get
its own type so that we track refcountbt and finobt blocks separately.

Otherwise I think this series looks ok, but I'll go over it in more
detail tomorrow.

--D

Brian Foster Feb. 7, 2018, 2:49 p.m. UTC | #2
On Tue, Feb 06, 2018 at 04:03:50PM -0800, Darrick J. Wong wrote:
> On Mon, Feb 05, 2018 at 12:46:01PM -0500, Brian Foster wrote:
> > [...]
> 
> Sorry I haven't gotten to this in a couple of days.  This looks like a
> reasonable enough fix to the complex perag machinery, though Dave and I
> were batting around an idea last week -- what if instead of all these
> reservation types and tracking we simply maintained a per-ag free space
> reservation of a couple megabytes, handed out blocks if we got
> desperate, and replenished that res when we had a good opportunity?
> 

Firstly, I essentially view this patch as a bug fix vs. some kind of
larger rework. I'm hoping the split of the rename bits helps emphasize
that... It's really just a change to accurately account rmapbt block
consumption at runtime.

I don't have much context on the broader question you've called out...
I assume there was a larger requirement (associated with
reflink+rmapbt?) that necessitated this kind of worst case reservation
in the first place..? We've since applied it to things like the finobt
(which I'm still not totally convinced was the right thing to do based
on the vague justification for it), which kind of blurs the line between
where it's a requirement vs. nice-to-have/band-aid for me.

> I /think/ the answer to that question is that we need to try to reserve
> enough blocks to satisfy the maximum size of the btree (if there is
> one), not just a fixed-size delayed free space.  We already hand out
> blocks if we get desperate, and we replenish the reservation when we
> can.  Lastly, we need the reservation type tracking so that we avoid
> things like the refcount btree stealing blocks from the agfl in
> desperation.  I've also been wondering if each reservation should get
> its own type so that we track refcountbt and finobt blocks separately.
> 

In principle, I'm not really for or against the existing mechanism,
further granularizing it or changing it out for something else entirely.
I'm confused about what exactly we're trying to fix in the first place,
though. Is there a larger fundamental problem at play here? Is the
reservation (or complexity of the mechanism) simply overkill? Something
else..?

> Otherwise I think this series looks ok, but I'll go over it in more
> detail tomorrow.
> 

Thanks. No rush..

Brian

--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Darrick J. Wong Feb. 8, 2018, 2:20 a.m. UTC | #3
On Wed, Feb 07, 2018 at 09:49:35AM -0500, Brian Foster wrote:
> On Tue, Feb 06, 2018 at 04:03:50PM -0800, Darrick J. Wong wrote:
> > On Mon, Feb 05, 2018 at 12:46:01PM -0500, Brian Foster wrote:
> > > [...]
> > 
> > Sorry I haven't gotten to this in a couple of days.  This looks like a
> > reasonable enough fix to the complex perag machinery, though Dave and I
> > were batting around an idea last week -- what if instead of all these
> > reservation types and tracking we simply maintained a per-ag free space
> > reservation of a couple megabytes, handed out blocks if we got
> > desperate, and replenished that res when we had a good opportunity?
> > 
> 
> Firstly, I essentially view this patch as a bug fix vs. some kind of
> larger rework. I'm hoping the split of the rename bits helps emphasize
> that... It's really just a change to accurately account rmapbt block
> consumption at runtime.
> 
> I don't have much context on the broader question you've called out...
> I assume there was a larger requirement (associated with
> reflink+rmapbt?) that necessitated this kind of worst case reservation
> in the first place..?

Without reflink, we protect fs writes from hitting ENOSPC midway through
a transaction by reserving all the blocks we think we might need ahead
of time.  Making the reservation at the fs level works because any
blocks we might have to allocate to handle the write can be in any AG,
so if a given AG is out of space we just move on to the next one.

However, once copy-on-write enters the picture, that last assumption
is no longer valid because we have to decrease the reference count in
the specific AG that has the shared blocks.  We now have a requirement
that said AG also has enough space to handle any refcountbt splits that
might happen in the process of decreasing the refcount, and that's where
we get into trouble.  The FS may have plenty of space (so the
reservation goes through) but the AG may be out of space, in which case
the refcount update fails and the fs shuts down.

Therefore, we reserve some free space (by hiding it from the incore
metadata) and every time we need a block for the refcountbt we get one
from the reservation, hence we can never run out.

> We've since applied it to things like the finobt (which I'm still not
> totally convinced was the right thing to do based on the vague
> justification for it), which kind of blurs the line between where it's
> a requirement vs. nice-to-have/band-aid for me.

I think the finobt reservation is required: Suppose you have a
filesystem with a lot of empty files, a lot of single-block files, and a
lot of big files such that there's no free space anywhere.  Suppose
further that there's an AG where every finobt block is exactly full,
there's an inode chunk with exactly 64 inodes in use, and every block in
that AG is in use (meaning zero AGFL blocks).  Now find one of the empty
inodes in that totally-full chunk and try to free it.  Truncation
doesn't free up any blocks, but we have to expand the finobt to add the
record for the chunk.  We can't find any blocks in that AG so we shut
down.

> > I /think/ the answer to that question is that we need to try to reserve
> > enough blocks to satisfy the maximum size of the btree (if there is
> > one), not just a fixed-size delayed free space.  We already hand out
> > blocks if we get desperate, and we replenish the reservation when we
> > can.  Lastly, we need the reservation type tracking so that we avoid
> > things like the refcount btree stealing blocks from the agfl in
> > desperation.  I've also been wondering if each reservation should get
> > its own type so that we track refcountbt and finobt blocks separately.
> > 
> 
> In principle, I'm not really for or against the existing mechanism,
> further granularizing it or changing it out for something else entirely.
> I'm confused about what exactly we're trying to fix in the first place,
> though. Is there a larger fundamental problem at play here? Is the
> reservation (or complexity of the mechanism) simply overkill? Something
> else..?

Mostly my insecurity over "Was this the right thing for the job?  What
even /was/ the job?" :)

--D

> > Otherwise I think this series looks ok, but I'll go over it in more
> > detail tomorrow.
> > 
> 
> Thanks. No rush..
> 
> Brian
> 
> > --D
> > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_ag_resv.c    |  4 ++++
> > >  fs/xfs/libxfs/xfs_ag_resv.h    | 31 +++++++++++++++++++++++++++++++
> > >  fs/xfs/libxfs/xfs_alloc.c      | 18 ++++++------------
> > >  fs/xfs/libxfs/xfs_rmap_btree.c |  4 ++++
> > >  fs/xfs/xfs_mount.h             |  1 +
> > >  5 files changed, 46 insertions(+), 12 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
> > > index 0ca2e680034a..03885a968de8 100644
> > > --- a/fs/xfs/libxfs/xfs_ag_resv.c
> > > +++ b/fs/xfs/libxfs/xfs_ag_resv.c
> > > @@ -326,6 +326,8 @@ xfs_ag_resv_alloc_extent(
> > >  	trace_xfs_ag_resv_alloc_extent(pag, type, args->len);
> > >  
> > >  	switch (type) {
> > > +	case XFS_AG_RESV_AGFL:
> > > +		return;
> > >  	case XFS_AG_RESV_METADATA:
> > >  	case XFS_AG_RESV_RMAPBT:
> > >  		resv = xfs_perag_resv(pag, type);
> > > @@ -366,6 +368,8 @@ xfs_ag_resv_free_extent(
> > >  	trace_xfs_ag_resv_free_extent(pag, type, len);
> > >  
> > >  	switch (type) {
> > > +	case XFS_AG_RESV_AGFL:
> > > +		return;
> > >  	case XFS_AG_RESV_METADATA:
> > >  	case XFS_AG_RESV_RMAPBT:
> > >  		resv = xfs_perag_resv(pag, type);
> > > diff --git a/fs/xfs/libxfs/xfs_ag_resv.h b/fs/xfs/libxfs/xfs_ag_resv.h
> > > index 8d6c687deef3..938f2f96c5e8 100644
> > > --- a/fs/xfs/libxfs/xfs_ag_resv.h
> > > +++ b/fs/xfs/libxfs/xfs_ag_resv.h
> > > @@ -32,4 +32,35 @@ void xfs_ag_resv_alloc_extent(struct xfs_perag *pag, enum xfs_ag_resv_type type,
> > >  void xfs_ag_resv_free_extent(struct xfs_perag *pag, enum xfs_ag_resv_type type,
> > >  		struct xfs_trans *tp, xfs_extlen_t len);
> > >  
> > > +/*
> > > + * RMAPBT reservation accounting wrappers. Since rmapbt blocks are sourced from
> > > + * the AGFL, they are allocated one at a time and the reservation updates don't
> > > + * require a transaction.
> > > + */
> > > +static inline void
> > > +xfs_ag_resv_rmapbt_alloc(
> > > +	struct xfs_mount	*mp,
> > > +	xfs_agnumber_t		agno)
> > > +{
> > > +	struct xfs_alloc_arg	args = {0};
> > > +	struct xfs_perag	*pag;
> > > +
> > > +	args.len = 1;
> > > +	pag = xfs_perag_get(mp, agno);
> > > +	xfs_ag_resv_alloc_extent(pag, XFS_AG_RESV_RMAPBT, &args);
> > > +	xfs_perag_put(pag);
> > > +}
> > > +
> > > +static inline void
> > > +xfs_ag_resv_rmapbt_free(
> > > +	struct xfs_mount	*mp,
> > > +	xfs_agnumber_t		agno)
> > > +{
> > > +	struct xfs_perag	*pag;
> > > +
> > > +	pag = xfs_perag_get(mp, agno);
> > > +	xfs_ag_resv_free_extent(pag, XFS_AG_RESV_RMAPBT, NULL, 1);
> > > +	xfs_perag_put(pag);
> > > +}
> > > +
> > >  #endif	/* __XFS_AG_RESV_H__ */
> > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > > index e53d55453dc5..8c401efb2d74 100644
> > > --- a/fs/xfs/libxfs/xfs_alloc.c
> > > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > > @@ -711,7 +711,7 @@ xfs_alloc_ag_vextent(
> > >  
> > >  	ASSERT(args->len >= args->minlen);
> > >  	ASSERT(args->len <= args->maxlen);
> > > -	ASSERT(!args->wasfromfl || args->resv != XFS_AG_RESV_RMAPBT);
> > > +	ASSERT(!args->wasfromfl || args->resv != XFS_AG_RESV_AGFL);
> > >  	ASSERT(args->agbno % args->alignment == 0);
> > >  
> > >  	/* if not file data, insert new block into the reverse map btree */
> > > @@ -1564,7 +1564,6 @@ xfs_alloc_ag_vextent_small(
> > >  	int		*stat)	/* status: 0-freelist, 1-normal/none */
> > >  {
> > >  	struct xfs_owner_info	oinfo;
> > > -	struct xfs_perag	*pag;
> > >  	int		error;
> > >  	xfs_agblock_t	fbno;
> > >  	xfs_extlen_t	flen;
> > > @@ -1583,7 +1582,7 @@ xfs_alloc_ag_vextent_small(
> > >  	 * freelist.
> > >  	 */
> > >  	else if (args->minlen == 1 && args->alignment == 1 &&
> > > -		 args->resv != XFS_AG_RESV_RMAPBT &&
> > > +		 args->resv != XFS_AG_RESV_AGFL &&
> > >  		 (be32_to_cpu(XFS_BUF_TO_AGF(args->agbp)->agf_flcount)
> > >  		  > args->minleft)) {
> > >  		error = xfs_alloc_get_freelist(args->tp, args->agbp, &fbno, 0);
> > > @@ -1616,18 +1615,13 @@ xfs_alloc_ag_vextent_small(
> > >  			/*
> > >  			 * If we're feeding an AGFL block to something that
> > >  			 * doesn't live in the free space, we need to clear
> > > -			 * out the OWN_AG rmap and add the block back to
> > > -			 * the RMAPBT per-AG reservation.
> > > +			 * out the OWN_AG rmap.
> > >  			 */
> > >  			xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> > >  			error = xfs_rmap_free(args->tp, args->agbp, args->agno,
> > >  					fbno, 1, &oinfo);
> > >  			if (error)
> > >  				goto error0;
> > > -			pag = xfs_perag_get(args->mp, args->agno);
> > > -			xfs_ag_resv_free_extent(pag, XFS_AG_RESV_RMAPBT,
> > > -					args->tp, 1);
> > > -			xfs_perag_put(pag);
> > >  
> > >  			*stat = 0;
> > >  			return 0;
> > > @@ -2153,7 +2147,7 @@ xfs_alloc_fix_freelist(
> > >  		if (error)
> > >  			goto out_agbp_relse;
> > >  		error = xfs_free_ag_extent(tp, agbp, args->agno, bno, 1,
> > > -					   &targs.oinfo, XFS_AG_RESV_RMAPBT);
> > > +					   &targs.oinfo, XFS_AG_RESV_AGFL);
> > >  		if (error)
> > >  			goto out_agbp_relse;
> > >  		bp = xfs_btree_get_bufs(mp, tp, args->agno, bno, 0);
> > > @@ -2179,7 +2173,7 @@ xfs_alloc_fix_freelist(
> > >  	while (pag->pagf_flcount < need) {
> > >  		targs.agbno = 0;
> > >  		targs.maxlen = need - pag->pagf_flcount;
> > > -		targs.resv = XFS_AG_RESV_RMAPBT;
> > > +		targs.resv = XFS_AG_RESV_AGFL;
> > >  
> > >  		/* Allocate as many blocks as possible at once. */
> > >  		error = xfs_alloc_ag_vextent(&targs);
> > > @@ -2860,7 +2854,7 @@ xfs_free_extent(
> > >  	int			error;
> > >  
> > >  	ASSERT(len != 0);
> > > -	ASSERT(type != XFS_AG_RESV_RMAPBT);
> > > +	ASSERT(type != XFS_AG_RESV_AGFL);
> > >  
> > >  	if (XFS_TEST_ERROR(false, mp,
> > >  			XFS_ERRTAG_FREE_EXTENT))
> > > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
> > > index e829c3e489ea..8560091554e0 100644
> > > --- a/fs/xfs/libxfs/xfs_rmap_btree.c
> > > +++ b/fs/xfs/libxfs/xfs_rmap_btree.c
> > > @@ -130,6 +130,8 @@ xfs_rmapbt_alloc_block(
> > >  	be32_add_cpu(&agf->agf_rmap_blocks, 1);
> > >  	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_RMAP_BLOCKS);
> > >  
> > > +	xfs_ag_resv_rmapbt_alloc(cur->bc_mp, cur->bc_private.a.agno);
> > > +
> > >  	XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
> > >  	*stat = 1;
> > >  	return 0;
> > > @@ -158,6 +160,8 @@ xfs_rmapbt_free_block(
> > >  			      XFS_EXTENT_BUSY_SKIP_DISCARD);
> > >  	xfs_trans_agbtree_delta(cur->bc_tp, -1);
> > >  
> > > +	xfs_ag_resv_rmapbt_free(cur->bc_mp, cur->bc_private.a.agno);
> > > +
> > >  	return 0;
> > >  }
> > >  
> > > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > > index cf84288b65ca..7aae5af71aba 100644
> > > --- a/fs/xfs/xfs_mount.h
> > > +++ b/fs/xfs/xfs_mount.h
> > > @@ -326,6 +326,7 @@ xfs_daddr_to_agbno(struct xfs_mount *mp, xfs_daddr_t d)
> > >  /* per-AG block reservation data structures*/
> > >  enum xfs_ag_resv_type {
> > >  	XFS_AG_RESV_NONE = 0,
> > > +	XFS_AG_RESV_AGFL,
> > >  	XFS_AG_RESV_METADATA,
> > >  	XFS_AG_RESV_RMAPBT,
> > >  };
> > > -- 
> > > 2.13.6
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
Brian Foster Feb. 8, 2018, 1:19 p.m. UTC | #4
On Wed, Feb 07, 2018 at 06:20:37PM -0800, Darrick J. Wong wrote:
> On Wed, Feb 07, 2018 at 09:49:35AM -0500, Brian Foster wrote:
> > On Tue, Feb 06, 2018 at 04:03:50PM -0800, Darrick J. Wong wrote:
> > > On Mon, Feb 05, 2018 at 12:46:01PM -0500, Brian Foster wrote:
> > > > The rmapbt perag metadata reservation reserves blocks for the
> > > > reverse mapping btree (rmapbt). Since the rmapbt uses blocks from
> > > > the agfl and perag accounting is updated as blocks are allocated
> > > > from the allocation btrees, the reservation actually accounts blocks
> > > > as they are allocated to (or freed from) the agfl rather than the
> > > > rmapbt itself.
> > > > 
> > > > While this works for blocks that are eventually used for the rmapbt,
> > > > not all agfl blocks are destined for the rmapbt. Blocks that are
> > > > allocated to the agfl (and thus "reserved" for the rmapbt) but then
> > > > used by another structure leads to a growing inconsistency over time
> > > > between the runtime tracking of rmapbt usage vs. actual rmapbt
> > > > usage. Since the runtime tracking thinks all agfl blocks are rmapbt
> > > > blocks, it essentially believes that less future reservation is
> > > > required to satisfy the rmapbt than what is actually necessary.
> > > > 
> > > > The inconsistency is rectified across mount cycles because the perag
> > > > reservation is initialized based on the actual rmapbt usage at mount
> > > > time. The problem, however, is that the excessive drain of the
> > > > reservation at runtime opens a window to allocate blocks for other
> > > > purposes that might be required for the rmapbt on a subsequent
> > > > mount. This problem can be demonstrated by a simple test that runs
> > > > an allocation workload to consume agfl blocks over time and then
> > > > observe the difference in the agfl reservation requirement across an
> > > > unmount/mount cycle:
> > > > 
> > > >   mount ...: xfs_ag_resv_init: ... resv 3193 ask 3194 len 3194
> > > >   ...
> > > >   ...      : xfs_ag_resv_alloc_extent: ... resv 2957 ask 3194 len 1
> > > >   umount...: xfs_ag_resv_free: ... resv 2956 ask 3194 len 0
> > > >   mount ...: xfs_ag_resv_init: ... resv 3052 ask 3194 len 3194
> > > > 
> > > > As the above tracepoints show, the reservation requirement reduces
> > > > from 3194 blocks to 2956 blocks as the workload runs.  Without any
> > > > other changes in the filesystem, the same reservation requirement
> > > > jumps from 2956 to 3052 blocks over a umount/mount cycle.
> > > > 
> > > > To address this divergence, update the RMAPBT reservation to account
> > > > blocks used for the rmapbt only rather than all blocks filled into
> > > > the agfl. This patch makes several high-level changes toward that
> > > > end:
> > > > 
> > > > 1.) Reintroduce an AGFL reservation type to serve as an accounting
> > > >     no-op for blocks allocated to (or freed from) the AGFL.
> > > > 2.) Invoke RMAPBT usage accounting from the actual rmapbt block
> > > >     allocation path rather than the AGFL allocation path.
> > > > 
> > > > The first change is required because agfl blocks are considered free
> > > > blocks throughout their lifetime. The perag reservation subsystem is
> > > > invoked unconditionally by the allocation subsystem, so we need a
> > > > way to tell the perag subsystem (via the allocation subsystem) to
> > > > not make any accounting changes for blocks filled into the AGFL.
> > > > 
> > > > The second change causes the in-core RMAPBT reservation usage
> > > > accounting to remain consistent with the on-disk state at all times
> > > > and eliminates the risk of leaving the rmapbt reservation
> > > > underfilled.
> > > 
> > > Sorry I haven't gotten to this in a couple of days.  This looks like a
> > > reasonable enough fix to the complex perag machinery, though Dave and I
> > > were batting around an idea last week -- what if instead of all these
> > > reservation types and tracking we simply maintained a per-ag free space
> > > reservation of a couple megabytes, handed out blocks if we got
> > > desperate, and replenished that res when we had a good opportunity?
> > > 
> > 
> > Firstly, I essentially view this patch as a bug fix vs. some kind of
> > larger rework. I'm hoping the split of the rename bits helps emphasize
> > that... It's really just a change to accurately account rmapbt block
> > consumption at runtime.
> > 
> > I don't have much context on the broader question you've called out...
> > I assume there was a larger requirement (associated with
> > reflink+rmapbt?) that necessitated this kind of worst case reservation
> > in the first place..?
> 
> Without reflink, we protect fs writes from hitting ENOSPC midway through
> a transaction by reserving all the blocks we think we might need ahead
> of time.  Making the reservation at the fs level works because any
> blocks we might have to allocate to handle the write can be in any AG,
> so if a given AG is out of space we just move on to the next one.
> 
> However, once copy-on-write enters the picture, that last assumption
> is no longer valid because we have to decrease the reference count in
> the specific AG that has the shared blocks.  We now have a requirement
> that said AG also has enough space to handle any refcountbt splits that
> might happen in the process of decreasing the refcount, and that's where
> we get into trouble.  The FS may have plenty of space (so the
> reservation goes through) but the AG may be out of space, in which case
> the refcount update fails and the fs shuts down.
> 
> Therefore, we reserve some free space (by hiding it from the incore
> metadata) and every time we need a block for the refcountbt we get one
> from the reservation, hence we can never run out.
> 

Ok, I thought it had something to do with cases introduced by reflink,
but I can't keep that all in my head. So IIUC essentially the refcountbt
can grow aggressively based on splits of shared extents and whatnot and
the reservation primarily guarantees we'll have blocks in the AG for
those cases.

> > We've since applied it to things like the finobt (which I'm still not
> > totally convinced was the right thing to do based on the vague
> > justification for it), which kind of blurs the line between where it's
> > a requirement vs. nice-to-have/band-aid for me.
> 
> I think the finobt reservation is required: Suppose you have a
> filesystem with a lot of empty files, a lot of single-block files, and a
> lot of big files such that there's no free space anywhere.  Suppose
> further that there's an AG where every finobt block is exactly full,
> there's an inode chunk with exactly 64 inodes in use, and every block in
> that AG is in use (meaning zero AGFL blocks).  Now find one of the empty
> inodes in that totally-full chunk and try to free it.  Truncation
> doesn't free up any blocks, but we have to expand the finobt to add the
> record for the chunk.  We can't find any blocks in that AG so we shut
> down.
> 

Yes, I suppose the problem makes sense (I wish the original commit had
such an explanation :/). We do have the transaction block reservation in
the !perag res case, but I suppose we're susceptible to the same global
reservation problem as above.

Have we considered a per-ag + per-transaction mechanism at any point
through all of this? I ask because something that has been in the back
of my mind (which I think was an idea from Dave originally) for a while
is to simply queue inactive inode processing when it can't run at a
particular point in time, but that depends on actually knowing whether
we can proceed to inactivate an inode or not.

> > > I /think/ the answer to that question is that we need to try to reserve
> > > enough blocks to satisfy the maximum size of the btree (if there is
> > > one), not just a fixed-size delayed free space.  We already hand out
> > > blocks if we get desperate, and we replenish the reservation when we
> > > can.  Lastly, we need the reservation type tracking so that we avoid
> > > things like the refcount btree stealing blocks from the agfl in
> > > desperation.  I've also been wondering if each reservation should get
> > > its own type so that we track refcountbt and finobt blocks separately.
> > > 
> > 
> > In principle, I'm not really for or against the existing mechanism,
> > further granularizing it or changing it out for something else entirely.
> > I'm confused about what exactly we're trying to fix in the first place,
> > though. Is there a larger fundamental problem at play here? Is the
> > reservation (or complexity of the mechanism) simply overkill? Something
> > else..?
> 
> Mostly my insecurity over "Was this the right thing for the job?  What
> even /was/ the job?" :)
> 

Heh, Ok.. fair enough. I guess I do share the sentiment with regard to
certain aspects of the reservation like how it is applied to the finobt,
where I think reserving every possible finobt block up front may be
overkill with respect to the underlying/fundamental problem. I'd have to
think more about the broader/!finobt aspects, though.

Brian

> --D
> 
> > > Otherwise I think this series looks ok, but I'll go over it in more
> > > detail tomorrow.
> > > 
> > 
> > Thanks. No rush..
> > 
> > Brian
> > 
> > > --D
> > > 
> > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > ---
> > > >  fs/xfs/libxfs/xfs_ag_resv.c    |  4 ++++
> > > >  fs/xfs/libxfs/xfs_ag_resv.h    | 31 +++++++++++++++++++++++++++++++
> > > >  fs/xfs/libxfs/xfs_alloc.c      | 18 ++++++------------
> > > >  fs/xfs/libxfs/xfs_rmap_btree.c |  4 ++++
> > > >  fs/xfs/xfs_mount.h             |  1 +
> > > >  5 files changed, 46 insertions(+), 12 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
> > > > index 0ca2e680034a..03885a968de8 100644
> > > > --- a/fs/xfs/libxfs/xfs_ag_resv.c
> > > > +++ b/fs/xfs/libxfs/xfs_ag_resv.c
> > > > @@ -326,6 +326,8 @@ xfs_ag_resv_alloc_extent(
> > > >  	trace_xfs_ag_resv_alloc_extent(pag, type, args->len);
> > > >  
> > > >  	switch (type) {
> > > > +	case XFS_AG_RESV_AGFL:
> > > > +		return;
> > > >  	case XFS_AG_RESV_METADATA:
> > > >  	case XFS_AG_RESV_RMAPBT:
> > > >  		resv = xfs_perag_resv(pag, type);
> > > > @@ -366,6 +368,8 @@ xfs_ag_resv_free_extent(
> > > >  	trace_xfs_ag_resv_free_extent(pag, type, len);
> > > >  
> > > >  	switch (type) {
> > > > +	case XFS_AG_RESV_AGFL:
> > > > +		return;
> > > >  	case XFS_AG_RESV_METADATA:
> > > >  	case XFS_AG_RESV_RMAPBT:
> > > >  		resv = xfs_perag_resv(pag, type);
> > > > diff --git a/fs/xfs/libxfs/xfs_ag_resv.h b/fs/xfs/libxfs/xfs_ag_resv.h
> > > > index 8d6c687deef3..938f2f96c5e8 100644
> > > > --- a/fs/xfs/libxfs/xfs_ag_resv.h
> > > > +++ b/fs/xfs/libxfs/xfs_ag_resv.h
> > > > @@ -32,4 +32,35 @@ void xfs_ag_resv_alloc_extent(struct xfs_perag *pag, enum xfs_ag_resv_type type,
> > > >  void xfs_ag_resv_free_extent(struct xfs_perag *pag, enum xfs_ag_resv_type type,
> > > >  		struct xfs_trans *tp, xfs_extlen_t len);
> > > >  
> > > > +/*
> > > > + * RMAPBT reservation accounting wrappers. Since rmapbt blocks are sourced from
> > > > + * the AGFL, they are allocated one at a time and the reservation updates don't
> > > > + * require a transaction.
> > > > + */
> > > > +static inline void
> > > > +xfs_ag_resv_rmapbt_alloc(
> > > > +	struct xfs_mount	*mp,
> > > > +	xfs_agnumber_t		agno)
> > > > +{
> > > > +	struct xfs_alloc_arg	args = {0};
> > > > +	struct xfs_perag	*pag;
> > > > +
> > > > +	args.len = 1;
> > > > +	pag = xfs_perag_get(mp, agno);
> > > > +	xfs_ag_resv_alloc_extent(pag, XFS_AG_RESV_RMAPBT, &args);
> > > > +	xfs_perag_put(pag);
> > > > +}
> > > > +
> > > > +static inline void
> > > > +xfs_ag_resv_rmapbt_free(
> > > > +	struct xfs_mount	*mp,
> > > > +	xfs_agnumber_t		agno)
> > > > +{
> > > > +	struct xfs_perag	*pag;
> > > > +
> > > > +	pag = xfs_perag_get(mp, agno);
> > > > +	xfs_ag_resv_free_extent(pag, XFS_AG_RESV_RMAPBT, NULL, 1);
> > > > +	xfs_perag_put(pag);
> > > > +}
> > > > +
> > > >  #endif	/* __XFS_AG_RESV_H__ */
> > > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > > > index e53d55453dc5..8c401efb2d74 100644
> > > > --- a/fs/xfs/libxfs/xfs_alloc.c
> > > > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > > > @@ -711,7 +711,7 @@ xfs_alloc_ag_vextent(
> > > >  
> > > >  	ASSERT(args->len >= args->minlen);
> > > >  	ASSERT(args->len <= args->maxlen);
> > > > -	ASSERT(!args->wasfromfl || args->resv != XFS_AG_RESV_RMAPBT);
> > > > +	ASSERT(!args->wasfromfl || args->resv != XFS_AG_RESV_AGFL);
> > > >  	ASSERT(args->agbno % args->alignment == 0);
> > > >  
> > > >  	/* if not file data, insert new block into the reverse map btree */
> > > > @@ -1564,7 +1564,6 @@ xfs_alloc_ag_vextent_small(
> > > >  	int		*stat)	/* status: 0-freelist, 1-normal/none */
> > > >  {
> > > >  	struct xfs_owner_info	oinfo;
> > > > -	struct xfs_perag	*pag;
> > > >  	int		error;
> > > >  	xfs_agblock_t	fbno;
> > > >  	xfs_extlen_t	flen;
> > > > @@ -1583,7 +1582,7 @@ xfs_alloc_ag_vextent_small(
> > > >  	 * freelist.
> > > >  	 */
> > > >  	else if (args->minlen == 1 && args->alignment == 1 &&
> > > > -		 args->resv != XFS_AG_RESV_RMAPBT &&
> > > > +		 args->resv != XFS_AG_RESV_AGFL &&
> > > >  		 (be32_to_cpu(XFS_BUF_TO_AGF(args->agbp)->agf_flcount)
> > > >  		  > args->minleft)) {
> > > >  		error = xfs_alloc_get_freelist(args->tp, args->agbp, &fbno, 0);
> > > > @@ -1616,18 +1615,13 @@ xfs_alloc_ag_vextent_small(
> > > >  			/*
> > > >  			 * If we're feeding an AGFL block to something that
> > > >  			 * doesn't live in the free space, we need to clear
> > > > -			 * out the OWN_AG rmap and add the block back to
> > > > -			 * the RMAPBT per-AG reservation.
> > > > +			 * out the OWN_AG rmap.
> > > >  			 */
> > > >  			xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> > > >  			error = xfs_rmap_free(args->tp, args->agbp, args->agno,
> > > >  					fbno, 1, &oinfo);
> > > >  			if (error)
> > > >  				goto error0;
> > > > -			pag = xfs_perag_get(args->mp, args->agno);
> > > > -			xfs_ag_resv_free_extent(pag, XFS_AG_RESV_RMAPBT,
> > > > -					args->tp, 1);
> > > > -			xfs_perag_put(pag);
> > > >  
> > > >  			*stat = 0;
> > > >  			return 0;
> > > > @@ -2153,7 +2147,7 @@ xfs_alloc_fix_freelist(
> > > >  		if (error)
> > > >  			goto out_agbp_relse;
> > > >  		error = xfs_free_ag_extent(tp, agbp, args->agno, bno, 1,
> > > > -					   &targs.oinfo, XFS_AG_RESV_RMAPBT);
> > > > +					   &targs.oinfo, XFS_AG_RESV_AGFL);
> > > >  		if (error)
> > > >  			goto out_agbp_relse;
> > > >  		bp = xfs_btree_get_bufs(mp, tp, args->agno, bno, 0);
> > > > @@ -2179,7 +2173,7 @@ xfs_alloc_fix_freelist(
> > > >  	while (pag->pagf_flcount < need) {
> > > >  		targs.agbno = 0;
> > > >  		targs.maxlen = need - pag->pagf_flcount;
> > > > -		targs.resv = XFS_AG_RESV_RMAPBT;
> > > > +		targs.resv = XFS_AG_RESV_AGFL;
> > > >  
> > > >  		/* Allocate as many blocks as possible at once. */
> > > >  		error = xfs_alloc_ag_vextent(&targs);
> > > > @@ -2860,7 +2854,7 @@ xfs_free_extent(
> > > >  	int			error;
> > > >  
> > > >  	ASSERT(len != 0);
> > > > -	ASSERT(type != XFS_AG_RESV_RMAPBT);
> > > > +	ASSERT(type != XFS_AG_RESV_AGFL);
> > > >  
> > > >  	if (XFS_TEST_ERROR(false, mp,
> > > >  			XFS_ERRTAG_FREE_EXTENT))
> > > > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
> > > > index e829c3e489ea..8560091554e0 100644
> > > > --- a/fs/xfs/libxfs/xfs_rmap_btree.c
> > > > +++ b/fs/xfs/libxfs/xfs_rmap_btree.c
> > > > @@ -130,6 +130,8 @@ xfs_rmapbt_alloc_block(
> > > >  	be32_add_cpu(&agf->agf_rmap_blocks, 1);
> > > >  	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_RMAP_BLOCKS);
> > > >  
> > > > +	xfs_ag_resv_rmapbt_alloc(cur->bc_mp, cur->bc_private.a.agno);
> > > > +
> > > >  	XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
> > > >  	*stat = 1;
> > > >  	return 0;
> > > > @@ -158,6 +160,8 @@ xfs_rmapbt_free_block(
> > > >  			      XFS_EXTENT_BUSY_SKIP_DISCARD);
> > > >  	xfs_trans_agbtree_delta(cur->bc_tp, -1);
> > > >  
> > > > +	xfs_ag_resv_rmapbt_free(cur->bc_mp, cur->bc_private.a.agno);
> > > > +
> > > >  	return 0;
> > > >  }
> > > >  
> > > > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > > > index cf84288b65ca..7aae5af71aba 100644
> > > > --- a/fs/xfs/xfs_mount.h
> > > > +++ b/fs/xfs/xfs_mount.h
> > > > @@ -326,6 +326,7 @@ xfs_daddr_to_agbno(struct xfs_mount *mp, xfs_daddr_t d)
> > > >  /* per-AG block reservation data structures*/
> > > >  enum xfs_ag_resv_type {
> > > >  	XFS_AG_RESV_NONE = 0,
> > > > +	XFS_AG_RESV_AGFL,
> > > >  	XFS_AG_RESV_METADATA,
> > > >  	XFS_AG_RESV_RMAPBT,
> > > >  };
> > > > -- 
> > > > 2.13.6
> > > > 
Dave Chinner Feb. 8, 2018, 10:49 p.m. UTC | #5
On Thu, Feb 08, 2018 at 08:19:38AM -0500, Brian Foster wrote:
> On Wed, Feb 07, 2018 at 06:20:37PM -0800, Darrick J. Wong wrote:
> > On Wed, Feb 07, 2018 at 09:49:35AM -0500, Brian Foster wrote:
> > > On Tue, Feb 06, 2018 at 04:03:50PM -0800, Darrick J. Wong wrote:
> > > We've since applied it to things like the finobt (which I'm still not
> > > totally convinced was the right thing to do based on the vague
> > > justification for it), which kind of blurs the line between where it's
> > > a requirement vs. nice-to-have/band-aid for me.
> > 
> > I think the finobt reservation is required: Suppose you have a
> > filesystem with a lot of empty files, a lot of single-block files, and a
> > lot of big files such that there's no free space anywhere.  Suppose
> > further that there's an AG where every finobt block is exactly full,
> > there's an inode chunk with exactly 64 inodes in use, and every block in
> > that AG is in use (meaning zero AGFL blocks).  Now find one of the empty
> > inodes in that totally-full chunk and try to free it.  Truncation
> > doesn't free up any blocks, but we have to expand the finobt to add the
> > record for the chunk.  We can't find any blocks in that AG so we shut
> > down.
> > 
> 
> Yes, I suppose the problem makes sense (I wish the original commit had
> such an explanation :/). We do have the transaction block reservation in
> the !perag res case, but I suppose we're susceptible to the same global
> reservation problem as above.
> 
> Have we considered a per-ag + per-transaction mechanism at any point
> through all of this?

That's kind of what I was suggesting to Darrick on IRC a while back.
i.e. the per-ag reservation of at least 4-8MB of space similar to
the global reservation pool we have, and when it dips below that
threshold we reserve more free space.

But yeah, it doesn't completely solve the finobt growth at ENOSPC
problem, but then again the global reservation pool doesn't
completely solve the "run out of free space for IO completion
processing at ENOSPC" problem either. That mechanism is just a
simple solution that is good enough for 99.99% of XFS users, and if
you are outside this there's a way to increase the pool size to make
it more robust (e.g. for those 4096 CPU MPI jobs all doing
concurrent DIO writeback at ENOSPC).

So the question I'm asking here is this: do we need a "perfect
solution" or does a simple, small, dynamic reservation pool provide
"good enough protection" for the vast majority of our users?

> I ask because something that has been in the back
> of my mind (which I think was an idea from Dave originally) for a while
> is to simply queue inactive inode processing when it can't run at a
> particular point in time, but that depends on actually knowing whether
> we can proceed to inactivate an inode or not.

Essentially defer the xfs_ifree() processing step from
xfs_inactive(), right? i.e. leave the inode on the unlinked list
until we've got space to free it? This could be determined by a
simple AG space/resv space check before removing the inode from
the unlinked list...

FWIW, if we keep a list of inactivated but not yet freed inodes for
background processing, we could allocate inodes from that list, too,
simply by removing them from the unlinked list...

Cheers,

Dave.
Brian Foster Feb. 9, 2018, 1:37 p.m. UTC | #6
On Fri, Feb 09, 2018 at 09:49:30AM +1100, Dave Chinner wrote:
> On Thu, Feb 08, 2018 at 08:19:38AM -0500, Brian Foster wrote:
> > On Wed, Feb 07, 2018 at 06:20:37PM -0800, Darrick J. Wong wrote:
> > > On Wed, Feb 07, 2018 at 09:49:35AM -0500, Brian Foster wrote:
> > > > On Tue, Feb 06, 2018 at 04:03:50PM -0800, Darrick J. Wong wrote:
> > > > We've since applied it to things like the finobt (which I'm still not
> > > > totally convinced was the right thing to do based on the vague
> > > > justification for it), which kind of blurs the line between where it's
> > > > a requirement vs. nice-to-have/band-aid for me.
> > > 
> > > I think the finobt reservation is required: Suppose you have a
> > > filesystem with a lot of empty files, a lot of single-block files, and a
> > > lot of big files such that there's no free space anywhere.  Suppose
> > > further that there's an AG where every finobt block is exactly full,
> > > there's an inode chunk with exactly 64 inodes in use, and every block in
> > > that AG is in use (meaning zero AGFL blocks).  Now find one of the empty
> > > inodes in that totally-full chunk and try to free it.  Truncation
> > > doesn't free up any blocks, but we have to expand the finobt to add the
> > > record for the chunk.  We can't find any blocks in that AG so we shut
> > > down.
> > > 
> > 
> > Yes, I suppose the problem makes sense (I wish the original commit had
> > such an explanation :/). We do have the transaction block reservation in
> > the !perag res case, but I suppose we're susceptible to the same global
> > reservation problem as above.
> > 
> > Have we considered a per-ag + per-transaction mechanism at any point
> > through all of this?
> 
> That's kind of what I was suggesting to Darrick on IRC a while back.
> i.e. the per-ag reservation of at least 4-8MB of space similar to
> the global reservation pool we have, and when it dips below that
> threshold we reserve more free space.
> 

I see. FWIW, that's not exactly what I was thinking, but I suspect it's
a very similar idea in terms of just trying to provide a "good enough"
solution (IOW, we may just be quibbling on implementation details here).

Is this new block pool essentially just a low-end pool to be used in
certain, fixed contexts that are at risk of catastrophic failure in
event of ENOSPC? Essentially a backup plan to cover the case where: a
transaction reserves N global blocks, the operation results in some
refcountbt operations, AG is out of blocks so allocations fail, dip into
special AG reserve pool so we don't explode.

That sounds plausible to me, almost like a high level AGFL for
particular operations.

> But yeah, it doesn't completely solve the finobt growth at ENOSPC
> problem, but then again the global reservation pool doesn't
> completely solve the "run out of free space for IO completion
> processing at ENOSPC" problem either. That mechanism is just a
> simple solution that is good enough for 99.99% of XFS users, and if
> you are outside this there's a way to increase the pool size to make
> it more robust (e.g. for those 4096 CPU MPI jobs all doing
> concurrent DIO writeback at ENOSPC).
> 

Yeah. I'm wondering if just refactoring the existing global reservation
pool into something that is AG granular would be a reasonable approach.
E.g., rather than reserve N blocks out of the filesystem that as far as
we know could all source from the same AG, implement a reservation that
reserves roughly N/agcount number of blocks from each AG.
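To make the split concrete, here is a minimal sketch of that carve-up, with an entirely hypothetical helper name (nothing like this exists in XFS today): a global pool of N blocks divided into roughly N/agcount shares, handing the remainder out one block at a time to the lowest-numbered AGs so the shares sum back to N:

```c
#include <assert.h>

/*
 * Hypothetical illustration only: compute one AG's share when a global
 * reserve pool of "total" blocks is split evenly across "agcount" AGs.
 * AGs numbered below the remainder get one extra block, so the per-AG
 * shares always sum to "total".
 */
static unsigned int
per_ag_resv_share(unsigned int total, unsigned int agcount,
		  unsigned int agno)
{
	unsigned int share = total / agcount;
	unsigned int rem = total % agcount;

	return share + (agno < rem ? 1 : 0);
}
```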

> So the question I'm asking here is this: do we need a "perfect
> solution" or does a simple, small, dynamic reservation pool provide
> "good enough protection" for the vast majority of our users?
> 

I'm all for simple + good enough. :) That's primarily been my concern
from the perag + finobt angle... that a full worst-case tree reservation
is overkill when it might just be sufficient to return -ENOSPC in the
nebulous case that motivated it in the first place. IOW, do we really
need these worst case reservations or to worry about how big a reserve
pool needs to be if we can implement a reliable enough mechanism to
reserve blocks for a particular operation to either proceed or fail
gracefully?

Thinking out loud a bit... the idea that popped in my mind was slightly
different in that I was thinking about a per-transaction AG blocks
reservation. E.g., for the purpose of discussion, consider an
xfs_trans_reserve_agblocks() function that reserved N blocks out of a
particular AG for a given tx. This of course requires knowing what AG to
reserve from, so is limited to certain operations, but I think that
should be doable for inode operations. This also requires knowing what
allocations shall be accounted against such tx reservation, so we'd have
to somehow flag/tag inode chunks, inobt blocks, etc. as being AG
reserved allocations (perhaps similar to how delalloc blocks are already
reserved for bmaps).
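The proposal above can be modeled in miniature. Every name below is hypothetical (there is no xfs_trans_reserve_agblocks in the tree); this is just a toy of the lifecycle being described: reserve N blocks against one AG's free count when the transaction starts, consume tagged allocations from that reservation, and return any unused remainder at commit or cancel:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy per-AG state: free blocks, plus blocks held by open transactions. */
struct toy_perag {
	unsigned int	free_blocks;
	unsigned int	resv_blocks;
};

struct toy_tx {
	struct toy_perag	*pag;
	unsigned int		agblocks;	/* unused reservation left */
};

/* Reserve nblocks from one AG for this tx; false models -ENOSPC. */
static bool
toy_trans_reserve_agblocks(struct toy_tx *tp, struct toy_perag *pag,
			   unsigned int nblocks)
{
	if (pag->free_blocks - pag->resv_blocks < nblocks)
		return false;
	pag->resv_blocks += nblocks;
	tp->pag = pag;
	tp->agblocks = nblocks;
	return true;
}

/* An allocation tagged as AG-reserved consumes from the tx reservation. */
static bool
toy_alloc_reserved_block(struct toy_tx *tp)
{
	if (!tp->agblocks)
		return false;
	tp->agblocks--;
	tp->pag->resv_blocks--;
	tp->pag->free_blocks--;
	return true;
}

/* On commit/cancel, any unused reservation goes back to the AG. */
static void
toy_trans_done(struct toy_tx *tp)
{
	tp->pag->resv_blocks -= tp->agblocks;
	tp->agblocks = 0;
}
```

The point of the model is the failure mode: the reserve step is the only place that can fail, so once it succeeds the operation is guaranteed its worst-case blocks from that specific AG.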

I was originally thinking more about inode processing so it's not clear
to me if/how useful this would be for something like reflink. Scanning
through that code a bit, I guess we have to worry about copy-on-write at
writeback time so a transaction is not terribly useful here. What we'd
want is a tx-less reservation at write() time that guarantees we'd have
enough blocks for refcount updates based on the underlying shared
extents that may need to split, etc., right?

Hmm, I'm not sure what the right answer is there. I can certainly see
how a fixed size pool would make this easier, I just wonder if that
could blow up just the same as without it because we're back to just
guessing/hoping it's enough. I think an underlying AG block reservation
mechanism is still potentially useful here, the question is just where
to store/track blocks that would be reserved at cow write() time and
consumed (and/or released) at writeback time? (Then again, I suppose the
same underlying reservation mechanism could be shared to feed tx AG
reservations _and_ a special pool for reflink handling.)

I suppose we could try to design a structure that tracked reservations
associated with a cow fork or something, but that could get hairy
because IIUC the shared extents in the data fork for a particular write
could span multiple AGs (and thus require multiple reservations). It
would be nice if we could define a reservation relative to each extent
or some such that was fixed across this whole cow extent lifecycle and
thus easy to track, consume or release, but the simplest things that
come to mind (e.g., a refcount btree split reservation per shared block)
are too excessive to be realistic. I'd have to think about that some
more... thoughts on any of this?

> > I ask because something that has been in the back
> > of my mind (which I think was an idea from Dave originally) for a while
> > is to simply queue inactive inode processing when it can't run at a
> > particular point in time, but that depends on actually knowing whether
> > we can proceed to inactivate an inode or not.
> 
> Essentially defer the xfs_ifree() processing step from
> xfs_inactive(), right? i.e. leave the inode on the unlinked list
> until we've got space to free it? This could be determined by a
> simple AG space/resv space check before removing the inode from
> the unlinked list...
> 

Yeah, that's essentially what I'm after. A mechanism to tell us reliably
whether an operation will succeed or not is sufficient here so long as we
support the ability to defer freeing unlinked inodes until it is safe to
do so. I don't think the reservation is that difficult for inactive
operations because we 1.) have a transaction to track actual block
consumption and 2.) have a deterministic worst case reservation
requirement.
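That check could be little more than a predicate like the following (hypothetical names; a sketch of the deferral gate, not actual XFS code), evaluated before pulling the inode off the unlinked list:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative only: decide whether to inactivate an inode now or leave
 * it queued on the unlinked list. worst_case is the deterministic
 * worst-case block requirement for the inactivation (e.g. a full
 * finobt split chain); ag_resv is space already promised elsewhere.
 */
static bool
toy_can_inactivate(unsigned int ag_free, unsigned int ag_resv,
		   unsigned int worst_case)
{
	/* Never dip into reserved space; defer if too little remains. */
	return ag_free >= ag_resv + worst_case;
}
```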

> FWIW, if we keep a list of inactivated but not yet freed inodes for
> background processing, we could allocate inodes from that list, too,
> simply by removing them from the unlinked list...
> 

Yep. A neat enhancement, indeed.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
index 0ca2e680034a..03885a968de8 100644
--- a/fs/xfs/libxfs/xfs_ag_resv.c
+++ b/fs/xfs/libxfs/xfs_ag_resv.c
@@ -326,6 +326,8 @@  xfs_ag_resv_alloc_extent(
 	trace_xfs_ag_resv_alloc_extent(pag, type, args->len);
 
 	switch (type) {
+	case XFS_AG_RESV_AGFL:
+		return;
 	case XFS_AG_RESV_METADATA:
 	case XFS_AG_RESV_RMAPBT:
 		resv = xfs_perag_resv(pag, type);
@@ -366,6 +368,8 @@  xfs_ag_resv_free_extent(
 	trace_xfs_ag_resv_free_extent(pag, type, len);
 
 	switch (type) {
+	case XFS_AG_RESV_AGFL:
+		return;
 	case XFS_AG_RESV_METADATA:
 	case XFS_AG_RESV_RMAPBT:
 		resv = xfs_perag_resv(pag, type);
diff --git a/fs/xfs/libxfs/xfs_ag_resv.h b/fs/xfs/libxfs/xfs_ag_resv.h
index 8d6c687deef3..938f2f96c5e8 100644
--- a/fs/xfs/libxfs/xfs_ag_resv.h
+++ b/fs/xfs/libxfs/xfs_ag_resv.h
@@ -32,4 +32,35 @@  void xfs_ag_resv_alloc_extent(struct xfs_perag *pag, enum xfs_ag_resv_type type,
 void xfs_ag_resv_free_extent(struct xfs_perag *pag, enum xfs_ag_resv_type type,
 		struct xfs_trans *tp, xfs_extlen_t len);
 
+/*
+ * RMAPBT reservation accounting wrappers. Since rmapbt blocks are sourced from
+ * the AGFL, they are allocated one at a time and the reservation updates don't
+ * require a transaction.
+ */
+static inline void
+xfs_ag_resv_rmapbt_alloc(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno)
+{
+	struct xfs_alloc_arg	args = {0};
+	struct xfs_perag	*pag;
+
+	args.len = 1;
+	pag = xfs_perag_get(mp, agno);
+	xfs_ag_resv_alloc_extent(pag, XFS_AG_RESV_RMAPBT, &args);
+	xfs_perag_put(pag);
+}
+
+static inline void
+xfs_ag_resv_rmapbt_free(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno)
+{
+	struct xfs_perag	*pag;
+
+	pag = xfs_perag_get(mp, agno);
+	xfs_ag_resv_free_extent(pag, XFS_AG_RESV_RMAPBT, NULL, 1);
+	xfs_perag_put(pag);
+}
+
 #endif	/* __XFS_AG_RESV_H__ */
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index e53d55453dc5..8c401efb2d74 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -711,7 +711,7 @@  xfs_alloc_ag_vextent(
 
 	ASSERT(args->len >= args->minlen);
 	ASSERT(args->len <= args->maxlen);
-	ASSERT(!args->wasfromfl || args->resv != XFS_AG_RESV_RMAPBT);
+	ASSERT(!args->wasfromfl || args->resv != XFS_AG_RESV_AGFL);
 	ASSERT(args->agbno % args->alignment == 0);
 
 	/* if not file data, insert new block into the reverse map btree */
@@ -1564,7 +1564,6 @@  xfs_alloc_ag_vextent_small(
 	int		*stat)	/* status: 0-freelist, 1-normal/none */
 {
 	struct xfs_owner_info	oinfo;
-	struct xfs_perag	*pag;
 	int		error;
 	xfs_agblock_t	fbno;
 	xfs_extlen_t	flen;
@@ -1583,7 +1582,7 @@  xfs_alloc_ag_vextent_small(
 	 * freelist.
 	 */
 	else if (args->minlen == 1 && args->alignment == 1 &&
-		 args->resv != XFS_AG_RESV_RMAPBT &&
+		 args->resv != XFS_AG_RESV_AGFL &&
 		 (be32_to_cpu(XFS_BUF_TO_AGF(args->agbp)->agf_flcount)
 		  > args->minleft)) {
 		error = xfs_alloc_get_freelist(args->tp, args->agbp, &fbno, 0);
@@ -1616,18 +1615,13 @@  xfs_alloc_ag_vextent_small(
 			/*
 			 * If we're feeding an AGFL block to something that
 			 * doesn't live in the free space, we need to clear
-			 * out the OWN_AG rmap and add the block back to
-			 * the RMAPBT per-AG reservation.
+			 * out the OWN_AG rmap.
 			 */
 			xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
 			error = xfs_rmap_free(args->tp, args->agbp, args->agno,
 					fbno, 1, &oinfo);
 			if (error)
 				goto error0;
-			pag = xfs_perag_get(args->mp, args->agno);
-			xfs_ag_resv_free_extent(pag, XFS_AG_RESV_RMAPBT,
-					args->tp, 1);
-			xfs_perag_put(pag);
 
 			*stat = 0;
 			return 0;
@@ -2153,7 +2147,7 @@  xfs_alloc_fix_freelist(
 		if (error)
 			goto out_agbp_relse;
 		error = xfs_free_ag_extent(tp, agbp, args->agno, bno, 1,
-					   &targs.oinfo, XFS_AG_RESV_RMAPBT);
+					   &targs.oinfo, XFS_AG_RESV_AGFL);
 		if (error)
 			goto out_agbp_relse;
 		bp = xfs_btree_get_bufs(mp, tp, args->agno, bno, 0);
@@ -2179,7 +2173,7 @@  xfs_alloc_fix_freelist(
 	while (pag->pagf_flcount < need) {
 		targs.agbno = 0;
 		targs.maxlen = need - pag->pagf_flcount;
-		targs.resv = XFS_AG_RESV_RMAPBT;
+		targs.resv = XFS_AG_RESV_AGFL;
 
 		/* Allocate as many blocks as possible at once. */
 		error = xfs_alloc_ag_vextent(&targs);
@@ -2860,7 +2854,7 @@  xfs_free_extent(
 	int			error;
 
 	ASSERT(len != 0);
-	ASSERT(type != XFS_AG_RESV_RMAPBT);
+	ASSERT(type != XFS_AG_RESV_AGFL);
 
 	if (XFS_TEST_ERROR(false, mp,
 			XFS_ERRTAG_FREE_EXTENT))
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index e829c3e489ea..8560091554e0 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -130,6 +130,8 @@  xfs_rmapbt_alloc_block(
 	be32_add_cpu(&agf->agf_rmap_blocks, 1);
 	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_RMAP_BLOCKS);
 
+	xfs_ag_resv_rmapbt_alloc(cur->bc_mp, cur->bc_private.a.agno);
+
 	XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
 	*stat = 1;
 	return 0;
@@ -158,6 +160,8 @@  xfs_rmapbt_free_block(
 			      XFS_EXTENT_BUSY_SKIP_DISCARD);
 	xfs_trans_agbtree_delta(cur->bc_tp, -1);
 
+	xfs_ag_resv_rmapbt_free(cur->bc_mp, cur->bc_private.a.agno);
+
 	return 0;
 }
 
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index cf84288b65ca..7aae5af71aba 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -326,6 +326,7 @@  xfs_daddr_to_agbno(struct xfs_mount *mp, xfs_daddr_t d)
 /* per-AG block reservation data structures*/
 enum xfs_ag_resv_type {
 	XFS_AG_RESV_NONE = 0,
+	XFS_AG_RESV_AGFL,
 	XFS_AG_RESV_METADATA,
 	XFS_AG_RESV_RMAPBT,
 };