diff mbox

[RFC] xfs: skip discard of unwritten extents

Message ID 20180426180624.6134-1-bfoster@redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Brian Foster April 26, 2018, 6:06 p.m. UTC
Signed-off-by: Brian Foster <bfoster@redhat.com>
---

Hi all,

What do folks think of something like this? The motivation here is that
the VDO (dedup) devs had reported seeing online discards during
write-only workloads. These turn out to be related to trimming post-eof
preallocation blocks after large file copies. To my knowledge, this
isn't really a prevalent or serious issue, but I think that technically
these discards are unnecessary and so I was looking into how we could
avoid them.

This behavior is of course not directly related to unwritten extents,
but the immediate/obvious solution to bubble up a bmapi flag of some
kind to xfs_free_eofblocks() seemed rather crude. From there, I figured
that we technically don't need to discard any unwritten extents (within
or beyond EOF) because they haven't been written to since being
allocated. In fact, I'm not sure we have to even busy them, but it's
roughly equivalent logic either way and I'm trying to avoid getting too
clever.

I also recall that we've discussed using unwritten extents for delalloc
-> real conversion to avoid the small stale data exposure window that
exists in writeback. Without getting too deep into the reason we don't
currently do an initial unwritten allocation [1], I don't think there's
anything blocking us from converting any post-eof blocks that happen to
be part of the resulting normal allocation. As it is, the imap is
already trimmed to EOF by the writeback code for coherency reasons. If
we were to convert post-eof blocks (not part of this patch) along with
something like this patch, then we'd indirectly prevent discards for
eofblocks trims.

Beyond the whole discard thing, conversion of post-eof blocks may have a
couple other advantages. First, we eliminate the aforementioned
writeback stale data exposure problem for writes over preallocated
blocks (which doesn't solve the fundamental problem, but closes the
gap). Second, the zeroing required for post-eof writes that jump over
eofblocks (see xfs_file_aio_write_checks()) becomes a much lighter
weight operation. Normal blocks are zeroed using buffered writes whereas
this is essentially a no-op for unwritten extents.

Thoughts? Flames? Other ideas?

Brian

[1] I think I've actually attempted this change in the past, but I
haven't dug through my old git branches as of yet to completely jog my
memory. IIRC, this may have been held up by the remnants of buffer_heads
being used to track state for the writeback code.

 fs/xfs/libxfs/xfs_alloc.c          | 8 ++++++--
 fs/xfs/libxfs/xfs_alloc.h          | 3 ++-
 fs/xfs/libxfs/xfs_bmap.c           | 9 ++++++---
 fs/xfs/libxfs/xfs_bmap.h           | 3 ++-
 fs/xfs/libxfs/xfs_bmap_btree.c     | 2 +-
 fs/xfs/libxfs/xfs_ialloc.c         | 4 ++--
 fs/xfs/libxfs/xfs_ialloc_btree.c   | 2 +-
 fs/xfs/libxfs/xfs_refcount.c       | 7 ++++---
 fs/xfs/libxfs/xfs_refcount_btree.c | 2 +-
 fs/xfs/xfs_extfree_item.c          | 2 +-
 fs/xfs/xfs_fsops.c                 | 2 +-
 fs/xfs/xfs_reflink.c               | 2 +-
 fs/xfs/xfs_trans.h                 | 3 ++-
 fs/xfs/xfs_trans_extfree.c         | 7 ++++---
 14 files changed, 34 insertions(+), 22 deletions(-)

Comments

Carlos Maiolino April 30, 2018, 9:06 a.m. UTC | #1
Hi,

On Thu, Apr 26, 2018 at 02:06:24PM -0400, Brian Foster wrote:
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
> 
> Hi all,
> 
> What do folks think of something like this? The motivation here is that
> the VDO (dedup) devs had reported seeing online discards during
> write-only workloads. These turn out to be related to trimming post-eof
> preallocation blocks after large file copies. To my knowledge, this
> isn't really a prevalent or serious issue, but I think that technically
> these discards are unnecessary and so I was looking into how we could
> avoid them.
> 
> This behavior is of course not directly related to unwritten extents,
> but the immediate/obvious solution to bubble up a bmapi flag of some
> kind to xfs_free_eofblocks() seemed rather crude. From there, I figured
> that we technically don't need to discard any unwritten extents (within
> or beyond EOF) because they haven't been written to since being
> allocated. In fact, I'm not sure we have to even busy them, but it's
> roughly equivalent logic either way and I'm trying to avoid getting too
> clever.
> 

There is one thing I'm wondering here, pardon me if my assumptions are wrong.

Regarding the discard of post-eof blocks, one thing that comes to my mind, is
what would happen after let's say, a failed fallocate (I've run into this issued
a few months ago, and it just came to my mind while looking into this patch).
Currently, after a failed fallocate (issued beyond EOF), we do not rollback any
successfully allocated extents, keeping such extents allocated to the file,
consuming filesystem's space.

Today, such extents are freed when xfs_free_eofblocks() is called, freeing up
the space partially reserved by the failed fallocate.

I honestly do not remember if in such situation, we might already have issued
any write to the underlying device or not, if we did, issuing a discard here is
still useful.

I do *think* no writes have been issued to the block device so skipping discards
is not an issue, but I thought it might be worth to bring up such case.

The right approach from user is still to rollback its failed fallocate, but I
still think post eof block trim is useful in this case. Though I'm not sure if
the discard itself is useful or not.

Thoughts?

Cheers
> I also recall that we've discussed using unwritten extents for delalloc
> -> real conversion to avoid the small stale data exposure window that
> exists in writeback. Without getting too deep into the reason we don't
> currently do an initial unwritten allocation [1], I don't think there's
> anything blocking us from converting any post-eof blocks that happen to
> be part of the resulting normal allocation. As it is, the imap is
> already trimmed to EOF by the writeback code for coherency reasons. If
> we were to convert post-eof blocks (not part of this patch) along with
> something like this patch, then we'd indirectly prevent discards for
> eofblocks trims.
> 
> Beyond the whole discard thing, conversion of post-eof blocks may have a
> couple other advantages. First, we eliminate the aforementioned
> writeback stale data exposure problem for writes over preallocated
> blocks (which doesn't solve the fundamental problem, but closes the
> gap). Second, the zeroing required for post-eof writes that jump over
> eofblocks (see xfs_file_aio_write_checks()) becomes a much lighter
> weight operation. Normal blocks are zeroed using buffered writes whereas
> this is essentially a no-op for unwritten extents.
> 
> Thoughts? Flames? Other ideas?
> 
> Brian
> 
> [1] I think I've actually attempted this change in the past, but I
> haven't dug through my old git branches as of yet to completely jog my
> memory. IIRC, this may have been held up by the remnants of buffer_heads
> being used to track state for the writeback code.
> 
>  fs/xfs/libxfs/xfs_alloc.c          | 8 ++++++--
>  fs/xfs/libxfs/xfs_alloc.h          | 3 ++-
>  fs/xfs/libxfs/xfs_bmap.c           | 9 ++++++---
>  fs/xfs/libxfs/xfs_bmap.h           | 3 ++-
>  fs/xfs/libxfs/xfs_bmap_btree.c     | 2 +-
>  fs/xfs/libxfs/xfs_ialloc.c         | 4 ++--
>  fs/xfs/libxfs/xfs_ialloc_btree.c   | 2 +-
>  fs/xfs/libxfs/xfs_refcount.c       | 7 ++++---
>  fs/xfs/libxfs/xfs_refcount_btree.c | 2 +-
>  fs/xfs/xfs_extfree_item.c          | 2 +-
>  fs/xfs/xfs_fsops.c                 | 2 +-
>  fs/xfs/xfs_reflink.c               | 2 +-
>  fs/xfs/xfs_trans.h                 | 3 ++-
>  fs/xfs/xfs_trans_extfree.c         | 7 ++++---
>  14 files changed, 34 insertions(+), 22 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 4bcc095fe44a..942c90ec6747 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -2954,13 +2954,15 @@ xfs_free_extent(
>  	xfs_fsblock_t		bno,	/* starting block number of extent */
>  	xfs_extlen_t		len,	/* length of extent */
>  	struct xfs_owner_info	*oinfo,	/* extent owner */
> -	enum xfs_ag_resv_type	type)	/* block reservation type */
> +	enum xfs_ag_resv_type	type,	/* block reservation type */
> +	bool			skip_discard)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
>  	struct xfs_buf		*agbp;
>  	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, bno);
>  	xfs_agblock_t		agbno = XFS_FSB_TO_AGBNO(mp, bno);
>  	int			error;
> +	unsigned int		busy_flags = 0;
>  
>  	ASSERT(len != 0);
>  	ASSERT(type != XFS_AG_RESV_AGFL);
> @@ -2984,7 +2986,9 @@ xfs_free_extent(
>  	if (error)
>  		goto err;
>  
> -	xfs_extent_busy_insert(tp, agno, agbno, len, 0);
> +	if (skip_discard)
> +		busy_flags |= XFS_EXTENT_BUSY_SKIP_DISCARD;
> +	xfs_extent_busy_insert(tp, agno, agbno, len, busy_flags);
>  	return 0;
>  
>  err:
> diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
> index cbf789ea5a4e..5c7d8391edc4 100644
> --- a/fs/xfs/libxfs/xfs_alloc.h
> +++ b/fs/xfs/libxfs/xfs_alloc.h
> @@ -196,7 +196,8 @@ xfs_free_extent(
>  	xfs_fsblock_t		bno,	/* starting block number of extent */
>  	xfs_extlen_t		len,	/* length of extent */
>  	struct xfs_owner_info	*oinfo,	/* extent owner */
> -	enum xfs_ag_resv_type	type);	/* block reservation type */
> +	enum xfs_ag_resv_type	type,	/* block reservation type */
> +	bool			skip_discard);
>  
>  int				/* error */
>  xfs_alloc_lookup_le(
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 040eeda8426f..a5a37803f589 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -547,7 +547,8 @@ xfs_bmap_add_free(
>  	struct xfs_defer_ops		*dfops,
>  	xfs_fsblock_t			bno,
>  	xfs_filblks_t			len,
> -	struct xfs_owner_info		*oinfo)
> +	struct xfs_owner_info		*oinfo,
> +	bool				skip_discard)
>  {
>  	struct xfs_extent_free_item	*new;		/* new element */
>  #ifdef DEBUG
> @@ -574,6 +575,7 @@ xfs_bmap_add_free(
>  		new->xefi_oinfo = *oinfo;
>  	else
>  		xfs_rmap_skip_owner_update(&new->xefi_oinfo);
> +	new->xefi_skip_discard = skip_discard;
>  	trace_xfs_bmap_free_defer(mp, XFS_FSB_TO_AGNO(mp, bno), 0,
>  			XFS_FSB_TO_AGBNO(mp, bno), len);
>  	xfs_defer_add(dfops, XFS_DEFER_OPS_TYPE_FREE, &new->xefi_list);
> @@ -632,7 +634,7 @@ xfs_bmap_btree_to_extents(
>  	if ((error = xfs_btree_check_block(cur, cblock, 0, cbp)))
>  		return error;
>  	xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, whichfork);
> -	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, cbno, 1, &oinfo);
> +	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, cbno, 1, &oinfo, false);
>  	ip->i_d.di_nblocks--;
>  	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT, -1L);
>  	xfs_trans_binval(tp, cbp);
> @@ -5106,7 +5108,8 @@ xfs_bmap_del_extent_real(
>  				goto done;
>  		} else
>  			xfs_bmap_add_free(mp, dfops, del->br_startblock,
> -					del->br_blockcount, NULL);
> +					del->br_blockcount, NULL,
> +					del->br_state == XFS_EXT_UNWRITTEN);
>  	}
>  
>  	/*
> diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> index 2b766b37096d..0d2de22d143e 100644
> --- a/fs/xfs/libxfs/xfs_bmap.h
> +++ b/fs/xfs/libxfs/xfs_bmap.h
> @@ -68,6 +68,7 @@ struct xfs_extent_free_item
>  	xfs_extlen_t		xefi_blockcount;/* number of blocks in extent */
>  	struct list_head	xefi_list;
>  	struct xfs_owner_info	xefi_oinfo;	/* extent owner */
> +	bool			xefi_skip_discard;
>  };
>  
>  #define	XFS_BMAP_MAX_NMAP	4
> @@ -194,7 +195,7 @@ int	xfs_bmap_add_attrfork(struct xfs_inode *ip, int size, int rsvd);
>  void	xfs_bmap_local_to_extents_empty(struct xfs_inode *ip, int whichfork);
>  void	xfs_bmap_add_free(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
>  			  xfs_fsblock_t bno, xfs_filblks_t len,
> -			  struct xfs_owner_info *oinfo);
> +			  struct xfs_owner_info *oinfo, bool skip_discard);
>  void	xfs_bmap_compute_maxlevels(struct xfs_mount *mp, int whichfork);
>  int	xfs_bmap_first_unused(struct xfs_trans *tp, struct xfs_inode *ip,
>  		xfs_extlen_t len, xfs_fileoff_t *unused, int whichfork);
> diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
> index d89d06bea6e3..cecddfcbe11e 100644
> --- a/fs/xfs/libxfs/xfs_bmap_btree.c
> +++ b/fs/xfs/libxfs/xfs_bmap_btree.c
> @@ -305,7 +305,7 @@ xfs_bmbt_free_block(
>  	struct xfs_owner_info	oinfo;
>  
>  	xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, cur->bc_private.b.whichfork);
> -	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, fsbno, 1, &oinfo);
> +	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, fsbno, 1, &oinfo, false);
>  	ip->i_d.di_nblocks--;
>  
>  	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index de627fa19168..854ebe04c86f 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -1837,7 +1837,7 @@ xfs_difree_inode_chunk(
>  	if (!xfs_inobt_issparse(rec->ir_holemask)) {
>  		/* not sparse, calculate extent info directly */
>  		xfs_bmap_add_free(mp, dfops, XFS_AGB_TO_FSB(mp, agno, sagbno),
> -				  mp->m_ialloc_blks, &oinfo);
> +				  mp->m_ialloc_blks, &oinfo, false);
>  		return;
>  	}
>  
> @@ -1881,7 +1881,7 @@ xfs_difree_inode_chunk(
>  		ASSERT(agbno % mp->m_sb.sb_spino_align == 0);
>  		ASSERT(contigblk % mp->m_sb.sb_spino_align == 0);
>  		xfs_bmap_add_free(mp, dfops, XFS_AGB_TO_FSB(mp, agno, agbno),
> -				  contigblk, &oinfo);
> +				  contigblk, &oinfo, false);
>  
>  		/* reset range to current bit and carry on... */
>  		startidx = endidx = nextbit;
> diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
> index 367e9a0726e6..977a33cc60d3 100644
> --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> @@ -153,7 +153,7 @@ __xfs_inobt_free_block(
>  	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_INOBT);
>  	return xfs_free_extent(cur->bc_tp,
>  			XFS_DADDR_TO_FSB(cur->bc_mp, XFS_BUF_ADDR(bp)), 1,
> -			&oinfo, resv);
> +			&oinfo, resv, false);
>  }
>  
>  STATIC int
> diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
> index 560e28473024..e5cfbe2534b1 100644
> --- a/fs/xfs/libxfs/xfs_refcount.c
> +++ b/fs/xfs/libxfs/xfs_refcount.c
> @@ -883,7 +883,8 @@ xfs_refcount_adjust_extents(
>  						cur->bc_private.a.agno,
>  						tmp.rc_startblock);
>  				xfs_bmap_add_free(cur->bc_mp, dfops, fsbno,
> -						tmp.rc_blockcount, oinfo);
> +						tmp.rc_blockcount, oinfo,
> +						false);
>  			}
>  
>  			(*agbno) += tmp.rc_blockcount;
> @@ -926,7 +927,7 @@ xfs_refcount_adjust_extents(
>  					cur->bc_private.a.agno,
>  					ext.rc_startblock);
>  			xfs_bmap_add_free(cur->bc_mp, dfops, fsbno,
> -					ext.rc_blockcount, oinfo);
> +					ext.rc_blockcount, oinfo, false);
>  		}
>  
>  skip:
> @@ -1658,7 +1659,7 @@ xfs_refcount_recover_cow_leftovers(
>  
>  		/* Free the block. */
>  		xfs_bmap_add_free(mp, &dfops, fsb,
> -				rr->rr_rrec.rc_blockcount, NULL);
> +				rr->rr_rrec.rc_blockcount, NULL, false);
>  
>  		error = xfs_defer_finish(&tp, &dfops);
>  		if (error)
> diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
> index 375abfeb6267..bb0bdc6d97b5 100644
> --- a/fs/xfs/libxfs/xfs_refcount_btree.c
> +++ b/fs/xfs/libxfs/xfs_refcount_btree.c
> @@ -131,7 +131,7 @@ xfs_refcountbt_free_block(
>  	be32_add_cpu(&agf->agf_refcount_blocks, -1);
>  	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_REFCOUNT_BLOCKS);
>  	error = xfs_free_extent(cur->bc_tp, fsbno, 1, &oinfo,
> -			XFS_AG_RESV_METADATA);
> +			XFS_AG_RESV_METADATA, false);
>  	if (error)
>  		return error;
>  
> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> index b5b1e567b9f4..4735a31793b0 100644
> --- a/fs/xfs/xfs_extfree_item.c
> +++ b/fs/xfs/xfs_extfree_item.c
> @@ -542,7 +542,7 @@ xfs_efi_recover(
>  	for (i = 0; i < efip->efi_format.efi_nextents; i++) {
>  		extp = &efip->efi_format.efi_extents[i];
>  		error = xfs_trans_free_extent(tp, efdp, extp->ext_start,
> -					      extp->ext_len, &oinfo);
> +					      extp->ext_len, &oinfo, false);
>  		if (error)
>  			goto abort_error;
>  
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index 523792768080..9c555f81431e 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -502,7 +502,7 @@ xfs_growfs_data_private(
>  		error = xfs_free_extent(tp,
>  				XFS_AGB_TO_FSB(mp, agno,
>  					be32_to_cpu(agf->agf_length) - new),
> -				new, &oinfo, XFS_AG_RESV_NONE);
> +				new, &oinfo, XFS_AG_RESV_NONE, false);
>  		if (error)
>  			goto error0;
>  	}
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index cdbd342a5249..08381266ad85 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -604,7 +604,7 @@ xfs_reflink_cancel_cow_blocks(
>  
>  			xfs_bmap_add_free(ip->i_mount, &dfops,
>  					del.br_startblock, del.br_blockcount,
> -					NULL);
> +					NULL, false);
>  
>  			/* Roll the transaction */
>  			xfs_defer_ijoin(&dfops, ip);
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index 9d542dfe0052..d5be3f6a3e8f 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -228,7 +228,8 @@ struct xfs_efd_log_item	*xfs_trans_get_efd(struct xfs_trans *,
>  				  uint);
>  int		xfs_trans_free_extent(struct xfs_trans *,
>  				      struct xfs_efd_log_item *, xfs_fsblock_t,
> -				      xfs_extlen_t, struct xfs_owner_info *);
> +				      xfs_extlen_t, struct xfs_owner_info *,
> +				      bool);
>  int		xfs_trans_commit(struct xfs_trans *);
>  int		xfs_trans_roll(struct xfs_trans **);
>  int		xfs_trans_roll_inode(struct xfs_trans **, struct xfs_inode *);
> diff --git a/fs/xfs/xfs_trans_extfree.c b/fs/xfs/xfs_trans_extfree.c
> index ab438647592a..cd2acfa4e562 100644
> --- a/fs/xfs/xfs_trans_extfree.c
> +++ b/fs/xfs/xfs_trans_extfree.c
> @@ -68,7 +68,8 @@ xfs_trans_free_extent(
>  	struct xfs_efd_log_item	*efdp,
>  	xfs_fsblock_t		start_block,
>  	xfs_extlen_t		ext_len,
> -	struct xfs_owner_info	*oinfo)
> +	struct xfs_owner_info	*oinfo,
> +	bool			skip_discard)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
>  	uint			next_extent;
> @@ -80,7 +81,7 @@ xfs_trans_free_extent(
>  	trace_xfs_bmap_free_deferred(tp->t_mountp, agno, 0, agbno, ext_len);
>  
>  	error = xfs_free_extent(tp, start_block, ext_len, oinfo,
> -			XFS_AG_RESV_NONE);
> +			XFS_AG_RESV_NONE, skip_discard);
>  
>  	/*
>  	 * Mark the transaction dirty, even on error. This ensures the
> @@ -195,7 +196,7 @@ xfs_extent_free_finish_item(
>  	error = xfs_trans_free_extent(tp, done_item,
>  			free->xefi_startblock,
>  			free->xefi_blockcount,
> -			&free->xefi_oinfo);
> +			&free->xefi_oinfo, free->xefi_skip_discard);
>  	kmem_free(free);
>  	return error;
>  }
> -- 
> 2.13.6
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Brian Foster April 30, 2018, 12:17 p.m. UTC | #2
On Mon, Apr 30, 2018 at 11:06:35AM +0200, Carlos Maiolino wrote:
> Hi,
> 
> On Thu, Apr 26, 2018 at 02:06:24PM -0400, Brian Foster wrote:
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> > 
> > Hi all,
> > 
> > What do folks think of something like this? The motivation here is that
> > the VDO (dedup) devs had reported seeing online discards during
> > write-only workloads. These turn out to be related to trimming post-eof
> > preallocation blocks after large file copies. To my knowledge, this
> > isn't really a prevalent or serious issue, but I think that technically
> > these discards are unnecessary and so I was looking into how we could
> > avoid them.
> > 
> > This behavior is of course not directly related to unwritten extents,
> > but the immediate/obvious solution to bubble up a bmapi flag of some
> > kind to xfs_free_eofblocks() seemed rather crude. From there, I figured
> > that we technically don't need to discard any unwritten extents (within
> > or beyond EOF) because they haven't been written to since being
> > allocated. In fact, I'm not sure we have to even busy them, but it's
> > roughly equivalent logic either way and I'm trying to avoid getting too
> > clever.
> > 
> 
> There is one thing I'm wondering here, pardon me if my assumptions are wrong.
> 
> Regarding the discard of post-eof blocks, one thing that comes to my mind, is
> what would happen after let's say, a failed fallocate (I've run into this issued
> a few months ago, and it just came to my mind while looking into this patch).
> Currently, after a failed fallocate (issued beyond EOF), we do not rollback any
> successfully allocated extents, keeping such extents allocated to the file,
> consuming filesystem's space.
> 

Right.. I think this is fairly generic behavior, regardless of whether
some of the extents extend or run past eof. IOW, a failed falloc is
simply a partial falloc operation.

> Today, such extents are freed when xfs_free_eofblocks() is called, freeing up
> the space partially reserved by the failed fallocate.
> 

Hmm, I suspect this is more of an implementation detail than a hard and
fast rule. The eofblocks trimming should not occur on any inode that has
user-driven preallocation, which is marked via XFS_DIFLAG_PREALLOC. I
haven't tested behavior, but on a quick look it appears we set this flag
only after successful completion of xfs_alloc_file_space() in the
fallocate path.

I think that means it's possible for a first fallocate on an inode to
partially succeed allocating blocks and fail to set the flag. This means
an eofblocks trim could affect such blocks. I also think that means it's
possible for a subsequent fallocate to partially succeed where the flag
may have already been set on the inode..? If so, that means a subsequent
eofblocks trim would not affect such blocks.

FWIW, I'm not terribly concerned with error handling behavior in this
case. Ultimately the user will need to fix up the file one way or
another and this patch is more concerned about discard of the resulting
blocks if they are freed (independent of whether they should have been
freed in the first place).

> I honestly do not remember if in such situation, we might already have issued
> any write to the underlying device or not, if we did, issuing a discard here is
> still useful.
> 

eofblocks aside, I think it's always possible for the fs to send writes
and then crash after some of those writes have completed but before
we've converted unwritten extents. If the latter transaction is lost,
the blocks will have been written by an underlying thin device (for
e.g.) but not tracked as such in the fs. Failure to discard such blocks
if they are eventually freed means the underlying device would not
release them.

> I do *think* no writes have been issued to the block device so skipping discards
> is not an issue, but I thought it might be worth to bring up such case.
> 

I think it's a good point wrt to the discard. I'm not totally sure what
the right answer is. On one hand, I think the discards are technically
advisory so it might be a worthwhile tradeoff for a rare situation that
can already be introduced in other ways (such as a discard being ignored
or an intermediate mount without online discard enabled). We could
always recommend an fstrim after a log recovery, for example.

On the other hand, others might think that it's more appropriate to be
aggressive with online discards in the fs as we are today to provide the
device with as much information as possible.

Brian

> The right approach from user is still to rollback its failed fallocate, but I
> still think post eof block trim is useful in this case. Though I'm not sure if
> the discard itself is useful or not.
> 
> Thoughts?
> 
> Cheers
> > I also recall that we've discussed using unwritten extents for delalloc
> > -> real conversion to avoid the small stale data exposure window that
> > exists in writeback. Without getting too deep into the reason we don't
> > currently do an initial unwritten allocation [1], I don't think there's
> > anything blocking us from converting any post-eof blocks that happen to
> > be part of the resulting normal allocation. As it is, the imap is
> > already trimmed to EOF by the writeback code for coherency reasons. If
> > we were to convert post-eof blocks (not part of this patch) along with
> > something like this patch, then we'd indirectly prevent discards for
> > eofblocks trims.
> > 
> > Beyond the whole discard thing, conversion of post-eof blocks may have a
> > couple other advantages. First, we eliminate the aforementioned
> > writeback stale data exposure problem for writes over preallocated
> > blocks (which doesn't solve the fundamental problem, but closes the
> > gap). Second, the zeroing required for post-eof writes that jump over
> > eofblocks (see xfs_file_aio_write_checks()) becomes a much lighter
> > weight operation. Normal blocks are zeroed using buffered writes whereas
> > this is essentially a no-op for unwritten extents.
> > 
> > Thoughts? Flames? Other ideas?
> > 
> > Brian
> > 
> > [1] I think I've actually attempted this change in the past, but I
> > haven't dug through my old git branches as of yet to completely jog my
> > memory. IIRC, this may have been held up by the remnants of buffer_heads
> > being used to track state for the writeback code.
> > 
> >  fs/xfs/libxfs/xfs_alloc.c          | 8 ++++++--
> >  fs/xfs/libxfs/xfs_alloc.h          | 3 ++-
> >  fs/xfs/libxfs/xfs_bmap.c           | 9 ++++++---
> >  fs/xfs/libxfs/xfs_bmap.h           | 3 ++-
> >  fs/xfs/libxfs/xfs_bmap_btree.c     | 2 +-
> >  fs/xfs/libxfs/xfs_ialloc.c         | 4 ++--
> >  fs/xfs/libxfs/xfs_ialloc_btree.c   | 2 +-
> >  fs/xfs/libxfs/xfs_refcount.c       | 7 ++++---
> >  fs/xfs/libxfs/xfs_refcount_btree.c | 2 +-
> >  fs/xfs/xfs_extfree_item.c          | 2 +-
> >  fs/xfs/xfs_fsops.c                 | 2 +-
> >  fs/xfs/xfs_reflink.c               | 2 +-
> >  fs/xfs/xfs_trans.h                 | 3 ++-
> >  fs/xfs/xfs_trans_extfree.c         | 7 ++++---
> >  14 files changed, 34 insertions(+), 22 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > index 4bcc095fe44a..942c90ec6747 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.c
> > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > @@ -2954,13 +2954,15 @@ xfs_free_extent(
> >  	xfs_fsblock_t		bno,	/* starting block number of extent */
> >  	xfs_extlen_t		len,	/* length of extent */
> >  	struct xfs_owner_info	*oinfo,	/* extent owner */
> > -	enum xfs_ag_resv_type	type)	/* block reservation type */
> > +	enum xfs_ag_resv_type	type,	/* block reservation type */
> > +	bool			skip_discard)
> >  {
> >  	struct xfs_mount	*mp = tp->t_mountp;
> >  	struct xfs_buf		*agbp;
> >  	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, bno);
> >  	xfs_agblock_t		agbno = XFS_FSB_TO_AGBNO(mp, bno);
> >  	int			error;
> > +	unsigned int		busy_flags = 0;
> >  
> >  	ASSERT(len != 0);
> >  	ASSERT(type != XFS_AG_RESV_AGFL);
> > @@ -2984,7 +2986,9 @@ xfs_free_extent(
> >  	if (error)
> >  		goto err;
> >  
> > -	xfs_extent_busy_insert(tp, agno, agbno, len, 0);
> > +	if (skip_discard)
> > +		busy_flags |= XFS_EXTENT_BUSY_SKIP_DISCARD;
> > +	xfs_extent_busy_insert(tp, agno, agbno, len, busy_flags);
> >  	return 0;
> >  
> >  err:
> > diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
> > index cbf789ea5a4e..5c7d8391edc4 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.h
> > +++ b/fs/xfs/libxfs/xfs_alloc.h
> > @@ -196,7 +196,8 @@ xfs_free_extent(
> >  	xfs_fsblock_t		bno,	/* starting block number of extent */
> >  	xfs_extlen_t		len,	/* length of extent */
> >  	struct xfs_owner_info	*oinfo,	/* extent owner */
> > -	enum xfs_ag_resv_type	type);	/* block reservation type */
> > +	enum xfs_ag_resv_type	type,	/* block reservation type */
> > +	bool			skip_discard);
> >  
> >  int				/* error */
> >  xfs_alloc_lookup_le(
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index 040eeda8426f..a5a37803f589 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -547,7 +547,8 @@ xfs_bmap_add_free(
> >  	struct xfs_defer_ops		*dfops,
> >  	xfs_fsblock_t			bno,
> >  	xfs_filblks_t			len,
> > -	struct xfs_owner_info		*oinfo)
> > +	struct xfs_owner_info		*oinfo,
> > +	bool				skip_discard)
> >  {
> >  	struct xfs_extent_free_item	*new;		/* new element */
> >  #ifdef DEBUG
> > @@ -574,6 +575,7 @@ xfs_bmap_add_free(
> >  		new->xefi_oinfo = *oinfo;
> >  	else
> >  		xfs_rmap_skip_owner_update(&new->xefi_oinfo);
> > +	new->xefi_skip_discard = skip_discard;
> >  	trace_xfs_bmap_free_defer(mp, XFS_FSB_TO_AGNO(mp, bno), 0,
> >  			XFS_FSB_TO_AGBNO(mp, bno), len);
> >  	xfs_defer_add(dfops, XFS_DEFER_OPS_TYPE_FREE, &new->xefi_list);
> > @@ -632,7 +634,7 @@ xfs_bmap_btree_to_extents(
> >  	if ((error = xfs_btree_check_block(cur, cblock, 0, cbp)))
> >  		return error;
> >  	xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, whichfork);
> > -	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, cbno, 1, &oinfo);
> > +	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, cbno, 1, &oinfo, false);
> >  	ip->i_d.di_nblocks--;
> >  	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT, -1L);
> >  	xfs_trans_binval(tp, cbp);
> > @@ -5106,7 +5108,8 @@ xfs_bmap_del_extent_real(
> >  				goto done;
> >  		} else
> >  			xfs_bmap_add_free(mp, dfops, del->br_startblock,
> > -					del->br_blockcount, NULL);
> > +					del->br_blockcount, NULL,
> > +					del->br_state == XFS_EXT_UNWRITTEN);
> >  	}
> >  
> >  	/*
> > diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> > index 2b766b37096d..0d2de22d143e 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.h
> > +++ b/fs/xfs/libxfs/xfs_bmap.h
> > @@ -68,6 +68,7 @@ struct xfs_extent_free_item
> >  	xfs_extlen_t		xefi_blockcount;/* number of blocks in extent */
> >  	struct list_head	xefi_list;
> >  	struct xfs_owner_info	xefi_oinfo;	/* extent owner */
> > +	bool			xefi_skip_discard;
> >  };
> >  
> >  #define	XFS_BMAP_MAX_NMAP	4
> > @@ -194,7 +195,7 @@ int	xfs_bmap_add_attrfork(struct xfs_inode *ip, int size, int rsvd);
> >  void	xfs_bmap_local_to_extents_empty(struct xfs_inode *ip, int whichfork);
> >  void	xfs_bmap_add_free(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
> >  			  xfs_fsblock_t bno, xfs_filblks_t len,
> > -			  struct xfs_owner_info *oinfo);
> > +			  struct xfs_owner_info *oinfo, bool skip_discard);
> >  void	xfs_bmap_compute_maxlevels(struct xfs_mount *mp, int whichfork);
> >  int	xfs_bmap_first_unused(struct xfs_trans *tp, struct xfs_inode *ip,
> >  		xfs_extlen_t len, xfs_fileoff_t *unused, int whichfork);
> > diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
> > index d89d06bea6e3..cecddfcbe11e 100644
> > --- a/fs/xfs/libxfs/xfs_bmap_btree.c
> > +++ b/fs/xfs/libxfs/xfs_bmap_btree.c
> > @@ -305,7 +305,7 @@ xfs_bmbt_free_block(
> >  	struct xfs_owner_info	oinfo;
> >  
> >  	xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, cur->bc_private.b.whichfork);
> > -	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, fsbno, 1, &oinfo);
> > +	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, fsbno, 1, &oinfo, false);
> >  	ip->i_d.di_nblocks--;
> >  
> >  	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> > diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> > index de627fa19168..854ebe04c86f 100644
> > --- a/fs/xfs/libxfs/xfs_ialloc.c
> > +++ b/fs/xfs/libxfs/xfs_ialloc.c
> > @@ -1837,7 +1837,7 @@ xfs_difree_inode_chunk(
> >  	if (!xfs_inobt_issparse(rec->ir_holemask)) {
> >  		/* not sparse, calculate extent info directly */
> >  		xfs_bmap_add_free(mp, dfops, XFS_AGB_TO_FSB(mp, agno, sagbno),
> > -				  mp->m_ialloc_blks, &oinfo);
> > +				  mp->m_ialloc_blks, &oinfo, false);
> >  		return;
> >  	}
> >  
> > @@ -1881,7 +1881,7 @@ xfs_difree_inode_chunk(
> >  		ASSERT(agbno % mp->m_sb.sb_spino_align == 0);
> >  		ASSERT(contigblk % mp->m_sb.sb_spino_align == 0);
> >  		xfs_bmap_add_free(mp, dfops, XFS_AGB_TO_FSB(mp, agno, agbno),
> > -				  contigblk, &oinfo);
> > +				  contigblk, &oinfo, false);
> >  
> >  		/* reset range to current bit and carry on... */
> >  		startidx = endidx = nextbit;
> > diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
> > index 367e9a0726e6..977a33cc60d3 100644
> > --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> > +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> > @@ -153,7 +153,7 @@ __xfs_inobt_free_block(
> >  	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_INOBT);
> >  	return xfs_free_extent(cur->bc_tp,
> >  			XFS_DADDR_TO_FSB(cur->bc_mp, XFS_BUF_ADDR(bp)), 1,
> > -			&oinfo, resv);
> > +			&oinfo, resv, false);
> >  }
> >  
> >  STATIC int
> > diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
> > index 560e28473024..e5cfbe2534b1 100644
> > --- a/fs/xfs/libxfs/xfs_refcount.c
> > +++ b/fs/xfs/libxfs/xfs_refcount.c
> > @@ -883,7 +883,8 @@ xfs_refcount_adjust_extents(
> >  						cur->bc_private.a.agno,
> >  						tmp.rc_startblock);
> >  				xfs_bmap_add_free(cur->bc_mp, dfops, fsbno,
> > -						tmp.rc_blockcount, oinfo);
> > +						tmp.rc_blockcount, oinfo,
> > +						false);
> >  			}
> >  
> >  			(*agbno) += tmp.rc_blockcount;
> > @@ -926,7 +927,7 @@ xfs_refcount_adjust_extents(
> >  					cur->bc_private.a.agno,
> >  					ext.rc_startblock);
> >  			xfs_bmap_add_free(cur->bc_mp, dfops, fsbno,
> > -					ext.rc_blockcount, oinfo);
> > +					ext.rc_blockcount, oinfo, false);
> >  		}
> >  
> >  skip:
> > @@ -1658,7 +1659,7 @@ xfs_refcount_recover_cow_leftovers(
> >  
> >  		/* Free the block. */
> >  		xfs_bmap_add_free(mp, &dfops, fsb,
> > -				rr->rr_rrec.rc_blockcount, NULL);
> > +				rr->rr_rrec.rc_blockcount, NULL, false);
> >  
> >  		error = xfs_defer_finish(&tp, &dfops);
> >  		if (error)
> > diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
> > index 375abfeb6267..bb0bdc6d97b5 100644
> > --- a/fs/xfs/libxfs/xfs_refcount_btree.c
> > +++ b/fs/xfs/libxfs/xfs_refcount_btree.c
> > @@ -131,7 +131,7 @@ xfs_refcountbt_free_block(
> >  	be32_add_cpu(&agf->agf_refcount_blocks, -1);
> >  	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_REFCOUNT_BLOCKS);
> >  	error = xfs_free_extent(cur->bc_tp, fsbno, 1, &oinfo,
> > -			XFS_AG_RESV_METADATA);
> > +			XFS_AG_RESV_METADATA, false);
> >  	if (error)
> >  		return error;
> >  
> > diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> > index b5b1e567b9f4..4735a31793b0 100644
> > --- a/fs/xfs/xfs_extfree_item.c
> > +++ b/fs/xfs/xfs_extfree_item.c
> > @@ -542,7 +542,7 @@ xfs_efi_recover(
> >  	for (i = 0; i < efip->efi_format.efi_nextents; i++) {
> >  		extp = &efip->efi_format.efi_extents[i];
> >  		error = xfs_trans_free_extent(tp, efdp, extp->ext_start,
> > -					      extp->ext_len, &oinfo);
> > +					      extp->ext_len, &oinfo, false);
> >  		if (error)
> >  			goto abort_error;
> >  
> > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > index 523792768080..9c555f81431e 100644
> > --- a/fs/xfs/xfs_fsops.c
> > +++ b/fs/xfs/xfs_fsops.c
> > @@ -502,7 +502,7 @@ xfs_growfs_data_private(
> >  		error = xfs_free_extent(tp,
> >  				XFS_AGB_TO_FSB(mp, agno,
> >  					be32_to_cpu(agf->agf_length) - new),
> > -				new, &oinfo, XFS_AG_RESV_NONE);
> > +				new, &oinfo, XFS_AG_RESV_NONE, false);
> >  		if (error)
> >  			goto error0;
> >  	}
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index cdbd342a5249..08381266ad85 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -604,7 +604,7 @@ xfs_reflink_cancel_cow_blocks(
> >  
> >  			xfs_bmap_add_free(ip->i_mount, &dfops,
> >  					del.br_startblock, del.br_blockcount,
> > -					NULL);
> > +					NULL, false);
> >  
> >  			/* Roll the transaction */
> >  			xfs_defer_ijoin(&dfops, ip);
> > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > index 9d542dfe0052..d5be3f6a3e8f 100644
> > --- a/fs/xfs/xfs_trans.h
> > +++ b/fs/xfs/xfs_trans.h
> > @@ -228,7 +228,8 @@ struct xfs_efd_log_item	*xfs_trans_get_efd(struct xfs_trans *,
> >  				  uint);
> >  int		xfs_trans_free_extent(struct xfs_trans *,
> >  				      struct xfs_efd_log_item *, xfs_fsblock_t,
> > -				      xfs_extlen_t, struct xfs_owner_info *);
> > +				      xfs_extlen_t, struct xfs_owner_info *,
> > +				      bool);
> >  int		xfs_trans_commit(struct xfs_trans *);
> >  int		xfs_trans_roll(struct xfs_trans **);
> >  int		xfs_trans_roll_inode(struct xfs_trans **, struct xfs_inode *);
> > diff --git a/fs/xfs/xfs_trans_extfree.c b/fs/xfs/xfs_trans_extfree.c
> > index ab438647592a..cd2acfa4e562 100644
> > --- a/fs/xfs/xfs_trans_extfree.c
> > +++ b/fs/xfs/xfs_trans_extfree.c
> > @@ -68,7 +68,8 @@ xfs_trans_free_extent(
> >  	struct xfs_efd_log_item	*efdp,
> >  	xfs_fsblock_t		start_block,
> >  	xfs_extlen_t		ext_len,
> > -	struct xfs_owner_info	*oinfo)
> > +	struct xfs_owner_info	*oinfo,
> > +	bool			skip_discard)
> >  {
> >  	struct xfs_mount	*mp = tp->t_mountp;
> >  	uint			next_extent;
> > @@ -80,7 +81,7 @@ xfs_trans_free_extent(
> >  	trace_xfs_bmap_free_deferred(tp->t_mountp, agno, 0, agbno, ext_len);
> >  
> >  	error = xfs_free_extent(tp, start_block, ext_len, oinfo,
> > -			XFS_AG_RESV_NONE);
> > +			XFS_AG_RESV_NONE, skip_discard);
> >  
> >  	/*
> >  	 * Mark the transaction dirty, even on error. This ensures the
> > @@ -195,7 +196,7 @@ xfs_extent_free_finish_item(
> >  	error = xfs_trans_free_extent(tp, done_item,
> >  			free->xefi_startblock,
> >  			free->xefi_blockcount,
> > -			&free->xefi_oinfo);
> > +			&free->xefi_oinfo, free->xefi_skip_discard);
> >  	kmem_free(free);
> >  	return error;
> >  }
> > -- 
> > 2.13.6
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> -- 
> Carlos
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Carlos Maiolino April 30, 2018, 1:26 p.m. UTC | #3
> > > This behavior is of course not directly related to unwritten extents,
> > > but the immediate/obvious solution to bubble up a bmapi flag of some
> > > kind to xfs_free_eofblocks() seemed rather crude. From there, I figured
> > > that we technically don't need to discard any unwritten extents (within
> > > or beyond EOF) because they haven't been written to since being
> > > allocated. In fact, I'm not sure we have to even busy them, but it's
> > > roughly equivalent logic either way and I'm trying to avoid getting too
> > > clever.
> > > 
> > 
> > There is one thing I'm wondering here, pardon me if my assumptions are wrong.
> > 
> > Regarding the discard of post-eof blocks, one thing that comes to my mind, is
> > what would happen after let's say, a failed fallocate (I've run into this issued
> > a few months ago, and it just came to my mind while looking into this patch).
> > Currently, after a failed fallocate (issued beyond EOF), we do not rollback any
> > successfully allocated extents, keeping such extents allocated to the file,
> > consuming filesystem's space.
> > 
> 
> Right.. I think this is fairly generic behavior, regardless of whether
> some of the extents extend or run past eof. IOW, a failed falloc is
> simply a partial falloc operation.
> 
> > Today, such extents are freed when xfs_free_eofblocks() is called, freeing up
> > the space partially reserved by the failed fallocate.
> > 
> 
> Hmm, I suspect this is more of an implementation detail than a hard and
> fast rule. The eofblocks trimming should not occur on any inode that has
> user-driven preallocation, which is marked via XFS_DIFLAG_PREALLOC. I
> haven't tested behavior, but on a quick look it appears we set this flag
> only after successful completion of xfs_alloc_file_space() in the
> fallocate path.
> 

IIRC for what I've worked/studied during the time I faced this issue, we can end
up with a partial allocation from the failed fallocate whenever we need to
allocate more than one extent, so the successfully allocated extent will be
marked as unwritten, and if such extent is post-eof, a simple close() to the
file will end up calling xfs_free_eofblocks() on such extent.

> I think that means it's possible for a first fallocate on an inode to
> partially succeed allocating blocks and fail to set the flag. This means
> an eofblocks trim could affect such blocks. I also think that means it's
> possible for a subsequent fallocate to partially succeed where the flag
> may have already been set on the inode..?

If the flag has not been already set, I believe it will depends on if at least
one extent will be successfully pre-allocated by the fallocate, I'm not 100%
sure though, I'm writing it trusting on my bad memory :)

>If so, that means a subsequent
> eofblocks trim would not affect such blocks.
> 
> FWIW, I'm not terribly concerned with error handling behavior in this
> case. Ultimately the user will need to fix up the file one way or
> another and this patch is more concerned about discard of the resulting
> blocks if they are freed (independent of whether they should have been
> freed in the first place).
>
Indeed

> > I honestly do not remember if in such situation, we might already have issued
> > any write to the underlying device or not, if we did, issuing a discard here is
> > still useful.
> > 
> 
> eofblocks aside, I think it's always possible for the fs to send writes
> and then crash after some of those writes have completed but before
> we've converted unwritten extents. If the latter transaction is lost,
> the blocks will have been written by an underlying thin device (for
> e.g.) but not tracked as such in the fs. Failure to discard such blocks
> if they are eventually freed means the underlying device would not
> release them.
> 
> > I do *think* no writes have been issued to the block device so skipping discards
> > is not an issue, but I thought it might be worth to bring up such case.
> > 
> 
> I think it's a good point wrt to the discard. I'm not totally sure what
> the right answer is. On one hand, I think the discards are technically
> advisory so it might be a worthwhile tradeoff for a rare situation that
> can already be introduced in other ways (such as a discard being ignored
> or an intermediate mount without online discard enabled). We could
> always recommend an fstrim after a log recovery, for example.
> 
> On the other hand, others might think that it's more appropriate to be
> aggressive with online discards in the fs as we are today to provide the
> device with as much information as possible.
> 

Well, it ends up depending on user preferences, but as I said, it's just a
situation I faced a few months ago and I thought it would be worth to mention,
as you said, in any failed fallocate situation, the user should roll it back,
the post-eof blocks would be freed only in certain situations though, and still,
such blocks will only be automatically freed if and only if they are post EOF,
anything before EOF we won't ever touch unless requested by the user.

> Brian
> 
> > The right approach from user is still to rollback its failed fallocate, but I
> > still think post eof block trim is useful in this case. Though I'm not sure if
> > the discard itself is useful or not.
> > 
> > Thoughts?
> > 
> > Cheers
> > > I also recall that we've discussed using unwritten extents for delalloc
> > > -> real conversion to avoid the small stale data exposure window that
> > > exists in writeback. Without getting too deep into the reason we don't
> > > currently do an initial unwritten allocation [1], I don't think there's
> > > anything blocking us from converting any post-eof blocks that happen to
> > > be part of the resulting normal allocation. As it is, the imap is
> > > already trimmed to EOF by the writeback code for coherency reasons. If
> > > we were to convert post-eof blocks (not part of this patch) along with
> > > something like this patch, then we'd indirectly prevent discards for
> > > eofblocks trims.
> > > 
> > > Beyond the whole discard thing, conversion of post-eof blocks may have a
> > > couple other advantages. First, we eliminate the aforementioned
> > > writeback stale data exposure problem for writes over preallocated
> > > blocks (which doesn't solve the fundamental problem, but closes the
> > > gap). Second, the zeroing required for post-eof writes that jump over
> > > eofblocks (see xfs_file_aio_write_checks()) becomes a much lighter
> > > weight operation. Normal blocks are zeroed using buffered writes whereas
> > > this is essentially a no-op for unwritten extents.
> > > 
> > > Thoughts? Flames? Other ideas?
> > > 
> > > Brian
> > > 
> > > [1] I think I've actually attempted this change in the past, but I
> > > haven't dug through my old git branches as of yet to completely jog my
> > > memory. IIRC, this may have been held up by the remnants of buffer_heads
> > > being used to track state for the writeback code.
> > > 
> > >  fs/xfs/libxfs/xfs_alloc.c          | 8 ++++++--
> > >  fs/xfs/libxfs/xfs_alloc.h          | 3 ++-
> > >  fs/xfs/libxfs/xfs_bmap.c           | 9 ++++++---
> > >  fs/xfs/libxfs/xfs_bmap.h           | 3 ++-
> > >  fs/xfs/libxfs/xfs_bmap_btree.c     | 2 +-
> > >  fs/xfs/libxfs/xfs_ialloc.c         | 4 ++--
> > >  fs/xfs/libxfs/xfs_ialloc_btree.c   | 2 +-
> > >  fs/xfs/libxfs/xfs_refcount.c       | 7 ++++---
> > >  fs/xfs/libxfs/xfs_refcount_btree.c | 2 +-
> > >  fs/xfs/xfs_extfree_item.c          | 2 +-
> > >  fs/xfs/xfs_fsops.c                 | 2 +-
> > >  fs/xfs/xfs_reflink.c               | 2 +-
> > >  fs/xfs/xfs_trans.h                 | 3 ++-
> > >  fs/xfs/xfs_trans_extfree.c         | 7 ++++---
> > >  14 files changed, 34 insertions(+), 22 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > > index 4bcc095fe44a..942c90ec6747 100644
> > > --- a/fs/xfs/libxfs/xfs_alloc.c
> > > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > > @@ -2954,13 +2954,15 @@ xfs_free_extent(
> > >  	xfs_fsblock_t		bno,	/* starting block number of extent */
> > >  	xfs_extlen_t		len,	/* length of extent */
> > >  	struct xfs_owner_info	*oinfo,	/* extent owner */
> > > -	enum xfs_ag_resv_type	type)	/* block reservation type */
> > > +	enum xfs_ag_resv_type	type,	/* block reservation type */
> > > +	bool			skip_discard)
> > >  {
> > >  	struct xfs_mount	*mp = tp->t_mountp;
> > >  	struct xfs_buf		*agbp;
> > >  	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, bno);
> > >  	xfs_agblock_t		agbno = XFS_FSB_TO_AGBNO(mp, bno);
> > >  	int			error;
> > > +	unsigned int		busy_flags = 0;
> > >  
> > >  	ASSERT(len != 0);
> > >  	ASSERT(type != XFS_AG_RESV_AGFL);
> > > @@ -2984,7 +2986,9 @@ xfs_free_extent(
> > >  	if (error)
> > >  		goto err;
> > >  
> > > -	xfs_extent_busy_insert(tp, agno, agbno, len, 0);
> > > +	if (skip_discard)
> > > +		busy_flags |= XFS_EXTENT_BUSY_SKIP_DISCARD;
> > > +	xfs_extent_busy_insert(tp, agno, agbno, len, busy_flags);
> > >  	return 0;
> > >  
> > >  err:
> > > diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
> > > index cbf789ea5a4e..5c7d8391edc4 100644
> > > --- a/fs/xfs/libxfs/xfs_alloc.h
> > > +++ b/fs/xfs/libxfs/xfs_alloc.h
> > > @@ -196,7 +196,8 @@ xfs_free_extent(
> > >  	xfs_fsblock_t		bno,	/* starting block number of extent */
> > >  	xfs_extlen_t		len,	/* length of extent */
> > >  	struct xfs_owner_info	*oinfo,	/* extent owner */
> > > -	enum xfs_ag_resv_type	type);	/* block reservation type */
> > > +	enum xfs_ag_resv_type	type,	/* block reservation type */
> > > +	bool			skip_discard);
> > >  
> > >  int				/* error */
> > >  xfs_alloc_lookup_le(
> > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > > index 040eeda8426f..a5a37803f589 100644
> > > --- a/fs/xfs/libxfs/xfs_bmap.c
> > > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > > @@ -547,7 +547,8 @@ xfs_bmap_add_free(
> > >  	struct xfs_defer_ops		*dfops,
> > >  	xfs_fsblock_t			bno,
> > >  	xfs_filblks_t			len,
> > > -	struct xfs_owner_info		*oinfo)
> > > +	struct xfs_owner_info		*oinfo,
> > > +	bool				skip_discard)
> > >  {
> > >  	struct xfs_extent_free_item	*new;		/* new element */
> > >  #ifdef DEBUG
> > > @@ -574,6 +575,7 @@ xfs_bmap_add_free(
> > >  		new->xefi_oinfo = *oinfo;
> > >  	else
> > >  		xfs_rmap_skip_owner_update(&new->xefi_oinfo);
> > > +	new->xefi_skip_discard = skip_discard;
> > >  	trace_xfs_bmap_free_defer(mp, XFS_FSB_TO_AGNO(mp, bno), 0,
> > >  			XFS_FSB_TO_AGBNO(mp, bno), len);
> > >  	xfs_defer_add(dfops, XFS_DEFER_OPS_TYPE_FREE, &new->xefi_list);
> > > @@ -632,7 +634,7 @@ xfs_bmap_btree_to_extents(
> > >  	if ((error = xfs_btree_check_block(cur, cblock, 0, cbp)))
> > >  		return error;
> > >  	xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, whichfork);
> > > -	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, cbno, 1, &oinfo);
> > > +	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, cbno, 1, &oinfo, false);
> > >  	ip->i_d.di_nblocks--;
> > >  	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT, -1L);
> > >  	xfs_trans_binval(tp, cbp);
> > > @@ -5106,7 +5108,8 @@ xfs_bmap_del_extent_real(
> > >  				goto done;
> > >  		} else
> > >  			xfs_bmap_add_free(mp, dfops, del->br_startblock,
> > > -					del->br_blockcount, NULL);
> > > +					del->br_blockcount, NULL,
> > > +					del->br_state == XFS_EXT_UNWRITTEN);
> > >  	}
> > >  
> > >  	/*
> > > diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> > > index 2b766b37096d..0d2de22d143e 100644
> > > --- a/fs/xfs/libxfs/xfs_bmap.h
> > > +++ b/fs/xfs/libxfs/xfs_bmap.h
> > > @@ -68,6 +68,7 @@ struct xfs_extent_free_item
> > >  	xfs_extlen_t		xefi_blockcount;/* number of blocks in extent */
> > >  	struct list_head	xefi_list;
> > >  	struct xfs_owner_info	xefi_oinfo;	/* extent owner */
> > > +	bool			xefi_skip_discard;
> > >  };
> > >  
> > >  #define	XFS_BMAP_MAX_NMAP	4
> > > @@ -194,7 +195,7 @@ int	xfs_bmap_add_attrfork(struct xfs_inode *ip, int size, int rsvd);
> > >  void	xfs_bmap_local_to_extents_empty(struct xfs_inode *ip, int whichfork);
> > >  void	xfs_bmap_add_free(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
> > >  			  xfs_fsblock_t bno, xfs_filblks_t len,
> > > -			  struct xfs_owner_info *oinfo);
> > > +			  struct xfs_owner_info *oinfo, bool skip_discard);
> > >  void	xfs_bmap_compute_maxlevels(struct xfs_mount *mp, int whichfork);
> > >  int	xfs_bmap_first_unused(struct xfs_trans *tp, struct xfs_inode *ip,
> > >  		xfs_extlen_t len, xfs_fileoff_t *unused, int whichfork);
> > > diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
> > > index d89d06bea6e3..cecddfcbe11e 100644
> > > --- a/fs/xfs/libxfs/xfs_bmap_btree.c
> > > +++ b/fs/xfs/libxfs/xfs_bmap_btree.c
> > > @@ -305,7 +305,7 @@ xfs_bmbt_free_block(
> > >  	struct xfs_owner_info	oinfo;
> > >  
> > >  	xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, cur->bc_private.b.whichfork);
> > > -	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, fsbno, 1, &oinfo);
> > > +	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, fsbno, 1, &oinfo, false);
> > >  	ip->i_d.di_nblocks--;
> > >  
> > >  	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> > > diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> > > index de627fa19168..854ebe04c86f 100644
> > > --- a/fs/xfs/libxfs/xfs_ialloc.c
> > > +++ b/fs/xfs/libxfs/xfs_ialloc.c
> > > @@ -1837,7 +1837,7 @@ xfs_difree_inode_chunk(
> > >  	if (!xfs_inobt_issparse(rec->ir_holemask)) {
> > >  		/* not sparse, calculate extent info directly */
> > >  		xfs_bmap_add_free(mp, dfops, XFS_AGB_TO_FSB(mp, agno, sagbno),
> > > -				  mp->m_ialloc_blks, &oinfo);
> > > +				  mp->m_ialloc_blks, &oinfo, false);
> > >  		return;
> > >  	}
> > >  
> > > @@ -1881,7 +1881,7 @@ xfs_difree_inode_chunk(
> > >  		ASSERT(agbno % mp->m_sb.sb_spino_align == 0);
> > >  		ASSERT(contigblk % mp->m_sb.sb_spino_align == 0);
> > >  		xfs_bmap_add_free(mp, dfops, XFS_AGB_TO_FSB(mp, agno, agbno),
> > > -				  contigblk, &oinfo);
> > > +				  contigblk, &oinfo, false);
> > >  
> > >  		/* reset range to current bit and carry on... */
> > >  		startidx = endidx = nextbit;
> > > diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
> > > index 367e9a0726e6..977a33cc60d3 100644
> > > --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> > > +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> > > @@ -153,7 +153,7 @@ __xfs_inobt_free_block(
> > >  	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_INOBT);
> > >  	return xfs_free_extent(cur->bc_tp,
> > >  			XFS_DADDR_TO_FSB(cur->bc_mp, XFS_BUF_ADDR(bp)), 1,
> > > -			&oinfo, resv);
> > > +			&oinfo, resv, false);
> > >  }
> > >  
> > >  STATIC int
> > > diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
> > > index 560e28473024..e5cfbe2534b1 100644
> > > --- a/fs/xfs/libxfs/xfs_refcount.c
> > > +++ b/fs/xfs/libxfs/xfs_refcount.c
> > > @@ -883,7 +883,8 @@ xfs_refcount_adjust_extents(
> > >  						cur->bc_private.a.agno,
> > >  						tmp.rc_startblock);
> > >  				xfs_bmap_add_free(cur->bc_mp, dfops, fsbno,
> > > -						tmp.rc_blockcount, oinfo);
> > > +						tmp.rc_blockcount, oinfo,
> > > +						false);
> > >  			}
> > >  
> > >  			(*agbno) += tmp.rc_blockcount;
> > > @@ -926,7 +927,7 @@ xfs_refcount_adjust_extents(
> > >  					cur->bc_private.a.agno,
> > >  					ext.rc_startblock);
> > >  			xfs_bmap_add_free(cur->bc_mp, dfops, fsbno,
> > > -					ext.rc_blockcount, oinfo);
> > > +					ext.rc_blockcount, oinfo, false);
> > >  		}
> > >  
> > >  skip:
> > > @@ -1658,7 +1659,7 @@ xfs_refcount_recover_cow_leftovers(
> > >  
> > >  		/* Free the block. */
> > >  		xfs_bmap_add_free(mp, &dfops, fsb,
> > > -				rr->rr_rrec.rc_blockcount, NULL);
> > > +				rr->rr_rrec.rc_blockcount, NULL, false);
> > >  
> > >  		error = xfs_defer_finish(&tp, &dfops);
> > >  		if (error)
> > > diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
> > > index 375abfeb6267..bb0bdc6d97b5 100644
> > > --- a/fs/xfs/libxfs/xfs_refcount_btree.c
> > > +++ b/fs/xfs/libxfs/xfs_refcount_btree.c
> > > @@ -131,7 +131,7 @@ xfs_refcountbt_free_block(
> > >  	be32_add_cpu(&agf->agf_refcount_blocks, -1);
> > >  	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_REFCOUNT_BLOCKS);
> > >  	error = xfs_free_extent(cur->bc_tp, fsbno, 1, &oinfo,
> > > -			XFS_AG_RESV_METADATA);
> > > +			XFS_AG_RESV_METADATA, false);
> > >  	if (error)
> > >  		return error;
> > >  
> > > diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> > > index b5b1e567b9f4..4735a31793b0 100644
> > > --- a/fs/xfs/xfs_extfree_item.c
> > > +++ b/fs/xfs/xfs_extfree_item.c
> > > @@ -542,7 +542,7 @@ xfs_efi_recover(
> > >  	for (i = 0; i < efip->efi_format.efi_nextents; i++) {
> > >  		extp = &efip->efi_format.efi_extents[i];
> > >  		error = xfs_trans_free_extent(tp, efdp, extp->ext_start,
> > > -					      extp->ext_len, &oinfo);
> > > +					      extp->ext_len, &oinfo, false);
> > >  		if (error)
> > >  			goto abort_error;
> > >  
> > > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > > index 523792768080..9c555f81431e 100644
> > > --- a/fs/xfs/xfs_fsops.c
> > > +++ b/fs/xfs/xfs_fsops.c
> > > @@ -502,7 +502,7 @@ xfs_growfs_data_private(
> > >  		error = xfs_free_extent(tp,
> > >  				XFS_AGB_TO_FSB(mp, agno,
> > >  					be32_to_cpu(agf->agf_length) - new),
> > > -				new, &oinfo, XFS_AG_RESV_NONE);
> > > +				new, &oinfo, XFS_AG_RESV_NONE, false);
> > >  		if (error)
> > >  			goto error0;
> > >  	}
> > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > index cdbd342a5249..08381266ad85 100644
> > > --- a/fs/xfs/xfs_reflink.c
> > > +++ b/fs/xfs/xfs_reflink.c
> > > @@ -604,7 +604,7 @@ xfs_reflink_cancel_cow_blocks(
> > >  
> > >  			xfs_bmap_add_free(ip->i_mount, &dfops,
> > >  					del.br_startblock, del.br_blockcount,
> > > -					NULL);
> > > +					NULL, false);
> > >  
> > >  			/* Roll the transaction */
> > >  			xfs_defer_ijoin(&dfops, ip);
> > > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > > index 9d542dfe0052..d5be3f6a3e8f 100644
> > > --- a/fs/xfs/xfs_trans.h
> > > +++ b/fs/xfs/xfs_trans.h
> > > @@ -228,7 +228,8 @@ struct xfs_efd_log_item	*xfs_trans_get_efd(struct xfs_trans *,
> > >  				  uint);
> > >  int		xfs_trans_free_extent(struct xfs_trans *,
> > >  				      struct xfs_efd_log_item *, xfs_fsblock_t,
> > > -				      xfs_extlen_t, struct xfs_owner_info *);
> > > +				      xfs_extlen_t, struct xfs_owner_info *,
> > > +				      bool);
> > >  int		xfs_trans_commit(struct xfs_trans *);
> > >  int		xfs_trans_roll(struct xfs_trans **);
> > >  int		xfs_trans_roll_inode(struct xfs_trans **, struct xfs_inode *);
> > > diff --git a/fs/xfs/xfs_trans_extfree.c b/fs/xfs/xfs_trans_extfree.c
> > > index ab438647592a..cd2acfa4e562 100644
> > > --- a/fs/xfs/xfs_trans_extfree.c
> > > +++ b/fs/xfs/xfs_trans_extfree.c
> > > @@ -68,7 +68,8 @@ xfs_trans_free_extent(
> > >  	struct xfs_efd_log_item	*efdp,
> > >  	xfs_fsblock_t		start_block,
> > >  	xfs_extlen_t		ext_len,
> > > -	struct xfs_owner_info	*oinfo)
> > > +	struct xfs_owner_info	*oinfo,
> > > +	bool			skip_discard)
> > >  {
> > >  	struct xfs_mount	*mp = tp->t_mountp;
> > >  	uint			next_extent;
> > > @@ -80,7 +81,7 @@ xfs_trans_free_extent(
> > >  	trace_xfs_bmap_free_deferred(tp->t_mountp, agno, 0, agbno, ext_len);
> > >  
> > >  	error = xfs_free_extent(tp, start_block, ext_len, oinfo,
> > > -			XFS_AG_RESV_NONE);
> > > +			XFS_AG_RESV_NONE, skip_discard);
> > >  
> > >  	/*
> > >  	 * Mark the transaction dirty, even on error. This ensures the
> > > @@ -195,7 +196,7 @@ xfs_extent_free_finish_item(
> > >  	error = xfs_trans_free_extent(tp, done_item,
> > >  			free->xefi_startblock,
> > >  			free->xefi_blockcount,
> > > -			&free->xefi_oinfo);
> > > +			&free->xefi_oinfo, free->xefi_skip_discard);
> > >  	kmem_free(free);
> > >  	return error;
> > >  }
> > > -- 
> > > 2.13.6
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> > -- 
> > Carlos
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner April 30, 2018, 10 p.m. UTC | #4
On Thu, Apr 26, 2018 at 02:06:24PM -0400, Brian Foster wrote:
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
> 
> Hi all,
> 
> What do folks think of something like this?

Definitely sounds like something we need to address.

> The motivation here is that
> the VDO (dedup) devs had reported seeing online discards during
> write-only workloads. These turn out to be related to trimming post-eof
> preallocation blocks after large file copies. To my knowledge, this
> isn't really a prevalent or serious issue, but I think that technically
> these discards are unnecessary and so I was looking into how we could
> avoid them.

We simply trucate post-eof extents, right? So we know in
xfs_itruncate_extents() if the inode size is changing, not to
mention we know if the extent is beyond EOF? e.g. all calls to
xfs_itruncate_extents() other than xfs_free_eofblocks() change the
inode size and so directly indicate they are removing written
blocks. Anything where the inode size is not changing is doing a
post-eof removal, and so we can assume no data has been written?

So rather than converting everything to unwritten extents, the "skip
discard flag" is simply triggered via extents being freed sitting
beyond the current EOF (not the new EOF) and/or being unwritten?

> This behavior is of course not directly related to unwritten extents,
> but the immediate/obvious solution to bubble up a bmapi flag of some
> kind to xfs_free_eofblocks() seemed rather crude. From there, I figured
> that we technically don't need to discard any unwritten extents (within
> or beyond EOF) because they haven't been written to since being
> allocated. In fact, I'm not sure we have to even busy them, but it's
> roughly equivalent logic either way and I'm trying to avoid getting too
> clever.

I think we still need to busy them to avoid re-allocating them in
the same checkpoint, as data extent free/realloc in the same
checkpoint could result in a failed recovery (i.e. partial
checkpoint replay) leaving the extent linked into two separate
files.

> I also recall that we've discussed using unwritten extents for delalloc
> -> real conversion to avoid the small stale data exposure window that
> exists in writeback. Without getting too deep into the reason we don't
> currently do an initial unwritten allocation [1], I don't think there's
> anything blocking us from converting any post-eof blocks that happen to
> be part of the resulting normal allocation. As it is, the imap is
> already trimmed to EOF by the writeback code for coherency reasons. If
> we were to convert post-eof blocks (not part of this patch) along with
> something like this patch, then we'd indirectly prevent discards for
> eofblocks trims.

I think we should leave that as a separate problem, as writeback
currently has issues with the way we manage bufferhead state.
i.e. things don't work right if we put unwritten extents under
delalloc buffers and vice versa. [ I have patches to address that
I'm working on.] And there's also the issue that we need to change
the delalloc reservations to take into account block allocations
required by unwritten extent conversion needed by delalloc.

Hence I think we should address that as a separate problem, not as
the solution to avoiding discard of post-eof extents.

> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 4bcc095fe44a..942c90ec6747 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -2954,13 +2954,15 @@ xfs_free_extent(
>  	xfs_fsblock_t		bno,	/* starting block number of extent */
>  	xfs_extlen_t		len,	/* length of extent */
>  	struct xfs_owner_info	*oinfo,	/* extent owner */
> -	enum xfs_ag_resv_type	type)	/* block reservation type */
> +	enum xfs_ag_resv_type	type,	/* block reservation type */
> +	bool			skip_discard)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
>  	struct xfs_buf		*agbp;
>  	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, bno);
>  	xfs_agblock_t		agbno = XFS_FSB_TO_AGBNO(mp, bno);
>  	int			error;
> +	unsigned int		busy_flags = 0;
>  
>  	ASSERT(len != 0);
>  	ASSERT(type != XFS_AG_RESV_AGFL);
> @@ -2984,7 +2986,9 @@ xfs_free_extent(
>  	if (error)
>  		goto err;
>  
> -	xfs_extent_busy_insert(tp, agno, agbno, len, 0);
> +	if (skip_discard)
> +		busy_flags |= XFS_EXTENT_BUSY_SKIP_DISCARD;
> +	xfs_extent_busy_insert(tp, agno, agbno, len, busy_flags);
>  	return 0;

Rather than changing xfs_free_extent(), how about adding a
xfs_free_extent_nodiscard() wrapper, and only call it when
processing an extent that doesn't need discard? This means none of
the other code that frees extents needs to be changed. Similarly
add a xfs_bmap_add_free_nodiscard() wrapper. That will cut down on
the code churn caused by passing new parameters everywhere....

Cheers,

Dave.
Brian Foster May 1, 2018, 5:38 p.m. UTC | #5
On Tue, May 01, 2018 at 08:00:20AM +1000, Dave Chinner wrote:
> On Thu, Apr 26, 2018 at 02:06:24PM -0400, Brian Foster wrote:
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> > 
> > Hi all,
> > 
> > What do folks think of something like this?
> 
> Definitely sounds like something we need to address.
> 
> > The motivation here is that
> > the VDO (dedup) devs had reported seeing online discards during
> > write-only workloads. These turn out to be related to trimming post-eof
> > preallocation blocks after large file copies. To my knowledge, this
> > isn't really a prevalent or serious issue, but I think that technically
> > these discards are unnecessary and so I was looking into how we could
> > avoid them.
> 
> We simply trucate post-eof extents, right? So we know in
> xfs_itruncate_extents() if the inode size is changing, not to
> mention we know if the extent is beyond EOF? e.g. all calls to
> xfs_itruncate_extents() other than xfs_free_eofblocks() change the
> inode size and so directly indicate they are removing written
> blocks. Anything where the inode size is not changing is doing a
> post-eof removal, and so we can assume no data has been written?
> 

Yes, xfs_free_eofblocks() is the only caller explicitly responsible for
trimming post-eof extents.

> So rather than converting everything to unwritten extents, the "skip
> discard flag" is simply triggered via extents being freed sitting
> beyond the current EOF (not the new EOF) and/or being unwritten?
> 

That was pretty much my initial thought, but note that the extent free
is ultimately deferred down in xfs_free_eofblocks() ->
xfs_itruncate_extents() -> xfs_bunmapi() -> xfs_bmap_del_extent_real()
-> xfs_bmap_add_free(). We can communicate this down to that point with
an itruncate_extents() parameter and XFS_BMAPI_NODISCARD flag or some
such, it just seemed a bit kludgy to pass that down through those layers
when the unwritten state is known in the bunmapi code (but I'll take
that approach if preferred).

> > This behavior is of course not directly related to unwritten extents,
> > but the immediate/obvious solution to bubble up a bmapi flag of some
> > kind to xfs_free_eofblocks() seemed rather crude. From there, I figured
> > that we technically don't need to discard any unwritten extents (within
> > or beyond EOF) because they haven't been written to since being
> > allocated. In fact, I'm not sure we have to even busy them, but it's
> > roughly equivalent logic either way and I'm trying to avoid getting too
> > clever.
> 
> I think we still need to busy them to avoid re-allocating them in
> the same checkpoint, as data extent free/realloc in the same
> checkpoint could result in a failed recovery (i.e. partial
> checkpoint replay) leaving the extent linked into two separate
> files.
> 

Ah, Ok.. I was only thinking about metadata/data reuse. Hm, isn't the
filesystem essentially corrupted on a failed/partial recovery anyways?
(Not that this matters much in this context, I wasn't planning to bypass
the busy sequence...).

> > I also recall that we've discussed using unwritten extents for delalloc
> > -> real conversion to avoid the small stale data exposure window that
> > exists in writeback. Without getting too deep into the reason we don't
> > currently do an initial unwritten allocation [1], I don't think there's
> > anything blocking us from converting any post-eof blocks that happen to
> > be part of the resulting normal allocation. As it is, the imap is
> > already trimmed to EOF by the writeback code for coherency reasons. If
> > we were to convert post-eof blocks (not part of this patch) along with
> > something like this patch, then we'd indirectly prevent discards for
> > eofblocks trims.
> 
> I think we should leave that as a separate problem, as writeback
> currently has issues with the way we manage bufferhead state.
> i.e. things don't work right if we put unwritten extents under
> delalloc buffers and vice versa. [ I have patches to address that
> I'm working on.] And there's also the issue that we need to change
> the delalloc reservations to take into account block allocations
> required by unwritten extent conversion needed by delalloc.
> 

Right.. I wasn't planning to try and solve the whole buffer head state
mismatch thing as a dependency to not discard eofblocks. I was only
going to convert blocks that happened to be post-eof after the
xfs_iomap_write_allocate() allocation because those blocks by definition
don't have buffers. So it's essentially just another xfs_bmapi_write()
call from xfs_iomap_write_allocate() to convert eofblocks if the just
allocated mapping crosses eof.

> Hence I think we should address that as a separate problem, not as
> the solution to avoiding discard of post-eof extents.
> 

Fair enough, I'll look into tacking on a separate patch to also skip
discards for unwritten extents (irrespective of eof).

> > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > index 4bcc095fe44a..942c90ec6747 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.c
> > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > @@ -2954,13 +2954,15 @@ xfs_free_extent(
> >  	xfs_fsblock_t		bno,	/* starting block number of extent */
> >  	xfs_extlen_t		len,	/* length of extent */
> >  	struct xfs_owner_info	*oinfo,	/* extent owner */
> > -	enum xfs_ag_resv_type	type)	/* block reservation type */
> > +	enum xfs_ag_resv_type	type,	/* block reservation type */
> > +	bool			skip_discard)
> >  {
> >  	struct xfs_mount	*mp = tp->t_mountp;
> >  	struct xfs_buf		*agbp;
> >  	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, bno);
> >  	xfs_agblock_t		agbno = XFS_FSB_TO_AGBNO(mp, bno);
> >  	int			error;
> > +	unsigned int		busy_flags = 0;
> >  
> >  	ASSERT(len != 0);
> >  	ASSERT(type != XFS_AG_RESV_AGFL);
> > @@ -2984,7 +2986,9 @@ xfs_free_extent(
> >  	if (error)
> >  		goto err;
> >  
> > -	xfs_extent_busy_insert(tp, agno, agbno, len, 0);
> > +	if (skip_discard)
> > +		busy_flags |= XFS_EXTENT_BUSY_SKIP_DISCARD;
> > +	xfs_extent_busy_insert(tp, agno, agbno, len, busy_flags);
> >  	return 0;
> 
> Rather than changing xfs_free_extent(), how about adding a
> xfs_free_extent_nodiscard() wrapper, and only call it when
> processing an extent that doesn't need discard? This means none of
> the other code that frees extents needs to be changed. Similarly
> add a xfs_bmap_add_free_nodiscard() wrapper. That will cut down on
> the code churn caused by passing new parameters everywhere....
> 

Ok, I can clean that up. Thanks for the feedback.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner May 1, 2018, 10:39 p.m. UTC | #6
On Tue, May 01, 2018 at 01:38:15PM -0400, Brian Foster wrote:
> On Tue, May 01, 2018 at 08:00:20AM +1000, Dave Chinner wrote:
> > On Thu, Apr 26, 2018 at 02:06:24PM -0400, Brian Foster wrote:
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > > 
> > > Hi all,
> > > 
> > > What do folks think of something like this?
> > 
> > Definitely sounds like something we need to address.
> > 
> > > The motivation here is that
> > > the VDO (dedup) devs had reported seeing online discards during
> > > write-only workloads. These turn out to be related to trimming post-eof
> > > preallocation blocks after large file copies. To my knowledge, this
> > > isn't really a prevalent or serious issue, but I think that technically
> > > these discards are unnecessary and so I was looking into how we could
> > > avoid them.
> > 
> > We simply trucate post-eof extents, right? So we know in
> > xfs_itruncate_extents() if the inode size is changing, not to
> > mention we know if the extent is beyond EOF? e.g. all calls to
> > xfs_itruncate_extents() other than xfs_free_eofblocks() change the
> > inode size and so directly indicate they are removing written
> > blocks. Anything where the inode size is not changing is doing a
> > post-eof removal, and so we can assume no data has been written?
> > 
> 
> Yes, xfs_free_eofblocks() is the only caller explicitly responsible for
> trimming post-eof extents.
> 
> > So rather than converting everything to unwritten extents, the "skip
> > discard flag" is simply triggered via extents being freed sitting
> > beyond the current EOF (not the new EOF) and/or being unwritten?
> > 
> 
> That was pretty much my initial thought, but note that the extent free
> is ultimately deferred down in xfs_free_eofblocks() ->
> xfs_itruncate_extents() -> xfs_bunmapi() -> xfs_bmap_del_extent_real()
> -> xfs_bmap_add_free(). We can communicate this down to that point with
> an itruncate_extents() parameter and XFS_BMAPI_NODISCARD flag or some
> such, it just seemed a bit kludgy to pass that down through those layers
> when the unwritten state is known in the bunmapi code (but I'll take
> that approach if preferred).

Hmmm - don't we already pass the XFS_BMAPI* flags to
xfs_bmap_del_extent_real()? If so, I don't think there's anything
extra that needs plumbing here. Conceptually it seems cleaner to me
to direct extent freeing policy through the bmapi interface flags
than it is add another flag interface elsewhere...

> > > This behavior is of course not directly related to unwritten extents,
> > > but the immediate/obvious solution to bubble up a bmapi flag of some
> > > kind to xfs_free_eofblocks() seemed rather crude. From there, I figured
> > > that we technically don't need to discard any unwritten extents (within
> > > or beyond EOF) because they haven't been written to since being
> > > allocated. In fact, I'm not sure we have to even busy them, but it's
> > > roughly equivalent logic either way and I'm trying to avoid getting too
> > > clever.
> > 
> > I think we still need to busy them to avoid re-allocating them in
> > the same checkpoint, as data extent free/realloc in the same
> > checkpoint could result in a failed recovery (i.e. partial
> > checkpoint replay) leaving the extent linked into two separate
> > files.
> > 
> 
> Ah, Ok.. I was only thinking about metadata/data reuse. Hm, isn't the
> filesystem essentially corrupted on a failed/partial recovery anyways?
> (Not that this matters much in this context, I wasn't planning to bypass
> the busy sequence...).

Yeah, the filesystem will be metadata corrupt, but IIUC skipping the
busy extent state and multiply linking data extents will make it
worse because repair just throws away multiply linked data extents.
i.e. partial log recovery could cause valid user data that should
have been left at rest to be thrown away by repair.

> > > I also recall that we've discussed using unwritten extents for delalloc
> > > -> real conversion to avoid the small stale data exposure window that
> > > exists in writeback. Without getting too deep into the reason we don't
> > > currently do an initial unwritten allocation [1], I don't think there's
> > > anything blocking us from converting any post-eof blocks that happen to
> > > be part of the resulting normal allocation. As it is, the imap is
> > > already trimmed to EOF by the writeback code for coherency reasons. If
> > > we were to convert post-eof blocks (not part of this patch) along with
> > > something like this patch, then we'd indirectly prevent discards for
> > > eofblocks trims.
> > 
> > I think we should leave that as a separate problem, as writeback
> > currently has issues with the way we manage bufferhead state.
> > i.e. things don't work right if we put unwritten extents under
> > delalloc buffers and vice versa. [ I have patches to address that
> > I'm working on.] And there's also the issue that we need to change
> > the delalloc reservations to take into account block allocations
> > required by unwritten extent conversion needed by delalloc.
> > 
> 
> Right.. I wasn't planning to try and solve the whole buffer head state
> mismatch thing as a dependency to not discard eofblocks. I was only
> going to convert blocks that happened to be post-eof after the
> xfs_iomap_write_allocate() allocation because those blocks by definition
> don't have buffers. So it's essentially just another xfs_bmapi_write()
> call from xfs_iomap_write_allocate() to convert eofblocks if the just
> allocated mapping crosses eof.

I think it's a bit more complex than that. We have delalloc buffers
in memory beyond the on disk EOF (i.e. in-memory EOF is far beyond
on disk EOF) but when we do the allocation we would have to mark the
entire extent covering beyond the current IO past the on-disk EOF as
unwritten. Hence the range between the on-disk EOF and the in-memory
EOF ends up with unwritten extents under a delalloc range. Now our
IO completion needs to be different for the next IO beyond EOF,
because it needs to do unwritten completion, not just a file size
update. This is currently problematic because we use the "buffer
head is delalloc" state to determine the IO completion action, not
the on disk extent state.

We could solve this with the current code, but it just gets complex
(even more so than what we have now). I'd much prefer we change over
to doing allocation based on the extent tree state (the patches I
have) rather than bufferhead state first, and then the change to
using unwritten extents for delalloc ranges is nice and
simple...

> > Hence I think we should address that as a separate problem, not as
> > the solution to avoiding discard of post-eof extents.
> > 
> 
> Fair enough, I'll look into tacking on a separate patch to also skip
> discards for unwritten extents (irrespective of eof).

Awesome!

Cheers,

Dave.
Brian Foster May 2, 2018, 11:18 a.m. UTC | #7
On Wed, May 02, 2018 at 08:39:52AM +1000, Dave Chinner wrote:
> On Tue, May 01, 2018 at 01:38:15PM -0400, Brian Foster wrote:
> > On Tue, May 01, 2018 at 08:00:20AM +1000, Dave Chinner wrote:
> > > On Thu, Apr 26, 2018 at 02:06:24PM -0400, Brian Foster wrote:
> > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > ---
> > > > 
> > > > Hi all,
> > > > 
> > > > What do folks think of something like this?
> > > 
> > > Definitely sounds like something we need to address.
> > > 
> > > > The motivation here is that
> > > > the VDO (dedup) devs had reported seeing online discards during
> > > > write-only workloads. These turn out to be related to trimming post-eof
> > > > preallocation blocks after large file copies. To my knowledge, this
> > > > isn't really a prevalent or serious issue, but I think that technically
> > > > these discards are unnecessary and so I was looking into how we could
> > > > avoid them.
> > > 
> > > We simply trucate post-eof extents, right? So we know in
> > > xfs_itruncate_extents() if the inode size is changing, not to
> > > mention we know if the extent is beyond EOF? e.g. all calls to
> > > xfs_itruncate_extents() other than xfs_free_eofblocks() change the
> > > inode size and so directly indicate they are removing written
> > > blocks. Anything where the inode size is not changing is doing a
> > > post-eof removal, and so we can assume no data has been written?
> > > 
> > 
> > Yes, xfs_free_eofblocks() is the only caller explicitly responsible for
> > trimming post-eof extents.
> > 
> > > So rather than converting everything to unwritten extents, the "skip
> > > discard flag" is simply triggered via extents being freed sitting
> > > beyond the current EOF (not the new EOF) and/or being unwritten?
> > > 
> > 
> > That was pretty much my initial thought, but note that the extent free
> > is ultimately deferred down in xfs_free_eofblocks() ->
> > xfs_itruncate_extents() -> xfs_bunmapi() -> xfs_bmap_del_extent_real()
> > -> xfs_bmap_add_free(). We can communicate this down to that point with
> > an itruncate_extents() parameter and XFS_BMAPI_NODISCARD flag or some
> > such, it just seemed a bit kludgy to pass that down through those layers
> > when the unwritten state is known in the bunmapi code (but I'll take
> > that approach if preferred).
> 
> Hmmm - don't we already pass the XFS_BMAPI* flags to
> xfs_bmap_del_extent_real()? If so, I don't think there's anything
> extra that needs plumbing here. Conceptually it seems cleaner to me
> to direct extent freeing policy through the bmapi interface flags
> than it is add another flag interface elsewhere...
> 

Yeah, the bmapi flag will cover down through there. The params (or
wrappers) are needed at the very top (xfs_itruncate_extents()) and
bottom (xfs_bmap_add_free_nodiscard()) of the chain.

The point I was trying to make above is just that the interface for
controlling discards as such for unwritten extents (eofblocks aside..)
is much more simple. We reduce the need to the
xfs_bmap_add_free_nodiscard() helper in that case.

> > > > This behavior is of course not directly related to unwritten extents,
> > > > but the immediate/obvious solution to bubble up a bmapi flag of some
> > > > kind to xfs_free_eofblocks() seemed rather crude. From there, I figured
> > > > that we technically don't need to discard any unwritten extents (within
> > > > or beyond EOF) because they haven't been written to since being
> > > > allocated. In fact, I'm not sure we have to even busy them, but it's
> > > > roughly equivalent logic either way and I'm trying to avoid getting too
> > > > clever.
> > > 
> > > I think we still need to busy them to avoid re-allocating them in
> > > the same checkpoint, as data extent free/realloc in the same
> > > checkpoint could result in a failed recovery (i.e. partial
> > > checkpoint replay) leaving the extent linked into two separate
> > > files.
> > > 
> > 
> > Ah, Ok.. I was only thinking about metadata/data reuse. Hm, isn't the
> > filesystem essentially corrupted on a failed/partial recovery anyways?
> > (Not that this matters much in this context, I wasn't planning to bypass
> > the busy sequence...).
> 
> Yeah, the filesystem will be metadata corrupt, but IIUC skipping the
> busy extent state and multiply linking data extents will make it
> worse because repair just throws away multiply linked data extents.
> i.e. partial log recovery could cause valid user data that should
> have been left at rest to be thrown away by repair.
> 

Interesting.. Ok.

> > > > I also recall that we've discussed using unwritten extents for delalloc
> > > > -> real conversion to avoid the small stale data exposure window that
> > > > exists in writeback. Without getting too deep into the reason we don't
> > > > currently do an initial unwritten allocation [1], I don't think there's
> > > > anything blocking us from converting any post-eof blocks that happen to
> > > > be part of the resulting normal allocation. As it is, the imap is
> > > > already trimmed to EOF by the writeback code for coherency reasons. If
> > > > we were to convert post-eof blocks (not part of this patch) along with
> > > > something like this patch, then we'd indirectly prevent discards for
> > > > eofblocks trims.
> > > 
> > > I think we should leave that as a separate problem, as writeback
> > > currently has issues with the way we manage bufferhead state.
> > > i.e. things don't work right if we put unwritten extents under
> > > delalloc buffers and vice versa. [ I have patches to address that
> > > I'm working on.] And there's also the issue that we need to change
> > > the delalloc reservations to take into account block allocations
> > > required by unwritten extent conversion needed by delalloc.
> > > 
> > 
> > Right.. I wasn't planning to try and solve the whole buffer head state
> > mismatch thing as a dependency to not discard eofblocks. I was only
> > going to convert blocks that happened to be post-eof after the
> > xfs_iomap_write_allocate() allocation because those blocks by definition
> > don't have buffers. So it's essentially just another xfs_bmapi_write()
> > call from xfs_iomap_write_allocate() to convert eofblocks if the just
> > allocated mapping crosses eof.
> 
> I think it's a bit more complex than that. We have delalloc buffers
> in memory beyond the on disk EOF (i.e. in-memory EOF is far beyond
> on disk EOF) but when we do the allocation we would have to mark the
> entire extent covering beyond the current IO past the on-disk EOF as
> unwritten. Hence the range between the on-disk EOF and the in-memory
> EOF ends up with unwritten extents under a delalloc range. Now our
> IO completion needs to be different for the next IO beyond EOF,
> because it needs to do unwritten completion, not just a file size
> update. This is currently problematic because we use the "buffer
> head is delalloc" state to determine the IO completion action, not
> the on disk extent state.
> 

Right, but the conversion would be based on the in-core EOF so we don't
end up with unwritten extents beneath delalloc buffer_head's. In fact,
the intent is to only convert blocks that are guaranteed to not have any
buffer_head coverage at all.

Hmm, I think you're thinking that I want to change some or all of the
existing delalloc -> real allocation to a delalloc -> prealloc
allocation somehow or another such that it affects within-eof state, and
that's not really what I had in mind. Given that I think we're talking
about different things and I'm still not sure what I was thinking is
totally sane, I'll look into tacking on an rfc with an example.

> We could solve this with the current code, but it just gets complex
> (even more so than what we have now). I'd much prefer we change over
> to doing allocation based on the extent tree state (the patches I
> have) rather than bufferhead state first, and then the change to
> using unwritten extents for delalloc ranges is nice and
> simple...
> 

I definitely agree with putting buffer_head -> extent state work ahead
of mucking around with the buffer_heads to try and deal with associated
extent state inconsistencies. I'm aware of the dragons there and so had
no intention of trying to complicate that code for the sake of an
optimization. I'll look more into it and post something tangible if
there's something there worth considering...

Brian

> > > Hence I think we should address that as a separate problem, not as
> > > the solution to avoiding discard of post-eof extents.
> > > 
> > 
> > Fair enough, I'll look into tacking on a separate patch to also skip
> > discards for unwritten extents (irrespective of eof).
> 
> Awesome!
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner May 3, 2018, 12:48 a.m. UTC | #8
On Wed, May 02, 2018 at 07:18:07AM -0400, Brian Foster wrote:
> On Wed, May 02, 2018 at 08:39:52AM +1000, Dave Chinner wrote:
> > On Tue, May 01, 2018 at 01:38:15PM -0400, Brian Foster wrote:
> > > On Tue, May 01, 2018 at 08:00:20AM +1000, Dave Chinner wrote:
> > > > On Thu, Apr 26, 2018 at 02:06:24PM -0400, Brian Foster wrote:
> > > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > > ---
> > > > > 
> > > > > Hi all,
> > > > > 
> > > > > What do folks think of something like this?
> > > > 
> > > > Definitely sounds like something we need to address.
> > > > 
> > > > > The motivation here is that
> > > > > the VDO (dedup) devs had reported seeing online discards during
> > > > > write-only workloads. These turn out to be related to trimming post-eof
> > > > > preallocation blocks after large file copies. To my knowledge, this
> > > > > isn't really a prevalent or serious issue, but I think that technically
> > > > > these discards are unnecessary and so I was looking into how we could
> > > > > avoid them.
> > > > 
> > > > We simply trucate post-eof extents, right? So we know in
> > > > xfs_itruncate_extents() if the inode size is changing, not to
> > > > mention we know if the extent is beyond EOF? e.g. all calls to
> > > > xfs_itruncate_extents() other than xfs_free_eofblocks() change the
> > > > inode size and so directly indicate they are removing written
> > > > blocks. Anything where the inode size is not changing is doing a
> > > > post-eof removal, and so we can assume no data has been written?
> > > > 
> > > 
> > > Yes, xfs_free_eofblocks() is the only caller explicitly responsible for
> > > trimming post-eof extents.
> > > 
> > > > So rather than converting everything to unwritten extents, the "skip
> > > > discard flag" is simply triggered via extents being freed sitting
> > > > beyond the current EOF (not the new EOF) and/or being unwritten?
> > > > 
> > > 
> > > That was pretty much my initial thought, but note that the extent free
> > > is ultimately deferred down in xfs_free_eofblocks() ->
> > > xfs_itruncate_extents() -> xfs_bunmapi() -> xfs_bmap_del_extent_real()
> > > -> xfs_bmap_add_free(). We can communicate this down to that point with
> > > an itruncate_extents() parameter and XFS_BMAPI_NODISCARD flag or some
> > > such, it just seemed a bit kludgy to pass that down through those layers
> > > when the unwritten state is known in the bunmapi code (but I'll take
> > > that approach if preferred).
> > 
> > Hmmm - don't we already pass the XFS_BMAPI* flags to
> > xfs_bmap_del_extent_real()? If so, I don't think there's anything
> > extra that needs plumbing here. Conceptually it seems cleaner to me
> > to direct extent freeing policy through the bmapi interface flags
> > than it is add another flag interface elsewhere...
> > 
> 
> Yeah, the bmapi flag will cover down through there. The params (or
> wrappers) are needed at the very top (xfs_itruncate_extents()) and
> bottom (xfs_bmap_add_free_nodiscard()) of the chain.
> 
> The point I was trying to make above is just that the interface for
> controlling discards as such for unwritten extents (eofblocks aside..)
> is much more simple. We reduce the need to the
> xfs_bmap_add_free_nodiscard() helper in that case.

Yeah, I see that. I was really looking at whether it's a general
optimisation we can make without having to add lots of plumbing,,,

> > > > I think we should leave that as a separate problem, as writeback
> > > > currently has issues with the way we manage bufferhead state.
> > > > i.e. things don't work right if we put unwritten extents under
> > > > delalloc buffers and vice versa. [ I have patches to address that
> > > > I'm working on.] And there's also the issue that we need to change
> > > > the delalloc reservations to take into account block allocations
> > > > required by unwritten extent conversion needed by delalloc.
> > > > 
> > > 
> > > Right.. I wasn't planning to try and solve the whole buffer head state
> > > mismatch thing as a dependency to not discard eofblocks. I was only
> > > going to convert blocks that happened to be post-eof after the
> > > xfs_iomap_write_allocate() allocation because those blocks by definition
> > > don't have buffers. So it's essentially just another xfs_bmapi_write()
> > > call from xfs_iomap_write_allocate() to convert eofblocks if the just
> > > allocated mapping crosses eof.
> > 
> > I think it's a bit more complex than that. We have delalloc buffers
> > in memory beyond the on disk EOF (i.e. in-memory EOF is far beyond
> > on disk EOF) but when we do the allocation we would have to mark the
> > entire extent covering beyond the current IO past the on-disk EOF as
> > unwritten. Hence the range between the on-disk EOF and the in-memory
> > EOF ends up with unwritten extents under a delalloc range. Now our
> > IO completion needs to be different for the next IO beyond EOF,
> > because it needs to do unwritten completion, not just a file size
> > update. This is currently problematic because we use the "buffer
> > head is delalloc" state to determine the IO completion action, not
> > the on disk extent state.
> > 
> 
> Right, but the conversion would be based on the in-core EOF so we don't
> end up with unwritten extents beneath delalloc buffer_head's. In fact,
> the intent is to only convert blocks that are guaranteed to not have any
> buffer_head coverage at all.

Which means if we truncate after the initial allocation, but before
we write the in-core data, we still discard the range between the
on-disk EOF and where the unwritten extent beyond EOF begins. And
for large files, that means we might not be converting any of the
extent allocated beyond EOF at all because we can hmore dirty data
queued up in memory than we can allocate contiguously.

> Hmm, I think you're thinking that I want to change some or all of the
> existing delalloc -> real allocation to a delalloc -> prealloc
> allocation somehow or another such that it affects within-eof state, and
> that's not really what I had in mind.

Not really, I was thinking about how the in memory EOF changes can
affect the on-disk EOF changes and the action that we need to take
when completing an IO and extending EOF. Some of the issues
are the same as for allocations within EOF, but I think there are
some different ones because of the way specualtive preallocation
beyond EOF works....

I suspect we don't really want to change allocation time behaviour
right now, just truncate time behaviour where we know that we've
already locked out incoming IO and the in-memory size cannot
change...

Cheers,

Dave.
Brian Foster May 3, 2018, 12:07 p.m. UTC | #9
On Thu, May 03, 2018 at 10:48:03AM +1000, Dave Chinner wrote:
> On Wed, May 02, 2018 at 07:18:07AM -0400, Brian Foster wrote:
> > On Wed, May 02, 2018 at 08:39:52AM +1000, Dave Chinner wrote:
> > > On Tue, May 01, 2018 at 01:38:15PM -0400, Brian Foster wrote:
> > > > On Tue, May 01, 2018 at 08:00:20AM +1000, Dave Chinner wrote:
> > > > > On Thu, Apr 26, 2018 at 02:06:24PM -0400, Brian Foster wrote:
> > > > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > > > ---
> > > > > > 
> > > > > > Hi all,
> > > > > > 
> > > > > > What do folks think of something like this?
> > > > > 
> > > > > Definitely sounds like something we need to address.
> > > > > 
> > > > > > The motivation here is that
> > > > > > the VDO (dedup) devs had reported seeing online discards during
> > > > > > write-only workloads. These turn out to be related to trimming post-eof
> > > > > > preallocation blocks after large file copies. To my knowledge, this
> > > > > > isn't really a prevalent or serious issue, but I think that technically
> > > > > > these discards are unnecessary and so I was looking into how we could
> > > > > > avoid them.
> > > > > 
> > > > > We simply trucate post-eof extents, right? So we know in
> > > > > xfs_itruncate_extents() if the inode size is changing, not to
> > > > > mention we know if the extent is beyond EOF? e.g. all calls to
> > > > > xfs_itruncate_extents() other than xfs_free_eofblocks() change the
> > > > > inode size and so directly indicate they are removing written
> > > > > blocks. Anything where the inode size is not changing is doing a
> > > > > post-eof removal, and so we can assume no data has been written?
> > > > > 
> > > > 
> > > > Yes, xfs_free_eofblocks() is the only caller explicitly responsible for
> > > > trimming post-eof extents.
> > > > 
> > > > > So rather than converting everything to unwritten extents, the "skip
> > > > > discard flag" is simply triggered via extents being freed sitting
> > > > > beyond the current EOF (not the new EOF) and/or being unwritten?
> > > > > 
> > > > 
> > > > That was pretty much my initial thought, but note that the extent free
> > > > is ultimately deferred down in xfs_free_eofblocks() ->
> > > > xfs_itruncate_extents() -> xfs_bunmapi() -> xfs_bmap_del_extent_real()
> > > > -> xfs_bmap_add_free(). We can communicate this down to that point with
> > > > an itruncate_extents() parameter and XFS_BMAPI_NODISCARD flag or some
> > > > such, it just seemed a bit kludgy to pass that down through those layers
> > > > when the unwritten state is known in the bunmapi code (but I'll take
> > > > that approach if preferred).
> > > 
> > > Hmmm - don't we already pass the XFS_BMAPI* flags to
> > > xfs_bmap_del_extent_real()? If so, I don't think there's anything
> > > extra that needs plumbing here. Conceptually it seems cleaner to me
> > > to direct extent freeing policy through the bmapi interface flags
> > > than it is add another flag interface elsewhere...
> > > 
> > 
> > Yeah, the bmapi flag will cover down through there. The params (or
> > wrappers) are needed at the very top (xfs_itruncate_extents()) and
> > bottom (xfs_bmap_add_free_nodiscard()) of the chain.
> > 
> > The point I was trying to make above is just that the interface for
> > controlling discards as such for unwritten extents (eofblocks aside..)
> > is much more simple. We reduce the need to the
> > xfs_bmap_add_free_nodiscard() helper in that case.
> 
> Yeah, I see that. I was really looking at whether it's a general
> optimisation we can make without having to add lots of plumbing,,,
> 
> > > > > I think we should leave that as a separate problem, as writeback
> > > > > currently has issues with the way we manage bufferhead state.
> > > > > i.e. things don't work right if we put unwritten extents under
> > > > > delalloc buffers and vice versa. [ I have patches to address that
> > > > > I'm working on.] And there's also the issue that we need to change
> > > > > the delalloc reservations to take into account block allocations
> > > > > required by unwritten extent conversion needed by delalloc.
> > > > > 
> > > > 
> > > > Right.. I wasn't planning to try and solve the whole buffer head state
> > > > mismatch thing as a dependency to not discard eofblocks. I was only
> > > > going to convert blocks that happened to be post-eof after the
> > > > xfs_iomap_write_allocate() allocation because those blocks by definition
> > > > don't have buffers. So it's essentially just another xfs_bmapi_write()
> > > > call from xfs_iomap_write_allocate() to convert eofblocks if the just
> > > > allocated mapping crosses eof.
> > > 
> > > I think it's a bit more complex than that. We have delalloc buffers
> > > in memory beyond the on disk EOF (i.e. in-memory EOF is far beyond
> > > on disk EOF) but when we do the allocation we would have to mark the
> > > entire extent covering beyond the current IO past the on-disk EOF as
> > > unwritten. Hence the range between the on-disk EOF and the in-memory
> > > EOF ends up with unwritten extents under a delalloc range. Now our
> > > IO completion needs to be different for the next IO beyond EOF,
> > > because it needs to do unwritten completion, not just a file size
> > > update. This is currently problematic because we use the "buffer
> > > head is delalloc" state to determine the IO completion action, not
> > > the on disk extent state.
> > > 
> > 
> > Right, but the conversion would be based on the in-core EOF so we don't
> > end up with unwritten extents beneath delalloc buffer_head's. In fact,
> > the intent is to only convert blocks that are guaranteed to not have any
> > buffer_head coverage at all.
> 
> Which means if we truncate after the initial allocation, but before
> we write the in-core data, we still discard the range between the
> on-disk EOF and where the unwritten extent beyond EOF begins. And
> for large files, that means we might not be converting any of the
> extent allocated beyond EOF at all because we can hmore dirty data
> queued up in memory than we can allocate contiguously.
> 

It's not clear to me whether you refer to explicit file truncate or
eofblocks trimming, and they have slightly different behaviors in the
context of the proposed "nodiscard" implementation...

With regard to eofblocks trimming, discard between on-disk and in-core
eof occurs regardless because eofblocks trim only ever affects blocks
beyond the in-core EOF. IOW, this is already-defined behavior for
when/how we define prealloc blocks and how to safely clean it up when
unused, etc. This work should not change any of that fundamental
behavior, only skip the discard when an eofblocks trim occurs under the
current framework.

If we explicitly truncate the file, I think we're going to discard
post-eof blocks no matter what even with this patch. Truncate !=
eofblocks trim, that is simply a characteristic of the implementation
that controls nodiscard from xfs_free_eofblocks(). Note that I don't
think that is harmful, it still avoids blatantly unnecessary discards
associated with prealloc management and fundamentally ties into the
eofblocks mechanism. It's just not quite as effective as compared to the
unwritten extents approach. I'm fine with it because it still resolves
the "why the heck am I getting discards on a write-only workload?" issue
the vdo guys ran into.

> > Hmm, I think you're thinking that I want to change some or all of the
> > existing delalloc -> real allocation to a delalloc -> prealloc
> > allocation somehow or another such that it affects within-eof state, and
> > that's not really what I had in mind.
> 
> Not really, I was thinking about how the in memory EOF changes can
> affect the on-disk EOF changes and the action that we need to take
> when completing an IO and extending EOF. Some of the issues
> are the same as for allocations within EOF, but I think there are
> some different ones because of the way specualtive preallocation
> beyond EOF works....
> 
> I suspect we don't really want to change allocation time behaviour
> right now, just truncate time behaviour where we know that we've
> already locked out incoming IO and the in-memory size cannot
> change...
> 

Yep, the stability of the in-core size at allocation/conversion time is
definitely a concern. That's one of the reasons I wasn't totally
convinced of the sanity of the idea. :P

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 4bcc095fe44a..942c90ec6747 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2954,13 +2954,15 @@  xfs_free_extent(
 	xfs_fsblock_t		bno,	/* starting block number of extent */
 	xfs_extlen_t		len,	/* length of extent */
 	struct xfs_owner_info	*oinfo,	/* extent owner */
-	enum xfs_ag_resv_type	type)	/* block reservation type */
+	enum xfs_ag_resv_type	type,	/* block reservation type */
+	bool			skip_discard)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
 	struct xfs_buf		*agbp;
 	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, bno);
 	xfs_agblock_t		agbno = XFS_FSB_TO_AGBNO(mp, bno);
 	int			error;
+	unsigned int		busy_flags = 0;
 
 	ASSERT(len != 0);
 	ASSERT(type != XFS_AG_RESV_AGFL);
@@ -2984,7 +2986,9 @@  xfs_free_extent(
 	if (error)
 		goto err;
 
-	xfs_extent_busy_insert(tp, agno, agbno, len, 0);
+	if (skip_discard)
+		busy_flags |= XFS_EXTENT_BUSY_SKIP_DISCARD;
+	xfs_extent_busy_insert(tp, agno, agbno, len, busy_flags);
 	return 0;
 
 err:
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index cbf789ea5a4e..5c7d8391edc4 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -196,7 +196,8 @@  xfs_free_extent(
 	xfs_fsblock_t		bno,	/* starting block number of extent */
 	xfs_extlen_t		len,	/* length of extent */
 	struct xfs_owner_info	*oinfo,	/* extent owner */
-	enum xfs_ag_resv_type	type);	/* block reservation type */
+	enum xfs_ag_resv_type	type,	/* block reservation type */
+	bool			skip_discard);
 
 int				/* error */
 xfs_alloc_lookup_le(
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 040eeda8426f..a5a37803f589 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -547,7 +547,8 @@  xfs_bmap_add_free(
 	struct xfs_defer_ops		*dfops,
 	xfs_fsblock_t			bno,
 	xfs_filblks_t			len,
-	struct xfs_owner_info		*oinfo)
+	struct xfs_owner_info		*oinfo,
+	bool				skip_discard)
 {
 	struct xfs_extent_free_item	*new;		/* new element */
 #ifdef DEBUG
@@ -574,6 +575,7 @@  xfs_bmap_add_free(
 		new->xefi_oinfo = *oinfo;
 	else
 		xfs_rmap_skip_owner_update(&new->xefi_oinfo);
+	new->xefi_skip_discard = skip_discard;
 	trace_xfs_bmap_free_defer(mp, XFS_FSB_TO_AGNO(mp, bno), 0,
 			XFS_FSB_TO_AGBNO(mp, bno), len);
 	xfs_defer_add(dfops, XFS_DEFER_OPS_TYPE_FREE, &new->xefi_list);
@@ -632,7 +634,7 @@  xfs_bmap_btree_to_extents(
 	if ((error = xfs_btree_check_block(cur, cblock, 0, cbp)))
 		return error;
 	xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, whichfork);
-	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, cbno, 1, &oinfo);
+	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, cbno, 1, &oinfo, false);
 	ip->i_d.di_nblocks--;
 	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT, -1L);
 	xfs_trans_binval(tp, cbp);
@@ -5106,7 +5108,8 @@  xfs_bmap_del_extent_real(
 				goto done;
 		} else
 			xfs_bmap_add_free(mp, dfops, del->br_startblock,
-					del->br_blockcount, NULL);
+					del->br_blockcount, NULL,
+					del->br_state == XFS_EXT_UNWRITTEN);
 	}
 
 	/*
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 2b766b37096d..0d2de22d143e 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -68,6 +68,7 @@  struct xfs_extent_free_item
 	xfs_extlen_t		xefi_blockcount;/* number of blocks in extent */
 	struct list_head	xefi_list;
 	struct xfs_owner_info	xefi_oinfo;	/* extent owner */
+	bool			xefi_skip_discard;
 };
 
 #define	XFS_BMAP_MAX_NMAP	4
@@ -194,7 +195,7 @@  int	xfs_bmap_add_attrfork(struct xfs_inode *ip, int size, int rsvd);
 void	xfs_bmap_local_to_extents_empty(struct xfs_inode *ip, int whichfork);
 void	xfs_bmap_add_free(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
 			  xfs_fsblock_t bno, xfs_filblks_t len,
-			  struct xfs_owner_info *oinfo);
+			  struct xfs_owner_info *oinfo, bool skip_discard);
 void	xfs_bmap_compute_maxlevels(struct xfs_mount *mp, int whichfork);
 int	xfs_bmap_first_unused(struct xfs_trans *tp, struct xfs_inode *ip,
 		xfs_extlen_t len, xfs_fileoff_t *unused, int whichfork);
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index d89d06bea6e3..cecddfcbe11e 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -305,7 +305,7 @@  xfs_bmbt_free_block(
 	struct xfs_owner_info	oinfo;
 
 	xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, cur->bc_private.b.whichfork);
-	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, fsbno, 1, &oinfo);
+	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, fsbno, 1, &oinfo, false);
 	ip->i_d.di_nblocks--;
 
 	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index de627fa19168..854ebe04c86f 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -1837,7 +1837,7 @@  xfs_difree_inode_chunk(
 	if (!xfs_inobt_issparse(rec->ir_holemask)) {
 		/* not sparse, calculate extent info directly */
 		xfs_bmap_add_free(mp, dfops, XFS_AGB_TO_FSB(mp, agno, sagbno),
-				  mp->m_ialloc_blks, &oinfo);
+				  mp->m_ialloc_blks, &oinfo, false);
 		return;
 	}
 
@@ -1881,7 +1881,7 @@  xfs_difree_inode_chunk(
 		ASSERT(agbno % mp->m_sb.sb_spino_align == 0);
 		ASSERT(contigblk % mp->m_sb.sb_spino_align == 0);
 		xfs_bmap_add_free(mp, dfops, XFS_AGB_TO_FSB(mp, agno, agbno),
-				  contigblk, &oinfo);
+				  contigblk, &oinfo, false);
 
 		/* reset range to current bit and carry on... */
 		startidx = endidx = nextbit;
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index 367e9a0726e6..977a33cc60d3 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -153,7 +153,7 @@  __xfs_inobt_free_block(
 	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_INOBT);
 	return xfs_free_extent(cur->bc_tp,
 			XFS_DADDR_TO_FSB(cur->bc_mp, XFS_BUF_ADDR(bp)), 1,
-			&oinfo, resv);
+			&oinfo, resv, false);
 }
 
 STATIC int
diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index 560e28473024..e5cfbe2534b1 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -883,7 +883,8 @@  xfs_refcount_adjust_extents(
 						cur->bc_private.a.agno,
 						tmp.rc_startblock);
 				xfs_bmap_add_free(cur->bc_mp, dfops, fsbno,
-						tmp.rc_blockcount, oinfo);
+						tmp.rc_blockcount, oinfo,
+						false);
 			}
 
 			(*agbno) += tmp.rc_blockcount;
@@ -926,7 +927,7 @@  xfs_refcount_adjust_extents(
 					cur->bc_private.a.agno,
 					ext.rc_startblock);
 			xfs_bmap_add_free(cur->bc_mp, dfops, fsbno,
-					ext.rc_blockcount, oinfo);
+					ext.rc_blockcount, oinfo, false);
 		}
 
 skip:
@@ -1658,7 +1659,7 @@  xfs_refcount_recover_cow_leftovers(
 
 		/* Free the block. */
 		xfs_bmap_add_free(mp, &dfops, fsb,
-				rr->rr_rrec.rc_blockcount, NULL);
+				rr->rr_rrec.rc_blockcount, NULL, false);
 
 		error = xfs_defer_finish(&tp, &dfops);
 		if (error)
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index 375abfeb6267..bb0bdc6d97b5 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -131,7 +131,7 @@  xfs_refcountbt_free_block(
 	be32_add_cpu(&agf->agf_refcount_blocks, -1);
 	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_REFCOUNT_BLOCKS);
 	error = xfs_free_extent(cur->bc_tp, fsbno, 1, &oinfo,
-			XFS_AG_RESV_METADATA);
+			XFS_AG_RESV_METADATA, false);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index b5b1e567b9f4..4735a31793b0 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -542,7 +542,7 @@  xfs_efi_recover(
 	for (i = 0; i < efip->efi_format.efi_nextents; i++) {
 		extp = &efip->efi_format.efi_extents[i];
 		error = xfs_trans_free_extent(tp, efdp, extp->ext_start,
-					      extp->ext_len, &oinfo);
+					      extp->ext_len, &oinfo, false);
 		if (error)
 			goto abort_error;
 
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 523792768080..9c555f81431e 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -502,7 +502,7 @@  xfs_growfs_data_private(
 		error = xfs_free_extent(tp,
 				XFS_AGB_TO_FSB(mp, agno,
 					be32_to_cpu(agf->agf_length) - new),
-				new, &oinfo, XFS_AG_RESV_NONE);
+				new, &oinfo, XFS_AG_RESV_NONE, false);
 		if (error)
 			goto error0;
 	}
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index cdbd342a5249..08381266ad85 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -604,7 +604,7 @@  xfs_reflink_cancel_cow_blocks(
 
 			xfs_bmap_add_free(ip->i_mount, &dfops,
 					del.br_startblock, del.br_blockcount,
-					NULL);
+					NULL, false);
 
 			/* Roll the transaction */
 			xfs_defer_ijoin(&dfops, ip);
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 9d542dfe0052..d5be3f6a3e8f 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -228,7 +228,8 @@  struct xfs_efd_log_item	*xfs_trans_get_efd(struct xfs_trans *,
 				  uint);
 int		xfs_trans_free_extent(struct xfs_trans *,
 				      struct xfs_efd_log_item *, xfs_fsblock_t,
-				      xfs_extlen_t, struct xfs_owner_info *);
+				      xfs_extlen_t, struct xfs_owner_info *,
+				      bool);
 int		xfs_trans_commit(struct xfs_trans *);
 int		xfs_trans_roll(struct xfs_trans **);
 int		xfs_trans_roll_inode(struct xfs_trans **, struct xfs_inode *);
diff --git a/fs/xfs/xfs_trans_extfree.c b/fs/xfs/xfs_trans_extfree.c
index ab438647592a..cd2acfa4e562 100644
--- a/fs/xfs/xfs_trans_extfree.c
+++ b/fs/xfs/xfs_trans_extfree.c
@@ -68,7 +68,8 @@  xfs_trans_free_extent(
 	struct xfs_efd_log_item	*efdp,
 	xfs_fsblock_t		start_block,
 	xfs_extlen_t		ext_len,
-	struct xfs_owner_info	*oinfo)
+	struct xfs_owner_info	*oinfo,
+	bool			skip_discard)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
 	uint			next_extent;
@@ -80,7 +81,7 @@  xfs_trans_free_extent(
 	trace_xfs_bmap_free_deferred(tp->t_mountp, agno, 0, agbno, ext_len);
 
 	error = xfs_free_extent(tp, start_block, ext_len, oinfo,
-			XFS_AG_RESV_NONE);
+			XFS_AG_RESV_NONE, skip_discard);
 
 	/*
 	 * Mark the transaction dirty, even on error. This ensures the
@@ -195,7 +196,7 @@  xfs_extent_free_finish_item(
 	error = xfs_trans_free_extent(tp, done_item,
 			free->xefi_startblock,
 			free->xefi_blockcount,
-			&free->xefi_oinfo);
+			&free->xefi_oinfo, free->xefi_skip_discard);
 	kmem_free(free);
 	return error;
 }