diff mbox series

[6/7] xfs: don't ENOSPC on writeback when punching holes

Message ID 20181119210459.8506-7-david@fromorbit.com (mailing list archive)
State New, archived
Headers show
Series xfs: various fixes for 4.20 | expand

Commit Message

Dave Chinner Nov. 19, 2018, 9:04 p.m. UTC
From: Dave Chinner <dchinner@redhat.com>

Just tried to punch a 40T sparse file when enospc was triggered due
to extent size hints consuming more space than expected. It failed
with:

# sudo xfs_io -c "fpunch 0 40t" /mnt/fast/xfs-arekm.img
fallocate: No space left on device
#

because the writeback error of ENOSPC was being reported. We're
going to free that space, so we don't care if there was a ENOSPC
writeback error. So ignore ENOSPC errors and punch anyway.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bmap_util.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

Christoph Hellwig Nov. 20, 2018, 8:20 a.m. UTC | #1
On Tue, Nov 20, 2018 at 08:04:58AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Just tried to punch a 40T sparse file when enospc was triggered due
> to extent size hints consuming more space than expected. It failed
> with:
> 
> # sudo xfs_io -c "fpunch 0 40t" /mnt/fast/xfs-arekm.img
> fallocate: No space left on device
> #
> 
> because the writeback error of ENOSPC was being reported. We're
> going to free that space, so we don't care if there was a ENOSPC
> writeback error. So ignore ENOSPC errors and punch anyway.

How do even get -ENOSPC back from writeback?  It seems like we have
a much worse root cause lingering here.
Dave Chinner Nov. 20, 2018, 9:50 a.m. UTC | #2
On Tue, Nov 20, 2018 at 12:20:40AM -0800, Christoph Hellwig wrote:
> On Tue, Nov 20, 2018 at 08:04:58AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Just tried to punch a 40T sparse file when enospc was triggered due
> > to extent size hints consuming more space than expected. It failed
> > with:
> > 
> > # sudo xfs_io -c "fpunch 0 40t" /mnt/fast/xfs-arekm.img
> > fallocate: No space left on device
> > #
> > 
> > because the writeback error of ENOSPC was being reported. We're
> > going to free that space, so we don't care if there was a ENOSPC
> > writeback error. So ignore ENOSPC errors and punch anyway.
> 
> How do even get -ENOSPC back from writeback?  It seems like we have
> a much worse root cause lingering here.

It hammered ENOSPC pretty hard - I think it had consumed the entire
reserve pool and that can causes allocation transaction reservations
in xfs_iomap_write_allocate() to return ENOSPC even if we've got a
reservation for the data extents being allocated.

This doesn't happen very often in the real world, but if it does we
need operations like punch to work to be able to free space and
get us out of that hole...

Cheers,

Dave.
Christoph Hellwig Nov. 20, 2018, 4:28 p.m. UTC | #3
On Tue, Nov 20, 2018 at 08:50:42PM +1100, Dave Chinner wrote:
> > > because the writeback error of ENOSPC was being reported. We're
> > > going to free that space, so we don't care if there was a ENOSPC
> > > writeback error. So ignore ENOSPC errors and punch anyway.
> > 
> > How do even get -ENOSPC back from writeback?  It seems like we have
> > a much worse root cause lingering here.
> 
> It hammered ENOSPC pretty hard - I think it had consumed the entire
> reserve pool and that can causes allocation transaction reservations
> in xfs_iomap_write_allocate() to return ENOSPC even if we've got a
> reservation for the data extents being allocated.

Well, that means we do have a problem somewhere in our accounting,
as writeback should not run into ENOSPC.

> This doesn't happen very often in the real world, but if it does we
> need operations like punch to work to be able to free space and
> get us out of that hole...

I'm ok(-ish) with the patch, but I wish we could also sort out the
root cause..
Dave Chinner Nov. 20, 2018, 9 p.m. UTC | #4
On Tue, Nov 20, 2018 at 08:28:33AM -0800, Christoph Hellwig wrote:
> On Tue, Nov 20, 2018 at 08:50:42PM +1100, Dave Chinner wrote:
> > > > because the writeback error of ENOSPC was being reported. We're
> > > > going to free that space, so we don't care if there was a ENOSPC
> > > > writeback error. So ignore ENOSPC errors and punch anyway.
> > > 
> > > How do even get -ENOSPC back from writeback?  It seems like we have
> > > a much worse root cause lingering here.
> > 
> > It hammered ENOSPC pretty hard - I think it had consumed the entire
> > reserve pool and that can causes allocation transaction reservations
> > in xfs_iomap_write_allocate() to return ENOSPC even if we've got a
> > reservation for the data extents being allocated.
> 
> Well, that means we do have a problem somewhere in our accounting,
> as writeback should not run into ENOSPC.

Unwritten extent conversion consumes unreserved space when it splits
the BMBT and does all the other btree updates. If we do enough of
them at ENOSPC, we can consume the entire reservation pool.

In this case, I was creating a 25 million extent file using extent
size hints to try to get the number of extents down. It actually
resulted in the number of extents going up, because most of the
extents weren't fully written. IOWs, there was a /lot/ of partial
unwritten extent conversion going on in a very big extent tree.

> > This doesn't happen very often in the real world, but if it does we
> > need operations like punch to work to be able to free space and
> > get us out of that hole...
> 
> I'm ok(-ish) with the patch, but I wish we could also sort out the
> root cause..

Yeah, it's not great, but it did mean I didn't have to mkfs the
filesystem to get out of trouble. mount/unmount doesn't fix a
depleted reserve pool, only freeing space will do that, and we
should be able to punch space out of a file when at ENOSPC...

Cheers,

Dave.
Darrick J. Wong Nov. 21, 2018, 6:09 p.m. UTC | #5
On Tue, Nov 20, 2018 at 08:04:58AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Just tried to punch a 40T sparse file when enospc was triggered due
> to extent size hints consuming more space than expected. It failed
> with:
> 
> # sudo xfs_io -c "fpunch 0 40t" /mnt/fast/xfs-arekm.img
> fallocate: No space left on device
> #
> 
> because the writeback error of ENOSPC was being reported. We're
> going to free that space, so we don't care if there was a ENOSPC
> writeback error. So ignore ENOSPC errors and punch anyway.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_bmap_util.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 167ff4297e5c..cc7a0d47c529 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1060,8 +1060,13 @@ xfs_flush_unmap_range(
>  	start = round_down(offset, rounding);
>  	end = round_up(offset + len, rounding) - 1;
>  
> +	/*
> +	 * We're going to trash the data in this range, so if writeback reports
> +	 * an enospc error, don't let it stop us from /freeing the space/ in the
> +	 * range to alleviate the ENOSPC condition.
> +	 */
>  	error = filemap_write_and_wait_range(inode->i_mapping, start, end);
> -	if (error)
> +	if (error && error != -ENOSPC)

I think there's a bug here -- COLLAPSE_RANGE and INSERT_RANGE call
xfs_prepare_shift to writeback and invalidate the pagecache from a
selected offset to EOF, and _prepare_shift calls _flush_unmap_range.

So if we have a dirty 100 block file and we ask xfs to collapse 3 blocks
at block 17, it'll _prepare_shift blocks 20-99 before shifting them down
by 3 blocks.  If we instead asked to insert 3 blocks at block 17, it'll
_prepare_shift blocks 17-99 before shifting them up by 3 blocks.

Unlike the punch/reflink cases, the _prepare_shift ranges aren't doomed,
so the user might want to know if the write fails.

--D

>  		return error;
>  	truncate_pagecache_range(inode, start, end);
>  	return 0;
> -- 
> 2.19.1
>
Dave Chinner Nov. 22, 2018, 2:31 a.m. UTC | #6
On Wed, Nov 21, 2018 at 10:09:36AM -0800, Darrick J. Wong wrote:
> On Tue, Nov 20, 2018 at 08:04:58AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Just tried to punch a 40T sparse file when enospc was triggered due
> > to extent size hints consuming more space than expected. It failed
> > with:
> > 
> > # sudo xfs_io -c "fpunch 0 40t" /mnt/fast/xfs-arekm.img
> > fallocate: No space left on device
> > #
> > 
> > because the writeback error of ENOSPC was being reported. We're
> > going to free that space, so we don't care if there was a ENOSPC
> > writeback error. So ignore ENOSPC errors and punch anyway.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_bmap_util.c | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > index 167ff4297e5c..cc7a0d47c529 100644
> > --- a/fs/xfs/xfs_bmap_util.c
> > +++ b/fs/xfs/xfs_bmap_util.c
> > @@ -1060,8 +1060,13 @@ xfs_flush_unmap_range(
> >  	start = round_down(offset, rounding);
> >  	end = round_up(offset + len, rounding) - 1;
> >  
> > +	/*
> > +	 * We're going to trash the data in this range, so if writeback reports
> > +	 * an enospc error, don't let it stop us from /freeing the space/ in the
> > +	 * range to alleviate the ENOSPC condition.
> > +	 */
> >  	error = filemap_write_and_wait_range(inode->i_mapping, start, end);
> > -	if (error)
> > +	if (error && error != -ENOSPC)
> 
> I think there's a bug here -- COLLAPSE_RANGE and INSERT_RANGE call
> xfs_prepare_shift to writeback and invalidate the pagecache from a
> selected offset to EOF, and _prepare_shift calls _flush_unmap_range.
> 
> So if we have a dirty 100 block file and we ask xfs to collapse 3 blocks
> at block 17, it'll _prepare_shift blocks 20-99 before shifting them down
> by 3 blocks.  If we instead asked to insert 3 blocks at block 17, it'll
> _prepare_shift blocks 17-99 before shifting them up by 3 blocks.
> 
> Unlike the punch/reflink cases, the _prepare_shift ranges aren't doomed,
> so the user might want to know if the write fails.

Hmmm, yeah, the _prepare_shift changes came after this patch, so I
didn't consider that when originally writing this patch. Drop it for
now, will have to rethink what to do here....

Cheers,

Dave.
diff mbox series

Patch

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 167ff4297e5c..cc7a0d47c529 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1060,8 +1060,13 @@  xfs_flush_unmap_range(
 	start = round_down(offset, rounding);
 	end = round_up(offset + len, rounding) - 1;
 
+	/*
+	 * We're going to trash the data in this range, so if writeback reports
+	 * an enospc error, don't let it stop us from /freeing the space/ in the
+	 * range to alleviate the ENOSPC condition.
+	 */
 	error = filemap_write_and_wait_range(inode->i_mapping, start, end);
-	if (error)
+	if (error && error != -ENOSPC)
 		return error;
 	truncate_pagecache_range(inode, start, end);
 	return 0;