[RFC] xfs: flush posteof zeroing before reflink truncation
diff mbox series

Message ID 20181116190724.GW4235@magnolia
State New
Headers show
Series
  • [RFC] xfs: flush posteof zeroing before reflink truncation
Related show

Commit Message

Darrick J. Wong Nov. 16, 2018, 7:07 p.m. UTC
From: Darrick J. Wong <darrick.wong@oracle.com>

If we're remapping into a range that starts beyond EOF, we have to zero
the memory between EOF and the start of the target range, as established
in 410fdc72b05af.  However, in 4918ef4ea008, we extended the pagecache
truncation range downwards to a page boundary to guarantee that
pagecache pages are removed and that there's no possibility that we end
up zeroing subpage blocks within a page.  Unfortunately, we never commit
the posteof zeroing to disk, so on a filesystem where page size > block
size the truncation partially undoes the zeroing and we end up with
stale disk contents.

Brian and I reproduced this problem by running generic/091 on a 1k block
xfs filesystem, assuming fsx in fstests supports clone/dedupe/copyrange.

Fixes: 410fdc72b05a ("xfs: zero posteof blocks when cloning above eof")
Fixes: 4918ef4ea008 ("xfs: fix pagecache truncation prior to reflink")
Simultaneously-diagnosed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
Note: I haven't tested this thoroughly but wanted to push this out for
everyone to look at ASAP.
---
 fs/xfs/xfs_reflink.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Brian Foster Nov. 16, 2018, 7:31 p.m. UTC | #1
On Fri, Nov 16, 2018 at 11:07:24AM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> If we're remapping into a range that starts beyond EOF, we have to zero
> the memory between EOF and the start of the target range, as established
> in 410fdc72b05af.  However, in 4918ef4ea008, we extended the pagecache
> truncation range downwards to a page boundary to guarantee that
> pagecache pages are removed and that there's no possibility that we end
> up zeroing subpage blocks within a page.  Unfortunately, we never commit
> the posteof zeroing to disk, so on a filesystem where page size > block
> size the truncation partially undoes the zeroing and we end up with
> stale disk contents.
> 
> Brian and I reproduced this problem by running generic/091 on a 1k block
> xfs filesystem, assuming fsx in fstests supports clone/dedupe/copyrange.
> 

Note that I think we might have hit different manifestations of the same
problem. I actually reproduced an assert failure via shared/010 with the
writeback patch I just posted that checks that we don't write to shared
extents. A trace focused on the problematic range showed:

        fsstress-12966 [002] 14464.404582: xfs_zero_eof:         dev 253:3 ino 0x80009b isize 0x14070 disize 0x14070 offset 0x14070 count 655248
        fsstress-12966 [002] 14464.404583: xfs_ilock:            dev 253:3 ino 0x80009b flags ILOCK_EXCL caller xfs_file_iomap_begin
        fsstress-12966 [002] 14464.404585: xfs_reflink_trim_around_shared: dev 253:3 ino 0x80009b lblk 0x14 len 0x1 pblk 1048635 st 0
        fsstress-12966 [002] 14464.404641: xfs_iomap_found:      dev 253:3 ino 0x80009b size 0x14070 offset 0x14070 count 655248 type 0x0 startoff 0x14 startblock 1048635 blockcount 0x1
        fsstress-12966 [002] 14464.404642: xfs_iunlock:          dev 253:3 ino 0x80009b flags ILOCK_EXCL caller xfs_file_iomap_begin
        fsstress-12966 [002] 14464.404704: writeback_dirty_page: bdi 253:3: ino=8388763 index=20

... which implies that the eof zeroing associated with the remap dirties
the page that ends up being written back to a shared block. As a quick
hack, I just swapped the order of the zeroing and the existing flush
calls and that made the problem disappear.

> Fixes: 410fdc72b05a ("xfs: zero posteof blocks when cloning above eof")
> Fixes: 4918ef4ea008 ("xfs: fix pagecache truncation prior to reflink")
> Simultaneously-diagnosed-by: Brian Foster <bfoster@redhat.com>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
> Note: I haven't tested this thoroughly but wanted to push this out for
> everyone to look at ASAP.
> ---

This addresses the problem that I reproduce with shared/010 as well, and
looks Ok to me pending further testing:

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_reflink.c |    8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index c56bdbfcf7ae..8ea09a7e550c 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1255,13 +1255,19 @@ xfs_reflink_zero_posteof(
>  	loff_t			pos)
>  {
>  	loff_t			isize = i_size_read(VFS_I(ip));
> +	int			error;
>  
>  	if (pos <= isize)
>  		return 0;
>  
>  	trace_xfs_zero_eof(ip, isize, pos - isize);
> -	return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
> +	error = iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
>  			&xfs_iomap_ops);
> +	if (error)
> +		return error;
> +
> +	return filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
> +			isize, pos - 1);

Random nit: any particular reason we use ->i_data in the page truncate
call in xfs_reflink_remap_prep() and ->i_mapping everywhere else?

Brian

>  }
>  
>  /*
Dave Chinner Nov. 16, 2018, 8:37 p.m. UTC | #2
On Fri, Nov 16, 2018 at 11:07:24AM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> If we're remapping into a range that starts beyond EOF, we have to zero
> the memory between EOF and the start of the target range, as established
> in 410fdc72b05af.  However, in 4918ef4ea008, we extended the pagecache
> truncation range downwards to a page boundary to guarantee that
> pagecache pages are removed and that there's no possibility that we end
> up zeroing subpage blocks within a page.  Unfortunately, we never commit
> the posteof zeroing to disk, so on a filesystem where page size > block
> size the truncation partially undoes the zeroing and we end up with
> stale disk contents.
> 
> Brian and I reproduced this problem by running generic/091 on a 1k block
> xfs filesystem, assuming fsx in fstests supports clone/dedupe/copyrange.
> 
> Fixes: 410fdc72b05a ("xfs: zero posteof blocks when cloning above eof")
> Fixes: 4918ef4ea008 ("xfs: fix pagecache truncation prior to reflink")
> Simultaneously-diagnosed-by: Brian Foster <bfoster@redhat.com>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Ok, I have a different fix for this again.

> ---
> Note: I haven't tested this thoroughly but wanted to push this out for
> everyone to look at ASAP.
> ---
>  fs/xfs/xfs_reflink.c |    8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index c56bdbfcf7ae..8ea09a7e550c 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1255,13 +1255,19 @@ xfs_reflink_zero_posteof(
>  	loff_t			pos)
>  {
>  	loff_t			isize = i_size_read(VFS_I(ip));
> +	int			error;
>  
>  	if (pos <= isize)
>  		return 0;
>  
>  	trace_xfs_zero_eof(ip, isize, pos - isize);
> -	return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
> +	error = iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
>  			&xfs_iomap_ops);
> +	if (error)
> +		return error;
> +
> +	return filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
> +			isize, pos - 1);

This doesn't work on block size > page size setups, unfortunately.

Immediately after this we truncate the page cache, which also
doesn't do the right thing on block size > page cache setups.
So there's a couple of bugs here.

IMO, the truncate needs fixing, not the zeroing. Flushing after
zeroing leaves a potential landmine of other dirty data not getting
flushed properly before the truncate, so we should fix the truncate
to do a flush first. And we should fix it in a way that doesn't mean
we need to fix it again in the very near future. i.e. the patch
below that uses xfs_flush_unmap_range().

FWIW, I'm workng on cleaning up the ~10 patches I have for various
fsx and other corruption fixes so I can post them - it'll be monday
before I get that done - but if you're having fsx failures w/
copy/dedupe/clone on fsx I've probably already got a fix for it...

Cheers,

Dave.
Brian Foster Nov. 18, 2018, 2:12 p.m. UTC | #3
On Sat, Nov 17, 2018 at 07:37:56AM +1100, Dave Chinner wrote:
> On Fri, Nov 16, 2018 at 11:07:24AM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > If we're remapping into a range that starts beyond EOF, we have to zero
> > the memory between EOF and the start of the target range, as established
> > in 410fdc72b05af.  However, in 4918ef4ea008, we extended the pagecache
> > truncation range downwards to a page boundary to guarantee that
> > pagecache pages are removed and that there's no possibility that we end
> > up zeroing subpage blocks within a page.  Unfortunately, we never commit
> > the posteof zeroing to disk, so on a filesystem where page size > block
> > size the truncation partially undoes the zeroing and we end up with
> > stale disk contents.
> > 
> > Brian and I reproduced this problem by running generic/091 on a 1k block
> > xfs filesystem, assuming fsx in fstests supports clone/dedupe/copyrange.
> > 
> > Fixes: 410fdc72b05a ("xfs: zero posteof blocks when cloning above eof")
> > Fixes: 4918ef4ea008 ("xfs: fix pagecache truncation prior to reflink")
> > Simultaneously-diagnosed-by: Brian Foster <bfoster@redhat.com>
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Ok, I have a different fix for this again.
> 
> > ---
> > Note: I haven't tested this thoroughly but wanted to push this out for
> > everyone to look at ASAP.
> > ---
> >  fs/xfs/xfs_reflink.c |    8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index c56bdbfcf7ae..8ea09a7e550c 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -1255,13 +1255,19 @@ xfs_reflink_zero_posteof(
> >  	loff_t			pos)
> >  {
> >  	loff_t			isize = i_size_read(VFS_I(ip));
> > +	int			error;
> >  
> >  	if (pos <= isize)
> >  		return 0;
> >  
> >  	trace_xfs_zero_eof(ip, isize, pos - isize);
> > -	return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
> > +	error = iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
> >  			&xfs_iomap_ops);
> > +	if (error)
> > +		return error;
> > +
> > +	return filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
> > +			isize, pos - 1);
> 
> This doesn't work on block size > page size setups, unfortunately.
> 
> Immediately after this we truncate the page cache, which also
> doesn't do the right thing on block size > page cache setups.
> So there's a couple of bugs here.
> 
> IMO, the truncate needs fixing, not the zeroing. Flushing after
> zeroing leaves a potential landmine of other dirty data not getting
> flushed properly before the truncate, so we should fix the truncate
> to do a flush first. And we should fix it in a way that doesn't mean
> we need to fix it again in the very near future. i.e. the patch
> below that uses xfs_flush_unmap_range().
> 
> FWIW, I'm workng on cleaning up the ~10 patches I have for various
> fsx and other corruption fixes so I can post them - it'll be monday
> before I get that done - but if you're having fsx failures w/
> copy/dedupe/clone on fsx I've probably already got a fix for it...
> 

Ok, so FYI this doesn't actually address the writeback issue I
reproduced because the added flush targets the start of the destination
offset. Note again that I think this is distinct from the issue both you
and Darrick have documented in each commit log. Darrick's patch
addresses it because the flush targets the range that has been zeroed
(isize to dest offset) and thus potentially dirtied in cache. The
zeroing is what leads to sharing a block with an active dirty page (and
eventually leads to writeback of a page to a shared block).

A broader trace is shown below. Note that I don't think this particular
instance is a corruption scenario because we're remapping to the same
inode and the same block ends up at EOF again. It wouldn't surprise me
that much if we could manufacture some kind of corruption out of it, but
that would require more investigation.

        fsstress-990   [000] 84440.830311: xfs_getattr:          dev 253:3 ino 0x1800097
        fsstress-990   [000] 84440.830313: xfs_getattr:          dev 253:3 ino 0x1800097
        fsstress-990   [000] 84440.830317: xfs_ilock:            dev 253:3 ino 0x1800097 flags MMAPLOCK_EXCL caller xfs_reflink_remap_prep
        fsstress-990   [000] 84440.830325: xfs_update_time:      dev 253:3 ino 0x1800097
        fsstress-990   [000] 84440.830352: xfs_ilock:            dev 253:3 ino 0x1800097 flags ILOCK_EXCL caller xfs_vn_update_time
        fsstress-990   [000] 84440.830668: xfs_write_extent:     dev 253:3 ino 0x1800097 state  cur 0xffff8800da7626e0/0 offset 0 block 3146169 count 24 flag 0 caller xfs_inode_item_format_data_fork.isra.6
        fsstress-990   [000] 84440.830669: xfs_inode_pin:        dev 253:3 ino 0x1800097 count 1 pincount 0 caller xfs_cil_prepare_item.isra.4
        fsstress-990   [000] 84440.830674: xfs_iunlock:          dev 253:3 ino 0x1800097 flags ILOCK_EXCL caller xfs_trans_free_items
        fsstress-990   [000] 84440.830676: xfs_zero_eof:         dev 253:3 ino 0x1800097 isize 0x17995 disize 0x17995 offset 0x17995 count 1570411
        fsstress-990   [000] 84440.830677: xfs_ilock:            dev 253:3 ino 0x1800097 flags ILOCK_EXCL caller xfs_file_iomap_begin
        fsstress-990   [000] 84440.830678: xfs_iomap_found:      dev 253:3 ino 0x1800097 size 0x17995 offset 0x17995 count 1570411 type 0x0 startoff 0x0 startblock 3146169 blockcount 0x18
        fsstress-990   [000] 84440.830679: xfs_iunlock:          dev 253:3 ino 0x1800097 flags ILOCK_EXCL caller xfs_file_iomap_begin
        fsstress-990   [000] 84440.830684: writeback_dirty_page: bdi 253:3: ino=25165975 index=23
        fsstress-990   [000] 84440.830686: xfs_ilock:            dev 253:3 ino 0x1800097 flags ILOCK_EXCL caller xfs_file_iomap_begin
        fsstress-990   [000] 84440.830687: xfs_iunlock:          dev 253:3 ino 0x1800097 flags ILOCK_EXCL caller xfs_file_iomap_begin
        fsstress-990   [000] 84440.830690: xfs_ilock:            dev 253:3 ino 0x1800097 flags ILOCK_EXCL caller xfs_reflink_set_inode_flag
        fsstress-990   [000] 84440.830692: xfs_reflink_set_inode_flag: dev 253:3 ino 0x1800097
        fsstress-990   [000] 84440.830698: xfs_write_extent:     dev 253:3 ino 0x1800097 state  cur 0xffff8800da7626e0/0 offset 0 block 3146169 count 24 flag 0 caller xfs_inode_item_format_data_fork.isra.6
        fsstress-990   [000] 84440.830701: xfs_iunlock:          dev 253:3 ino 0x1800097 flags ILOCK_EXCL caller xfs_trans_free_items
        fsstress-990   [000] 84440.830704: bprint:               xfs_flush_unmap_range: 1056: ino 0x1800097 offset 0x197000 len 0x17995
        fsstress-990   [000] 84440.830709: xfs_reflink_remap_range: dev 253:3 count 96661 ino 0x1800097 isize 0x18000 disize 0x17995 offset 0x0 -> ino 0x1800097 isize 0x18000 disize 0x17995 offset 0x197000
	...
    kworker/u8:1-26905 [001] 84440.923735: xfs_writepage:        dev 253:3 ino 0x1800097 pgoff 0x17000 size 0x1ae995 offset 0 length 0
    kworker/u8:1-26905 [001] 84440.923736: xfs_ilock:            dev 253:3 ino 0x1800097 flags ILOCK_SHARED caller xfs_map_blocks
    kworker/u8:1-26905 [001] 84440.923736: xfs_iunlock:          dev 253:3 ino 0x1800097 flags ILOCK_SHARED caller xfs_map_blocks
    kworker/u8:1-26905 [001] 84440.923737: xfs_map_blocks_found: dev 253:3 ino 0x1800097 size 0x1ae995 offset 0x17000 count 4096 type 0x3 startoff 0x0 startblock 3146169 blockcount 0x18
    kworker/u8:1-26905 [001] 84440.957872: bprint:               xfs_do_writepage: 341: ino 0x1800097 offset 0x17000 agno 3 agbno 0x1d0 fbno 0x1d0 flen 0x1

Note that the last trace_printk() is tied to the assert failure. It
reflects that we've issued a writeback directly to a shared block.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> 
> xfs: flush removing page cache in xfs_reflink_remap_prep
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> On a sub-page block size filesystem, fsx is failing with a data
> corruption after a series of operations involving copying a file
> with the destination offset beyond EOF of the destination of the file:
> 
> 8093(157 mod 256): TRUNCATE DOWN        from 0x7a120 to 0x50000 ******WWWW
> 8094(158 mod 256): INSERT 0x25000 thru 0x25fff  (0x1000 bytes)
> 8095(159 mod 256): COPY 0x18000 thru 0x1afff    (0x3000 bytes) to 0x2f400
> 8096(160 mod 256): WRITE    0x5da00 thru 0x651ff        (0x7800 bytes) HOLE
> 8097(161 mod 256): COPY 0x2000 thru 0x5fff      (0x4000 bytes) to 0x6fc00
> 
> The second copy here is beyond EOF, and it is to sub-page (4k) but
> block aligned (1k) offset. The clone runs the EOF zeroing, landing
> in a pre-existing post-eof delalloc extent. This zeroes the post-eof
> extents in the page cache just fine, dirtying the pages correctly.
> 
> The problem is that xfs_reflink_remap_prep() now truncates the page
> cache over the range that it is copying it to, and rounds that down
> to cover the entire start page. This removes the dirty page over the
> delalloc extent from the page cache without having written it back.
> Hence later, when the page cache is flushed, the page at offset
> 0x6f000 has not been written back and hence exposes stale data,
> which fsx trips over less than 10 operations later.
> 
> Fix this by changing xfs_reflink_remap_prep() to use
> xfs_flush_unmap_range().
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_bmap_util.c | 2 +-
>  fs/xfs/xfs_bmap_util.h | 3 +++
>  fs/xfs/xfs_reflink.c   | 7 +++----
>  3 files changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index cc7a0d47c529..3e66cf0520a9 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1042,7 +1042,7 @@ xfs_unmap_extent(
>  	goto out_unlock;
>  }
>  
> -static int
> +int
>  xfs_flush_unmap_range(
>  	struct xfs_inode	*ip,
>  	xfs_off_t		offset,
> diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
> index 87363d136bb6..7a78229cf1a7 100644
> --- a/fs/xfs/xfs_bmap_util.h
> +++ b/fs/xfs/xfs_bmap_util.h
> @@ -80,4 +80,7 @@ int xfs_bmap_count_blocks(struct xfs_trans *tp, struct xfs_inode *ip,
>  			  int whichfork, xfs_extnum_t *nextents,
>  			  xfs_filblks_t *count);
>  
> +int	xfs_flush_unmap_range(struct xfs_inode *ip, xfs_off_t offset,
> +			      xfs_off_t len);
> +
>  #endif	/* __XFS_BMAP_UTIL_H__ */
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index ecdb086bc23e..bd9135afe3f4 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1351,10 +1351,9 @@ xfs_reflink_remap_prep(
>  	if (ret)
>  		goto out_unlock;
>  
> -	/* Zap any page cache for the destination file's range. */
> -	truncate_inode_pages_range(&inode_out->i_data,
> -			round_down(pos_out, PAGE_SIZE),
> -			round_up(pos_out + *len, PAGE_SIZE) - 1);
> +	ret = xfs_flush_unmap_range(dest, pos_out, *len);
> +	if (ret)
> +		goto out_unlock;
>  
>  	return 1;
>  out_unlock:
Dave Chinner Nov. 19, 2018, 12:26 a.m. UTC | #4
On Sun, Nov 18, 2018 at 09:12:06AM -0500, Brian Foster wrote:
> On Sat, Nov 17, 2018 at 07:37:56AM +1100, Dave Chinner wrote:
> > On Fri, Nov 16, 2018 at 11:07:24AM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > If we're remapping into a range that starts beyond EOF, we have to zero
> > > the memory between EOF and the start of the target range, as established
> > > in 410fdc72b05af.  However, in 4918ef4ea008, we extended the pagecache
> > > truncation range downwards to a page boundary to guarantee that
> > > pagecache pages are removed and that there's no possibility that we end
> > > up zeroing subpage blocks within a page.  Unfortunately, we never commit
> > > the posteof zeroing to disk, so on a filesystem where page size > block
> > > size the truncation partially undoes the zeroing and we end up with
> > > stale disk contents.
> > > 
> > > Brian and I reproduced this problem by running generic/091 on a 1k block
> > > xfs filesystem, assuming fsx in fstests supports clone/dedupe/copyrange.
> > > 
> > > Fixes: 410fdc72b05a ("xfs: zero posteof blocks when cloning above eof")
> > > Fixes: 4918ef4ea008 ("xfs: fix pagecache truncation prior to reflink")
> > > Simultaneously-diagnosed-by: Brian Foster <bfoster@redhat.com>
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Ok, I have a different fix for this again.
> > 
> > > ---
> > > Note: I haven't tested this thoroughly but wanted to push this out for
> > > everyone to look at ASAP.
> > > ---
> > >  fs/xfs/xfs_reflink.c |    8 +++++++-
> > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > index c56bdbfcf7ae..8ea09a7e550c 100644
> > > --- a/fs/xfs/xfs_reflink.c
> > > +++ b/fs/xfs/xfs_reflink.c
> > > @@ -1255,13 +1255,19 @@ xfs_reflink_zero_posteof(
> > >  	loff_t			pos)
> > >  {
> > >  	loff_t			isize = i_size_read(VFS_I(ip));
> > > +	int			error;
> > >  
> > >  	if (pos <= isize)
> > >  		return 0;
> > >  
> > >  	trace_xfs_zero_eof(ip, isize, pos - isize);
> > > -	return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
> > > +	error = iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
> > >  			&xfs_iomap_ops);
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	return filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
> > > +			isize, pos - 1);
> > 
> > This doesn't work on block size > page size setups, unfortunately.
> > 
> > Immediately after this we truncate the page cache, which also
> > doesn't do the right thing on block size > page cache setups.
> > So there's a couple of bugs here.
> > 
> > IMO, the truncate needs fixing, not the zeroing. Flushing after
> > zeroing leaves a potential landmine of other dirty data not getting
> > flushed properly before the truncate, so we should fix the truncate
> > to do a flush first. And we should fix it in a way that doesn't mean
> > we need to fix it again in the very near future. i.e. the patch
> > below that uses xfs_flush_unmap_range().
> > 
> > FWIW, I'm workng on cleaning up the ~10 patches I have for various
> > fsx and other corruption fixes so I can post them - it'll be monday
> > before I get that done - but if you're having fsx failures w/
> > copy/dedupe/clone on fsx I've probably already got a fix for it...
> > 
> 
> Ok, so FYI this doesn't actually address the writeback issue I
> reproduced because the added flush targets the start of the destination
> offset.

Oh, sorry, I didn't notice that difference. Fixed that now. That
might actually be one(*) of the fsx bugs I've been chasing for
several days.

> Note again that I think this is distinct from the issue both you
> and Darrick have documented in each commit log. Darrick's patch
> addresses it because the flush targets the range that has been zeroed
> (isize to dest offset) and thus potentially dirtied in cache. The
> zeroing is what leads to sharing a block with an active dirty page (and
> eventually leads to writeback of a page to a shared block).

Yes, we may be seeing *different symptoms* but the underlying
problem is the same - we are not writing back pages over the range
we are about to share, and so we don't trigger a COW on the range
before writeback occurs.

IMO, using xfs_flush_unmap_range() is still the right change to make
here, even if my initial patch didn't address this specific problem
but a different flush/inval problem with this code.

Cheers.

Dave.

(*) I've still got several different fsx variants that fail on either
default configs and/or 1k block size with different signatures.
Problem is they take between 370,000 ops and 5 million ops to
trigger, and so generate tens to hundreds of GB of trace data....

e.g. on a default 4k filesystem on pmem, this fails after 377k ops:

# ltp/fsx -q -p50000 -l 500000 -r 4096 -t 512 -w 512 -Z -R -W /mnt/scratch/foo
....
READ BAD DATA: offset = 0x31000, size = 0xa000, fname = /mnt/scratch/foo
OFFSET  GOOD    BAD     RANGE
0x36000 0x9084  0x7940  0x00000
....

and this slight variant (buffered IO rather than direct IO) fails
after 2.17 million ops:

# ltp/fsx -q -p50000 -l 500000 -r 4096 -t 512 -w 512 -R -W /mnt/scratch/foo
....
READ BAD DATA: offset = 0xf000, size = 0xf318, fname = /mnt/scratch/foo
OFFSET  GOOD    BAD     RANGE
0x15000 0x990b  0x6c0b  0x00000
....

I'm also seeing MAPREAD failures with data after EOF on several
different configs, and there's a couple of other failures that show
up every so often, too.

If I turn off copy/dedupe/clone file range, they all run for
billions of ops without failure....
Brian Foster Nov. 19, 2018, 7:05 p.m. UTC | #5
On Mon, Nov 19, 2018 at 11:26:11AM +1100, Dave Chinner wrote:
> On Sun, Nov 18, 2018 at 09:12:06AM -0500, Brian Foster wrote:
> > On Sat, Nov 17, 2018 at 07:37:56AM +1100, Dave Chinner wrote:
> > > On Fri, Nov 16, 2018 at 11:07:24AM -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > If we're remapping into a range that starts beyond EOF, we have to zero
> > > > the memory between EOF and the start of the target range, as established
> > > > in 410fdc72b05af.  However, in 4918ef4ea008, we extended the pagecache
> > > > truncation range downwards to a page boundary to guarantee that
> > > > pagecache pages are removed and that there's no possibility that we end
> > > > up zeroing subpage blocks within a page.  Unfortunately, we never commit
> > > > the posteof zeroing to disk, so on a filesystem where page size > block
> > > > size the truncation partially undoes the zeroing and we end up with
> > > > stale disk contents.
> > > > 
> > > > Brian and I reproduced this problem by running generic/091 on a 1k block
> > > > xfs filesystem, assuming fsx in fstests supports clone/dedupe/copyrange.
> > > > 
> > > > Fixes: 410fdc72b05a ("xfs: zero posteof blocks when cloning above eof")
> > > > Fixes: 4918ef4ea008 ("xfs: fix pagecache truncation prior to reflink")
> > > > Simultaneously-diagnosed-by: Brian Foster <bfoster@redhat.com>
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Ok, I have a different fix for this again.
> > > 
> > > > ---
> > > > Note: I haven't tested this thoroughly but wanted to push this out for
> > > > everyone to look at ASAP.
> > > > ---
> > > >  fs/xfs/xfs_reflink.c |    8 +++++++-
> > > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > > index c56bdbfcf7ae..8ea09a7e550c 100644
> > > > --- a/fs/xfs/xfs_reflink.c
> > > > +++ b/fs/xfs/xfs_reflink.c
> > > > @@ -1255,13 +1255,19 @@ xfs_reflink_zero_posteof(
> > > >  	loff_t			pos)
> > > >  {
> > > >  	loff_t			isize = i_size_read(VFS_I(ip));
> > > > +	int			error;
> > > >  
> > > >  	if (pos <= isize)
> > > >  		return 0;
> > > >  
> > > >  	trace_xfs_zero_eof(ip, isize, pos - isize);
> > > > -	return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
> > > > +	error = iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
> > > >  			&xfs_iomap_ops);
> > > > +	if (error)
> > > > +		return error;
> > > > +
> > > > +	return filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
> > > > +			isize, pos - 1);
> > > 
> > > This doesn't work on block size > page size setups, unfortunately.
> > > 
> > > Immediately after this we truncate the page cache, which also
> > > doesn't do the right thing on block size > page cache setups.
> > > So there's a couple of bugs here.
> > > 
> > > IMO, the truncate needs fixing, not the zeroing. Flushing after
> > > zeroing leaves a potential landmine of other dirty data not getting
> > > flushed properly before the truncate, so we should fix the truncate
> > > to do a flush first. And we should fix it in a way that doesn't mean
> > > we need to fix it again in the very near future. i.e. the patch
> > > below that uses xfs_flush_unmap_range().
> > > 
> > > FWIW, I'm workng on cleaning up the ~10 patches I have for various
> > > fsx and other corruption fixes so I can post them - it'll be monday
> > > before I get that done - but if you're having fsx failures w/
> > > copy/dedupe/clone on fsx I've probably already got a fix for it...
> > > 
> > 
> > Ok, so FYI this doesn't actually address the writeback issue I
> > reproduced because the added flush targets the start of the destination
> > offset.
> 
> Oh, sorry, I didn't notice that difference. Fixed that now. That
> might actually be one(*) of the fsx bugs I've been chasing for
> several days.
> 
> > Note again that I think this is distinct from the issue both you
> > and Darrick have documented in each commit log. Darrick's patch
> > addresses it because the flush targets the range that has been zeroed
> > (isize to dest offset) and thus potentially dirtied in cache. The
> > zeroing is what leads to sharing a block with an active dirty page (and
> > eventually leads to writeback of a page to a shared block).
> 
> Yes, we may be seeing *different symptoms* but the underlying
> problem is the same - we are not writing back pages over the range
> we are about to share, and so we don't trigger a COW on the range
> before writeback occurs.
> 

Yes, I'm just pointing out there was still a gap between the two
patches.

> IMO, using xfs_flush_unmap_range() is still the right change to make
> here, even if my initial patch didn't address this specific problem
> but a different flush/inval problem with this code.
> 
> Cheers.
> 
> Dave.
> 
> (*) I've still got several different fsx variants that fail on either
> default configs and/or 1k block size with different signatures.
> Problem is they take between 370,000 ops and 5 million ops to
> trigger, and so generate tens to hundreds of GB of trace data....
> 

Have you tried 1.) further reducing the likely unrelated operations
(i.e., fallocs, insert/collapse range, etc.) from the test and 2.)
manually trimming down and replaying the op record file fsx dumps out on
failure?  I usually don't bother with fs level tracing for this kind of
thing until I get a repeatable and somewhat manageable set of operations
to work with.

Brian

> e.g. on a default 4k filesystem on pmem, this fails after 377k ops:
> 
> # ltp/fsx -q -p50000 -l 500000 -r 4096 -t 512 -w 512 -Z -R -W /mnt/scratch/foo
> ....
> READ BAD DATA: offset = 0x31000, size = 0xa000, fname = /mnt/scratch/foo
> OFFSET  GOOD    BAD     RANGE
> 0x36000 0x9084  0x7940  0x00000
> ....
> 
> and this slight variant (buffered IO rather than direct IO) fails
> after 2.17 million ops:
> 
> # ltp/fsx -q -p50000 -l 500000 -r 4096 -t 512 -w 512 -R -W /mnt/scratch/foo
> ....
> READ BAD DATA: offset = 0xf000, size = 0xf318, fname = /mnt/scratch/foo
> OFFSET  GOOD    BAD     RANGE
> 0x15000 0x990b  0x6c0b  0x00000
> ....
> 
> I'm also seeing MAPREAD failures with data after EOF on several
> different configs, and there's a couple of other failures that show
> up every so often, too.
> 
> If I turn off copy/dedupe/clone file range, they all run for
> billions of ops without failure....
> 
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Nov. 19, 2018, 10:04 p.m. UTC | #6
On Mon, Nov 19, 2018 at 02:05:13PM -0500, Brian Foster wrote:
> On Mon, Nov 19, 2018 at 11:26:11AM +1100, Dave Chinner wrote:
> > (*) I've still got several different fsx variants that fail on either
> > default configs and/or 1k block size with different signatures.
> > Problem is they take between 370,000 ops and 5 million ops to
> > trigger, and so generate tens to hundreds of GB of trace data....
> > 
> 
> Have you tried 1.) further reducing the likely unrelated operations
> (i.e., fallocs, insert/collapse range, etc.) from the test

Yes. The test cases I have cut out all the unnecessary ops.

Oh, look, I just found a new failure on a default 4k block size
filesystem:

# src/xfstests-dev/ltp/fsx -q -p 10000  -o 128000   -l 500000 -r 4096 -t 512 -w 512 -Z -R -W -F -H -z -C -I  /mnt/scratch/foo
20000 clone     from 0x46000 to 0x48000, (0x2000 bytes) at 0x2c000
100000 clone    from 0x44000 to 0x51000, (0xd000 bytes) at 0x1e000
110000 clone    from 0x54000 to 0x5b000, (0x7000 bytes) at 0xf000
READ BAD DATA: offset = 0x1000, size = 0xb000, fname = /mnt/scratch/foo
OFFSET  GOOD    BAD     RANGE
0x07000 0xa2d9  0x711b  0x00000
....


> and 2.)
> manually trimming down and replaying the op record file fsx dumps out on
> failure?

I've mostly been unable to get that to reliably reproduce the
problems. The failures I'm getting smell like race conditions -
turning on tracing makes a couple of them go away - and I haven't
found a reliable set of cut-down ops to reproduce them.

> I usually don't bother with fs level tracing for this kind of
> thing until I get a repeatable and somewhat manageable set of operations
> to work with.

Neither do I, but there's little choice when the failures aren't
reliable.

Cheers,

Dave.
Dave Chinner Nov. 19, 2018, 10:07 p.m. UTC | #7
On Tue, Nov 20, 2018 at 09:04:22AM +1100, Dave Chinner wrote:
> On Mon, Nov 19, 2018 at 02:05:13PM -0500, Brian Foster wrote:
> > On Mon, Nov 19, 2018 at 11:26:11AM +1100, Dave Chinner wrote:
> > > (*) I've still got several different fsx variants that fail on either
> > > default configs and/or 1k block size with different signatures.
> > > Problem is they take between 370,000 ops and 5 million ops to
> > > trigger, and so generate tens to hundreds of GB of trace data....
> > > 
> > 
> > Have you tried 1.) further reducing the likely unrelated operations
> > (i.e., fallocs, insert/collapse range, etc.) from the test
> 
> Yes. The test cases I have cut out all the unnecessary ops.
> 
> Oh, look, I just found a new failure on a default 4k block size
> filesystem:
> 
> # src/xfstests-dev/ltp/fsx -q -p 10000  -o 128000   -l 500000 -r 4096 -t 512 -w 512 -Z -R -W -F -H -z -C -I  /mnt/scratch/foo
> 20000 clone     from 0x46000 to 0x48000, (0x2000 bytes) at 0x2c000
> 100000 clone    from 0x44000 to 0x51000, (0xd000 bytes) at 0x1e000
> 110000 clone    from 0x54000 to 0x5b000, (0x7000 bytes) at 0xf000
> READ BAD DATA: offset = 0x1000, size = 0xb000, fname = /mnt/scratch/foo
> OFFSET  GOOD    BAD     RANGE
> 0x07000 0xa2d9  0x711b  0x00000
> ....

And just to reinforce that these are unreliable failures, the second
time I ran the above command:

$ src/xfstests-dev/ltp/fsx -q -p 10000  -o 128000   -l 500000 -r 4096 -t 512 -w 512 -Z -R -W -F -H -z -C -I  /mnt/scratch/foo
20000 clone     from 0x46000 to 0x48000, (0x2000 bytes) at 0x2c000
100000 clone    from 0x44000 to 0x51000, (0xd000 bytes) at 0x1e000
110000 clone    from 0x54000 to 0x5b000, (0x7000 bytes) at 0xf000
120000 clone    from 0x3a000 to 0x3e000, (0x4000 bytes) at 0x10000
140000 trunc    from 0x62400 to 0xd800
190000 trunc    from 0x56800 to 0x78e00
READ BAD DATA: offset = 0x2c000, size = 0x1c000, fname = /mnt/scratch/foo
OFFSET  GOOD    BAD     RANGE
0x32000 0xc81d  0x9289  0x00000
.....

It failed somewhere else. That's the difficulty I'm having here -
the failures aren't reliable.

Cheers,

Dave.

Patch
diff mbox series

diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index c56bdbfcf7ae..8ea09a7e550c 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1255,13 +1255,19 @@  xfs_reflink_zero_posteof(
 	loff_t			pos)
 {
 	loff_t			isize = i_size_read(VFS_I(ip));
+	int			error;
 
 	if (pos <= isize)
 		return 0;
 
 	trace_xfs_zero_eof(ip, isize, pos - isize);
-	return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
+	error = iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
 			&xfs_iomap_ops);
+	if (error)
+		return error;
+
+	return filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
+			isize, pos - 1);
 }
 
 /*