xfs: zero posteof blocks when cloning above eof

Message-ID: <20181003020342.GD19324@magnolia>
State: Superseded
Date: Tue, 2 Oct 2018 19:03:42 -0700
From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Eric Sandeen <sandeen@redhat.com>, xfs <linux-xfs@vger.kernel.org>, zlang@redhat.com
Subject: [PATCH] xfs: zero posteof blocks when cloning above eof

Darrick J. Wong Oct. 3, 2018, 2:03 a.m. UTC

From: Darrick J. Wong <darrick.wong@oracle.com>

When we're reflinking between two files and the destination file range
is well beyond the destination file's EOF marker, zero any posteof
speculative preallocations in the destination file so that we don't
expose stale disk contents.  The previous strategy of trying to clear
the preallocations does not work if the destination file has the
PREALLOC flag set but no delalloc blocks.

Uncovered by shared/010.

Reported-by: Zorro Lang <zlang@redhat.com>
Bugzilla-id: https://bugzilla.kernel.org/show_bug.cgi?id=201259
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_reflink.c |   34 ++++++++++++++++++++++++++--------
 1 file changed, 26 insertions(+), 8 deletions(-)

Eric Sandeen Oct. 3, 2018, 12:11 p.m. UTC | #1

On 10/2/18 9:03 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> When we're reflinking between two files and the destination file range
> is well beyond the destination file's EOF marker, zero any posteof
> speculative preallocations in the destination file so that we don't
> expose stale disk contents.  The previous strategy of trying to clear
> the preallocations does not work if the destination file has the
> PREALLOC flag set but no delalloc blocks.
> 
> Uncovered by shared/010.
> 
> Reported-by: Zorro Lang <zlang@redhat.com>
> Bugzilla-id: https://bugzilla.kernel.org/show_bug.cgi?id=201259
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

The action makes sense, and this does resolve my simple testcase,
and makes shared/010 pass for me as well.

However, this makes my correctness spidey-sense tingle; why is there
a new helper unique to extending reflinks, when extending writes already
must do the same thing?  I didn't follow all the discussion on IRC,
but might be worth explaining on the list for others as well.  Are there
any other extending write tests that aren't happening for extending reflink?

-Eric

> ---
>  fs/xfs/xfs_reflink.c |   34 ++++++++++++++++++++++++++--------
>  1 file changed, 26 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 38f405415b88..c8e996a99a74 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1195,6 +1195,27 @@ xfs_iolock_two_inodes_and_break_layout(
>  	return 0;
>  }
>  
> +/*
> + * If we're reflinking to a point past the destination file's EOF, we must
> + * zero any speculative post-EOF preallocations that sit between the old EOF
> + * and the destination file offset.
> + */
> +static int
> +xfs_reflink_zero_posteof(
> +	struct xfs_inode	*ip,
> +	loff_t			pos)
> +{
> +	loff_t			isize = i_size_read(VFS_I(ip));
> +	bool			did_zeroing = false;
> +
> +	if (pos <= isize)
> +		return 0;
> +
> +	trace_xfs_zero_eof(ip, isize, pos - isize);
> +	return iomap_zero_range(VFS_I(ip), isize, pos - isize, &did_zeroing,
> +			&xfs_iomap_ops);
> +}
> +
>  /*
>   * Link a range of blocks from one file to another.
>   */
> @@ -1257,15 +1278,12 @@ xfs_reflink_remap_range(
>  	trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
>  
>  	/*
> -	 * Clear out post-eof preallocations because we don't have page cache
> -	 * backing the delayed allocations and they'll never get freed on
> -	 * their own.
> +	 * Zero existing post-eof speculative preallocations in the destination
> +	 * file.
>  	 */
> -	if (xfs_can_free_eofblocks(dest, true)) {
> -		ret = xfs_free_eofblocks(dest);
> -		if (ret)
> -			goto out_unlock;
> -	}
> +	ret = xfs_reflink_zero_posteof(dest, pos_out);
> +	if (ret)
> +		goto out_unlock;
>  
>  	/* Set flags and remap blocks. */
>  	ret = xfs_reflink_set_inode_flag(src, dest);
>
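
The range arithmetic in the quoted helper can be modelled in userspace. What follows is a minimal sketch, not kernel code: struct toy_file and toy_zero_posteof() are hypothetical names, and a plain byte array stands in for the blocks that sit behind the destination file. It only illustrates that bytes between the old EOF and the clone offset are zeroed, and that nothing is touched when the clone offset is at or below EOF.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical in-memory stand-in for a file. "size" models i_size and
 * "blocks" models allocated space that may hold stale bytes past EOF.
 */
struct toy_file {
	long long	size;
	unsigned char	blocks[64];
};

/*
 * Userspace model of the check in the quoted helper: if pos is beyond
 * EOF, zero the byte range [size, pos). Returns 0 on success and -1 if
 * pos does not fit in the toy backing array (an assumption of this
 * model only; the real code has no such limit).
 */
static int
toy_zero_posteof(struct toy_file *f, long long pos)
{
	long long	isize;

	assert(f != NULL);
	isize = f->size;

	if (pos <= isize)
		return 0;
	if (pos > (long long)sizeof(f->blocks))
		return -1;

	memset(f->blocks + isize, 0, (size_t)(pos - isize));
	return 0;
}
```

Filling blocks with a nonzero pattern, setting size to 8 and calling toy_zero_posteof(&f, 32) leaves bytes 0-7 and 32-63 alone and zeroes bytes 8-31, which is the region a later read could otherwise expose once the clone extends the file.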

Brian Foster Oct. 3, 2018, 12:20 p.m. UTC | #2

On Tue, Oct 02, 2018 at 07:03:42PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> When we're reflinking between two files and the destination file range
> is well beyond the destination file's EOF marker, zero any posteof
> speculative preallocations in the destination file so that we don't
> expose stale disk contents.  The previous strategy of trying to clear
> the preallocations does not work if the destination file has the
> PREALLOC flag set but no delalloc blocks.
> 
> Uncovered by shared/010.
> 
> Reported-by: Zorro Lang <zlang@redhat.com>
> Bugzilla-id: https://bugzilla.kernel.org/show_bug.cgi?id=201259
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/xfs_reflink.c |   34 ++++++++++++++++++++++++++--------
>  1 file changed, 26 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 38f405415b88..c8e996a99a74 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1195,6 +1195,27 @@ xfs_iolock_two_inodes_and_break_layout(
>  	return 0;
>  }
>  
> +/*
> + * If we're reflinking to a point past the destination file's EOF, we must
> + * zero any speculative post-EOF preallocations that sit between the old EOF
> + * and the destination file offset.
> + */
> +static int
> +xfs_reflink_zero_posteof(
> +	struct xfs_inode	*ip,
> +	loff_t			pos)
> +{
> +	loff_t			isize = i_size_read(VFS_I(ip));
> +	bool			did_zeroing = false;
> +
> +	if (pos <= isize)
> +		return 0;
> +
> +	trace_xfs_zero_eof(ip, isize, pos - isize);
> +	return iomap_zero_range(VFS_I(ip), isize, pos - isize, &did_zeroing,
> +			&xfs_iomap_ops);

iomap_zero_range() accepts NULL for the *did_zero param. Otherwise seems
fine, barring Eric's question on whether we need additional checks..

Brian

> +}
> +
>  /*
>   * Link a range of blocks from one file to another.
>   */
> @@ -1257,15 +1278,12 @@ xfs_reflink_remap_range(
>  	trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
>  
>  	/*
> -	 * Clear out post-eof preallocations because we don't have page cache
> -	 * backing the delayed allocations and they'll never get freed on
> -	 * their own.
> +	 * Zero existing post-eof speculative preallocations in the destination
> +	 * file.
>  	 */
> -	if (xfs_can_free_eofblocks(dest, true)) {
> -		ret = xfs_free_eofblocks(dest);
> -		if (ret)
> -			goto out_unlock;
> -	}
> +	ret = xfs_reflink_zero_posteof(dest, pos_out);
> +	if (ret)
> +		goto out_unlock;
>  
>  	/* Set flags and remap blocks. */
>  	ret = xfs_reflink_set_inode_flag(src, dest);
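
Brian's remark about the did_zero parameter describes an optional out-parameter. As a rough userspace illustration (toy_zero_range() is a hypothetical function, not the kernel's iomap_zero_range(), and its bounds checks are assumptions of this sketch), the pattern looks like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical userspace analogue of a zeroing routine whose last
 * argument is optional. Callers that do not care whether any bytes
 * were zeroed pass NULL instead of declaring a throwaway variable.
 */
static int
toy_zero_range(unsigned char *buf, size_t buflen, size_t pos, size_t len,
	       bool *did_zero)
{
	assert(buf != NULL);

	/* Reject ranges that fall outside the buffer. */
	if (pos > buflen || len > buflen - pos)
		return -1;
	if (len == 0)
		return 0;

	memset(buf + pos, 0, len);

	/* Report back only if the caller asked for it. */
	if (did_zero)
		*did_zero = true;
	return 0;
}
```

With this shape, a caller such as the new reflink helper would not need the local did_zeroing variable at all.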

Darrick J. Wong Oct. 3, 2018, 3:12 p.m. UTC | #3

On Wed, Oct 03, 2018 at 07:11:14AM -0500, Eric Sandeen wrote:
> On 10/2/18 9:03 PM, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > When we're reflinking between two files and the destination file range
> > is well beyond the destination file's EOF marker, zero any posteof
> > speculative preallocations in the destination file so that we don't
> > expose stale disk contents.  The previous strategy of trying to clear
> > the preallocations does not work if the destination file has the
> > PREALLOC flag set but no delalloc blocks.
> > 
> > Uncovered by shared/010.
> > 
> > Reported-by: Zorro Lang <zlang@redhat.com>
> > Bugzilla-id: https://bugzilla.kernel.org/show_bug.cgi?id=201259
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> The action makes sense, and this does resolve my simple testcase,
> and makes shared/010 pass for me as well.
> 
> However, this makes my correctness spidey-sense tingle; why is there a
> new helper unique to extending reflinks, when extending writes already
> must do the same thing?

I think you're referring to Dave's earlier question of "Why don't you
just use xfs_file_aio_write_checks?"

It's tempting to adapt xfs_file_aio_write_checks for reflink, but I
think I have to create a new function because (a) we don't have a kiocb
to pass in, and (b) we have to lock two inodes for reflink while abiding
the [VX]FS inode locking rules and making sure we break the destination
file's layout correctly.

> I didn't follow all the discussion on IRC, but might be worth
> explaining on the list for others as well.  Are there any other
> extending write tests that aren't happening for extending reflink?

Yes, there are a number of behavioral inconsistencies between regular
write and clonerange that have been discovered in the past few days, and
it's going to take me a few days to clean all of this up:

- Lack of file_update_time(), though the ctime update is open-coded in
  the reflink routines.

- Lack of file_remove_privs() to drop suid and capabilities on write.
  Totally missing from the btrfs implementation and xfs/ocfs2 followed
  that behavior warts and all.

- Lack of RLIMIT_FSIZE checking: D'oh.  Same lame excuse as above.

- Lack of MAX_NON_LFS size checking: Same.

- Lack of s_maxbytes checking: Same.  Alarming since this means we can
  reflink to offsets the pagecache doesn't support.

- Should our clonerange return bytes reflinked to copy_file_range?

That last one requires more careful consideration & will take longer;
the first two are nearly ready.

--D

> -Eric
> 
> > ---
> >  fs/xfs/xfs_reflink.c |   34 ++++++++++++++++++++++++++--------
> >  1 file changed, 26 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index 38f405415b88..c8e996a99a74 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -1195,6 +1195,27 @@ xfs_iolock_two_inodes_and_break_layout(
> >  	return 0;
> >  }
> >  
> > +/*
> > + * If we're reflinking to a point past the destination file's EOF, we must
> > + * zero any speculative post-EOF preallocations that sit between the old EOF
> > + * and the destination file offset.
> > + */
> > +static int
> > +xfs_reflink_zero_posteof(
> > +	struct xfs_inode	*ip,
> > +	loff_t			pos)
> > +{
> > +	loff_t			isize = i_size_read(VFS_I(ip));
> > +	bool			did_zeroing = false;
> > +
> > +	if (pos <= isize)
> > +		return 0;
> > +
> > +	trace_xfs_zero_eof(ip, isize, pos - isize);
> > +	return iomap_zero_range(VFS_I(ip), isize, pos - isize, &did_zeroing,
> > +			&xfs_iomap_ops);
> > +}
> > +
> >  /*
> >   * Link a range of blocks from one file to another.
> >   */
> > @@ -1257,15 +1278,12 @@ xfs_reflink_remap_range(
> >  	trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
> >  
> >  	/*
> > -	 * Clear out post-eof preallocations because we don't have page cache
> > -	 * backing the delayed allocations and they'll never get freed on
> > -	 * their own.
> > +	 * Zero existing post-eof speculative preallocations in the destination
> > +	 * file.
> >  	 */
> > -	if (xfs_can_free_eofblocks(dest, true)) {
> > -		ret = xfs_free_eofblocks(dest);
> > -		if (ret)
> > -			goto out_unlock;
> > -	}
> > +	ret = xfs_reflink_zero_posteof(dest, pos_out);
> > +	if (ret)
> > +		goto out_unlock;
> >  
> >  	/* Set flags and remap blocks. */
> >  	ret = xfs_reflink_set_inode_flag(src, dest);
> >

Darrick J. Wong Oct. 3, 2018, 3:18 p.m. UTC | #4

On Wed, Oct 03, 2018 at 08:20:27AM -0400, Brian Foster wrote:
> On Tue, Oct 02, 2018 at 07:03:42PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > When we're reflinking between two files and the destination file range
> > is well beyond the destination file's EOF marker, zero any posteof
> > speculative preallocations in the destination file so that we don't
> > expose stale disk contents.  The previous strategy of trying to clear
> > the preallocations does not work if the destination file has the
> > PREALLOC flag set but no delalloc blocks.
> > 
> > Uncovered by shared/010.
> > 
> > Reported-by: Zorro Lang <zlang@redhat.com>
> > Bugzilla-id: https://bugzilla.kernel.org/show_bug.cgi?id=201259
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/xfs_reflink.c |   34 ++++++++++++++++++++++++++--------
> >  1 file changed, 26 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index 38f405415b88..c8e996a99a74 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -1195,6 +1195,27 @@ xfs_iolock_two_inodes_and_break_layout(
> >  	return 0;
> >  }
> >  
> > +/*
> > + * If we're reflinking to a point past the destination file's EOF, we must
> > + * zero any speculative post-EOF preallocations that sit between the old EOF
> > + * and the destination file offset.
> > + */
> > +static int
> > +xfs_reflink_zero_posteof(
> > +	struct xfs_inode	*ip,
> > +	loff_t			pos)
> > +{
> > +	loff_t			isize = i_size_read(VFS_I(ip));
> > +	bool			did_zeroing = false;
> > +
> > +	if (pos <= isize)
> > +		return 0;
> > +
> > +	trace_xfs_zero_eof(ip, isize, pos - isize);
> > +	return iomap_zero_range(VFS_I(ip), isize, pos - isize, &did_zeroing,
> > +			&xfs_iomap_ops);
> 
> iomap_zero_range() accepts NULL for the *did_zero param. Otherwise seems
> fine, barring Eric's question on whether we need additional checks..

Oh, right, it does.  Will fix, thank you!

--D

> Brian
> 
> > +}
> > +
> >  /*
> >   * Link a range of blocks from one file to another.
> >   */
> > @@ -1257,15 +1278,12 @@ xfs_reflink_remap_range(
> >  	trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
> >  
> >  	/*
> > -	 * Clear out post-eof preallocations because we don't have page cache
> > -	 * backing the delayed allocations and they'll never get freed on
> > -	 * their own.
> > +	 * Zero existing post-eof speculative preallocations in the destination
> > +	 * file.
> >  	 */
> > -	if (xfs_can_free_eofblocks(dest, true)) {
> > -		ret = xfs_free_eofblocks(dest);
> > -		if (ret)
> > -			goto out_unlock;
> > -	}
> > +	ret = xfs_reflink_zero_posteof(dest, pos_out);
> > +	if (ret)
> > +		goto out_unlock;
> >  
> >  	/* Set flags and remap blocks. */
> >  	ret = xfs_reflink_set_inode_flag(src, dest);

Eric Sandeen Oct. 3, 2018, 3:35 p.m. UTC | #5

On 10/3/18 10:12 AM, Darrick J. Wong wrote:
> On Wed, Oct 03, 2018 at 07:11:14AM -0500, Eric Sandeen wrote:
>> On 10/2/18 9:03 PM, Darrick J. Wong wrote:
>>> From: Darrick J. Wong <darrick.wong@oracle.com>
>>>
>>> When we're reflinking between two files and the destination file range
>>> is well beyond the destination file's EOF marker, zero any posteof
>>> speculative preallocations in the destination file so that we don't
>>> expose stale disk contents.  The previous strategy of trying to clear
>>> the preallocations does not work if the destination file has the
>>> PREALLOC flag set but no delalloc blocks.
>>>
>>> Uncovered by shared/010.
>>>
>>> Reported-by: Zorro Lang <zlang@redhat.com>
>>> Bugzilla-id: https://bugzilla.kernel.org/show_bug.cgi?id=201259
>>> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
>>
>> The action makes sense, and this does resolve my simple testcase,
>> and makes shared/010 pass for me as well.
>>
>> However, this makes my correctness spidey-sense tingle; why is there a
>> new helper unique to extending reflinks, when extending writes already
>> must do the same thing?
> 
> I think you're referring to Dave's earlier question of "Why don't you
> just use xfs_file_aio_write_checks?"
> 
> It's tempting to adapt xfs_file_aio_write_checks for reflink, but I
> think I have to create a new function because (a) we don't have a kiocb
> to pass in, and (b) we have to lock two inodes for reflink while abiding
> the [VX]FS inode locking rules and making sure we break the destination
> file's layout correctly.
> 
>> I didn't follow all the discussion on IRC, but might be worth
>> explaining on the list for others as well.  Are there any other
>> extending write tests that aren't happening for extending reflink?
> 
> Yes, there are a number of behavioral inconsistencies between regular
> write and clonerange that have been discovered in the past few days, and
> it's going to take me a few days to clean all of this up:
> 
> - Lack of file_update_time(), though the ctime update is open-coded in
>   the reflink routines.
> 
> - Lack of file_remove_privs() to drop suid and capabilities on write.
>   Totally missing from the btrfs implementation and xfs/ocfs2 followed
>   that behavior warts and all.
> 
> - Lack of RLIMIT_FSIZE checking: D'oh.  Same lame excuse as above.
> 
> - Lack of MAX_NON_LFS size checking: Same.
> 
> - Lack of s_maxbytes checking: Same.  Alarming since this means we can
>   reflink to offsets the pagecache doesn't support.
> 
> - Should our clonerange return bytes reflinked to copy_file_range?
> 
> That last one requires more careful consideration & will take longer;
> the first two are nearly ready.

Ok, so let's say something like "this patch looks good as far as it goes,
but as you work out these other issues, please consider code structure
so that requirements which are common to extending write & extending reflink
are done in common code rather than cut & pasted?"  :)

-Eric

Darrick J. Wong Oct. 3, 2018, 3:51 p.m. UTC | #6

On Wed, Oct 03, 2018 at 10:35:50AM -0500, Eric Sandeen wrote:
> On 10/3/18 10:12 AM, Darrick J. Wong wrote:
> > On Wed, Oct 03, 2018 at 07:11:14AM -0500, Eric Sandeen wrote:
> >> On 10/2/18 9:03 PM, Darrick J. Wong wrote:
> >>> From: Darrick J. Wong <darrick.wong@oracle.com>
> >>>
> >>> When we're reflinking between two files and the destination file range
> >>> is well beyond the destination file's EOF marker, zero any posteof
> >>> speculative preallocations in the destination file so that we don't
> >>> expose stale disk contents.  The previous strategy of trying to clear
> >>> the preallocations does not work if the destination file has the
> >>> PREALLOC flag set but no delalloc blocks.
> >>>
> >>> Uncovered by shared/010.
> >>>
> >>> Reported-by: Zorro Lang <zlang@redhat.com>
> >>> Bugzilla-id: https://bugzilla.kernel.org/show_bug.cgi?id=201259
> >>> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> >>
> >> The action makes sense, and this does resolve my simple testcase,
> >> and makes shared/010 pass for me as well.
> >>
> >> However, this makes my correctness spidey-sense tingle; why is there a
> >> new helper unique to extending reflinks, when extending writes already
> >> must do the same thing?
> > 
> > I think you're referring to Dave's earlier question of "Why don't you
> > just use xfs_file_aio_write_checks?"
> > 
> > It's tempting to adapt xfs_file_aio_write_checks for reflink, but I
> > think I have to create a new function because (a) we don't have a kiocb
> > to pass in, and (b) we have to lock two inodes for reflink while abiding
> > the [VX]FS inode locking rules and making sure we break the destination
> > file's layout correctly.
> > 
> >> I didn't follow all the discussion on IRC, but might be worth
> >> explaining on the list for others as well.  Are there any other
> >> extending write tests that aren't happening for extending reflink?
> > 
> > Yes, there are a number of behavioral inconsistencies between regular
> > write and clonerange that have been discovered in the past few days, and
> > it's going to take me a few days to clean all of this up:
> > 
> > - Lack of file_update_time(), though the ctime update is open-coded in
> >   the reflink routines.
> > 
> > - Lack of file_remove_privs() to drop suid and capabilities on write.
> >   Totally missing from the btrfs implementation and xfs/ocfs2 followed
> >   that behavior warts and all.
> > 
> > - Lack of RLIMIT_FSIZE checking: D'oh.  Same lame excuse as above.
> > 
> > - Lack of MAX_NON_LFS size checking: Same.
> > 
> > - Lack of s_maxbytes checking: Same.  Alarming since this means we can
> >   reflink to offsets the pagecache doesn't support.
> > 
> > - Should our clonerange return bytes reflinked to copy_file_range?
> > 
> > That last one requires more careful consideration & will take longer;
> > the first two are nearly ready.
> 
> Ok, so let's say something like "this patch looks good as far as it goes,
> but as you work out these other issues, please consider code structure
> so that requirements which are common to extending write & extending reflink
> are done in common code rather than cut & pasted?"  :)

The scattershot approach sucks, yes.  I'm concentrating for now on
fixing the glaring holes and anticipate adding a final patch to pull
everything into a common xfs_reflink_clone_file_prep function that takes
both inodes and does whatever checking and prep work are needed (like
xfs_file_aio_write_checks) so that when it returns, the two files are
ready for xfs_reflink_remap_blocks.

--D

> -Eric
