[2/2] xfs: convert extents in place for ZERO_RANGE

Message ID	25a2ebbc-0ec9-f5dd-eba0-4101c80837dd@sandeen.net (mailing list archive)
State	New, archived
Series	xfs: don't fragment files with ZERO_RANGE calls
  [0/2] xfs: don't fragment files with ZERO_RANGE calls
  [1/2] xfs: factor range zeroing out of xfs_free_file_space
  [2/2] xfs: convert extents in place for ZERO_RANGE

Eric Sandeen June 25, 2019, 12:48 a.m. UTC

Rather than completely removing and re-allocating a range
during ZERO_RANGE fallocate calls, convert whole blocks in the
range using xfs_alloc_file_space(XFS_BMAPI_PREALLOC|XFS_BMAPI_CONVERT)
and then zero the edges with xfs_zero_range()

(Note that this changes the rounding direction of the
xfs_alloc_file_space range, because we only want to hit whole
blocks within the range.)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---

<currently running fsx ad infinitum, so far so good>
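
For readers following the rounding note in the commit message, here is a minimal userspace sketch (not kernel code) contrasting the two range calculations. The names rnd_up, rnd_down, outer_blocks and inner_blocks are made up for illustration, and the clamp to a zero-length range when no whole block fits is an assumption of this sketch rather than something the patch does.

```c
#include <assert.h>
#include <stdint.h>

/* Power-of-two rounding, standing in for the kernel's round_up() and
 * round_down() when blksize = 1 << sb_blocklog. */
static uint64_t rnd_down(uint64_t x, uint64_t blksize)
{
	return x & ~(blksize - 1);
}

static uint64_t rnd_up(uint64_t x, uint64_t blksize)
{
	return (x + blksize - 1) & ~(blksize - 1);
}

struct byte_range {
	uint64_t start;
	uint64_t len;
};

/* Pre-patch shape: grow the range to cover every block it touches. */
static struct byte_range outer_blocks(uint64_t offset, uint64_t len,
				      uint64_t blksize)
{
	struct byte_range r;

	assert(blksize && (blksize & (blksize - 1)) == 0);
	r.start = rnd_down(offset, blksize);
	r.len = rnd_up(offset + len, blksize) - r.start;
	return r;
}

/* Patch shape: shrink the range to the whole blocks inside it.
 * Sketch-only assumption: report len 0 if no whole block fits. */
static struct byte_range inner_blocks(uint64_t offset, uint64_t len,
				      uint64_t blksize)
{
	struct byte_range r;
	uint64_t start = rnd_up(offset, blksize);
	uint64_t end = rnd_down(offset + len, blksize);

	assert(blksize && (blksize & (blksize - 1)) == 0);
	r.start = start;
	r.len = end > start ? end - start : 0;
	return r;
}
```

With 4k blocks, a request for bytes 1000..10999 touches blocks 0-2 (outer range 0..12287) but fully contains only block 1 (inner range 4096..8191); the partial blocks on either side are what xfs_zero_range() is left to handle.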

Dave Chinner June 25, 2019, 2:39 a.m. UTC | #1

On Mon, Jun 24, 2019 at 07:48:11PM -0500, Eric Sandeen wrote:
> Rather than completely removing and re-allocating a range
> during ZERO_RANGE fallocate calls, convert whole blocks in the
> range using xfs_alloc_file_space(XFS_BMAPI_PREALLOC|XFS_BMAPI_CONVERT)
> and then zero the edges with xfs_zero_range()

That's what I originally used to implement ZERO_RANGE and that
had problems with zeroing the partial blocks either side and
unexpected inode size changes. See commit:

5d11fb4b9a1d xfs: rework zero range to prevent invalid i_size updates

I also remember discussion about zero range being inefficient on
sparse files and fragmented files - the current implementation
effectively defragments such files, whilst using XFS_BMAPI_CONVERT
just leaves all the fragments behind.

> (Note that this changes the rounding direction of the
> xfs_alloc_file_space range, because we only want to hit whole
> blocks within the range.)
> 
> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> ---
> 
> <currently running fsx ad infinitum, so far so good>
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 0a96c4d1718e..eae202bfe134 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1164,23 +1164,25 @@ xfs_zero_file_space(
>  
>  	blksize = 1 << mp->m_sb.sb_blocklog;
>  
> +	error = xfs_flush_unmap_range(ip, offset, len);
> +	if (error)
> +		return error;
>  	/*
> -	 * Punch a hole and prealloc the range. We use hole punch rather than
> -	 * unwritten extent conversion for two reasons:
> -	 *
> -	 * 1.) Hole punch handles partial block zeroing for us.
> -	 *
> -	 * 2.) If prealloc returns ENOSPC, the file range is still zero-valued
> -	 * by virtue of the hole punch.
> +	 * Convert whole blocks in the range to unwritten, then call iomap
> +	 * via xfs_zero_range to zero the range.  iomap will skip holes and
> +	 * unwritten extents, and just zero the edges if needed.  If conversion
> +	 * fails, iomap will simply write zeros to the whole range.
> +	 * nb: always_cow doesn't support unwritten extents.
>  	 */
> -	error = xfs_free_file_space(ip, offset, len);
> -	if (error || xfs_is_always_cow_inode(ip))
> -		return error;
> +	if (!xfs_is_always_cow_inode(ip))
> +		xfs_alloc_file_space(ip, round_up(offset, blksize),
> +				     round_down(offset + len, blksize) -
> +				     round_up(offset, blksize),
> +				     XFS_BMAPI_PREALLOC|XFS_BMAPI_CONVERT);

If this fails with, say, corruption we should abort with an error,
not ignore it. I think we can only safely ignore ENOSPC and maybe
EDQUOT here...
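
A rough sketch of the error filtering suggested here, written as standalone C so it can be compiled outside the kernel. The function names are hypothetical, and treating EDQUOT as ignorable follows the "maybe" above rather than any settled decision. Errors use the kernel's negative-errno convention.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Out-of-space style failures only mean "could not convert"; the
 * caller can still fall back to writing zeroes over the range. */
static bool convert_error_is_ignorable(int error)
{
	return error == -ENOSPC || error == -EDQUOT;
}

/* Returns 0 if the caller may carry on to the zeroing step, or the
 * original negative errno if the operation should be aborted. */
static int filter_convert_error(int error)
{
	assert(error <= 0);	/* kernel-style: 0 or -errno */

	if (error == 0 || convert_error_is_ignorable(error))
		return 0;
	return error;
}
```

The idea would be to run the return value of the conversion call through something like filter_convert_error() and bail out on a non-zero result before calling xfs_zero_range().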

> -	return xfs_alloc_file_space(ip, round_down(offset, blksize),
> -				     round_up(offset + len, blksize) -
> -				     round_down(offset, blksize),
> -				     XFS_BMAPI_PREALLOC);
> +	error = xfs_zero_range(ip, offset, len);

What prevents xfs_zero_range() from changing the file size if
offset + len is beyond EOF and there are allocated extents (from
delalloc conversion) beyond EOF? (i.e. FALLOC_FL_KEEP_SIZE is set by
the caller).

Cheers,

Dave.

Eric Sandeen June 25, 2019, 2:52 a.m. UTC | #2

On 6/24/19 9:39 PM, Dave Chinner wrote:
> On Mon, Jun 24, 2019 at 07:48:11PM -0500, Eric Sandeen wrote:
>> Rather than completely removing and re-allocating a range
>> during ZERO_RANGE fallocate calls, convert whole blocks in the
>> range using xfs_alloc_file_space(XFS_BMAPI_PREALLOC|XFS_BMAPI_CONVERT)
>> and then zero the edges with xfs_zero_range()
> 
> That's what I originally used to implement ZERO_RANGE and that
> had problems with zeroing the partial blocks either side and
> unexpected inode size changes. See commit:
> 
> 5d11fb4b9a1d xfs: rework zero range to prevent invalid i_size updates

Yep I did see that.  It had a lot of hand-rolled partial block stuff
that seems more complex than this, no?  That commit didn't indicate
what the root cause of the failure actually was, AFAICT.

(funny though that I skimmed that commit just to see why we had
what we have, but didn't really intentionally re-implement it...
even though I guess I almost did...)

> I also remember discussion about zero range being inefficient on
> sparse files and fragmented files - the current implementation
> effectively defragments such files, whilst using XFS_BMAPI_CONVERT
> just leaves all the fragments behind.

That's true - and it fragments unfragmented files.  Is ZERO_RANGE
supposed to be a defragmenter?

>> (Note that this changes the rounding direction of the
>> xfs_alloc_file_space range, because we only want to hit whole
>> blocks within the range.)
>>
>> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
>> ---
>>
>> <currently running fsx ad infinitum, so far so good>

<still running, so far so good (4k blocks)>

>> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
>> index 0a96c4d1718e..eae202bfe134 100644
>> --- a/fs/xfs/xfs_bmap_util.c
>> +++ b/fs/xfs/xfs_bmap_util.c
>> @@ -1164,23 +1164,25 @@ xfs_zero_file_space(
>>  
>>  	blksize = 1 << mp->m_sb.sb_blocklog;
>>  
>> +	error = xfs_flush_unmap_range(ip, offset, len);
>> +	if (error)
>> +		return error;
>>  	/*
>> -	 * Punch a hole and prealloc the range. We use hole punch rather than
>> -	 * unwritten extent conversion for two reasons:
>> -	 *
>> -	 * 1.) Hole punch handles partial block zeroing for us.
>> -	 *
>> -	 * 2.) If prealloc returns ENOSPC, the file range is still zero-valued
>> -	 * by virtue of the hole punch.
>> +	 * Convert whole blocks in the range to unwritten, then call iomap
>> +	 * via xfs_zero_range to zero the range.  iomap will skip holes and
>> +	 * unwritten extents, and just zero the edges if needed.  If conversion
>> +	 * fails, iomap will simply write zeros to the whole range.
>> +	 * nb: always_cow doesn't support unwritten extents.
>>  	 */
>> -	error = xfs_free_file_space(ip, offset, len);
>> -	if (error || xfs_is_always_cow_inode(ip))
>> -		return error;
>> +	if (!xfs_is_always_cow_inode(ip))
>> +		xfs_alloc_file_space(ip, round_up(offset, blksize),
>> +				     round_down(offset + len, blksize) -
>> +				     round_up(offset, blksize),
>> +				     XFS_BMAPI_PREALLOC|XFS_BMAPI_CONVERT);
> 
> If this fails with, say, corruption we should abort with an error,
> not ignore it. I think we can only safely ignore ENOSPC and maybe
> EDQUOT here...

Yes, I suppose so, though if this encounters corruption I'd guess
xfs_zero_range probably would as well but that's just handwaving.

>> -	return xfs_alloc_file_space(ip, round_down(offset, blksize),
>> -				     round_up(offset + len, blksize) -
>> -				     round_down(offset, blksize),
>> -				     XFS_BMAPI_PREALLOC);
>> +	error = xfs_zero_range(ip, offset, len);
> 
> What prevents xfs_zero_range() from changing the file size if
> offset + len is beyond EOF and there are allocated extents (from
> delalloc conversion) beyond EOF? (i.e. FALLOC_FL_KEEP_SIZE is set by
> the caller).

nothing, but AFAIK it does the same today... even w/o extents past
EOF:

$ xfs_io -f -c "truncate 0" -c "fzero 0 1m" testfile

$ ls -lh testfile
-rw-------. 1 sandeen sandeen 1.0M Jun 24 21:48 testfile

$ xfs_bmap -vvp testfile
testfile:
 EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET            TOTAL FLAGS
   0: [0..2047]:       183206064..183208111  2 (48988336..48990383)  2048 10000
 FLAG Values:
    010000 Unwritten preallocated extent
    001000 Doesn't begin on stripe unit
    000100 Doesn't end   on stripe unit
    000010 Doesn't begin on stripe width
    000001 Doesn't end   on stripe width
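
As an aside on reading that FLAGS column: the six-digit values in the legend look like an octal bitmask, so a small decoder could be sketched as below. Both the octal interpretation and the constant names are assumptions made for this example; they are not taken from the xfs_bmap sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Bit values copied from the "FLAG Values" legend, read as octal. */
enum {
	FLG_UNWRITTEN	 = 010000,	/* Unwritten preallocated extent */
	FLG_NO_BEGIN_SU	 = 001000,	/* Doesn't begin on stripe unit */
	FLG_NO_END_SU	 = 000100,	/* Doesn't end on stripe unit */
	FLG_NO_BEGIN_SW	 = 000010,	/* Doesn't begin on stripe width */
	FLG_NO_END_SW	 = 000001	/* Doesn't end on stripe width */
};

/* Parse one FLAGS field (e.g. "10000") as an octal number. */
static unsigned long parse_bmap_flags(const char *field)
{
	assert(field != NULL);
	return strtoul(field, NULL, 8);
}

static bool extent_is_unwritten(const char *field)
{
	return (parse_bmap_flags(field) & FLG_UNWRITTEN) != 0;
}
```

Under that reading, the "10000" on extent 0 above means the whole 1M range is a single unwritten preallocated extent with no stripe alignment flags set.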

At the end of the day it's just one allocation behavior over another,
it's not a correctness issue, so if there are concerns I don't have
to push it...

-Eric
 
> Cheers,
> 
> Dave.
>

Darrick J. Wong June 25, 2019, 3 a.m. UTC | #3

On Mon, Jun 24, 2019 at 09:52:03PM -0500, Eric Sandeen wrote:
> On 6/24/19 9:39 PM, Dave Chinner wrote:
> > On Mon, Jun 24, 2019 at 07:48:11PM -0500, Eric Sandeen wrote:
> >> Rather than completely removing and re-allocating a range
> >> during ZERO_RANGE fallocate calls, convert whole blocks in the
> >> range using xfs_alloc_file_space(XFS_BMAPI_PREALLOC|XFS_BMAPI_CONVERT)
> >> and then zero the edges with xfs_zero_range()
> > 
> > That's what I originally used to implement ZERO_RANGE and that
> > had problems with zeroing the partial blocks either side and
> > unexpected inode size changes. See commit:
> > 
> > 5d11fb4b9a1d xfs: rework zero range to prevent invalid i_size updates
> 
> Yep I did see that.  It had a lot of hand-rolled partial block stuff
> that seems more complex than this, no?  That commit didn't indicate
> what the root cause of the failure actually was, AFAICT.
> 
> (funny though that I skimmed that commit just to see why we had
> what we have, but didn't really intentionally re-implement it...
> even though I guess I almost did...)

FWIW the complaint I had about the fragmentary behavior really only
applied to fun and games when one fallocated an ext4 image and then ran
mkfs.ext4 which uses zero range which fragmented the image...

> > I also remember discussion about zero range being inefficient on
> > sparse files and fragmented files - the current implementation
> > effectively defragments such files, whilst using XFS_BMAPI_CONVERT
> > just leaves all the fragments behind.
> 
> That's true - and it fragments unfragmented files.  Is ZERO_RANGE
> supposed to be a defragmenter?

...so please remember, the key point we were talking about when we
discussed this a year ago was that if the /entire/ zero range maps to a
single extent within eof then maybe we ought to just convert it to
unwritten.

Note also that for pmem there's a slightly different optimization --
if the entire range is mapped by written extents (not necessarily
contiguous, just no holes/cow/delalloc/unwritten bits) then we can use
blkdev_issue_zeroout to zero memory and clear hwpoison cheaply.
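
To make the two conditions above concrete, here is a toy model in plain C. It is only a sketch: the struct, enum and function names are invented for illustration, the extent map is assumed to be sorted by offset and non-overlapping, and "single extent" is taken to mean a single written extent, which is my reading rather than something spelled out above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum ext_state { EXT_HOLE, EXT_WRITTEN, EXT_UNWRITTEN, EXT_DELALLOC };

struct ext {
	uint64_t off;		/* file offset in bytes */
	uint64_t len;		/* length in bytes */
	enum ext_state state;
};

/* Case 1: the whole range sits inside one written extent below EOF,
 * so it could simply be converted to unwritten in place. */
static bool single_extent_within_eof(const struct ext *map, size_t n,
				     uint64_t off, uint64_t len, uint64_t eof)
{
	assert(map != NULL || n == 0);

	if (len == 0 || off + len > eof)
		return false;
	for (size_t i = 0; i < n; i++) {
		if (map[i].state == EXT_WRITTEN && off >= map[i].off &&
		    off + len <= map[i].off + map[i].len)
			return true;
	}
	return false;
}

/* Case 2 (the pmem idea): every byte of the range is backed by written
 * extents, not necessarily contiguous on disk, with no holes or
 * unwritten/delalloc pieces in between. */
static bool fully_written(const struct ext *map, size_t n,
			  uint64_t off, uint64_t len)
{
	uint64_t pos = off, end = off + len;

	assert(map != NULL || n == 0);

	for (size_t i = 0; i < n && pos < end; i++) {
		uint64_t e_end = map[i].off + map[i].len;

		if (e_end <= pos)
			continue;	/* extent lies before the range */
		if (map[i].off > pos || map[i].state != EXT_WRITTEN)
			return false;	/* gap, or not plain written data */
		pos = e_end;
	}
	return pos >= end;
}
```

A range spanning two adjacent written extents fails the first test but passes the second, which is the difference between the two optimizations.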

> >> (Note that this changes the rounding direction of the
> >> xfs_alloc_file_space range, because we only want to hit whole
> >> blocks within the range.)
> >>
> >> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> >> ---
> >>
> >> <currently running fsx ad infinitum, so far so good>
> 
> <still running, so far so good (4k blocks)>
> 
> >> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> >> index 0a96c4d1718e..eae202bfe134 100644
> >> --- a/fs/xfs/xfs_bmap_util.c
> >> +++ b/fs/xfs/xfs_bmap_util.c
> >> @@ -1164,23 +1164,25 @@ xfs_zero_file_space(
> >>  
> >>  	blksize = 1 << mp->m_sb.sb_blocklog;
> >>  
> >> +	error = xfs_flush_unmap_range(ip, offset, len);
> >> +	if (error)
> >> +		return error;
> >>  	/*
> >> -	 * Punch a hole and prealloc the range. We use hole punch rather than
> >> -	 * unwritten extent conversion for two reasons:
> >> -	 *
> >> -	 * 1.) Hole punch handles partial block zeroing for us.
> >> -	 *
> >> -	 * 2.) If prealloc returns ENOSPC, the file range is still zero-valued
> >> -	 * by virtue of the hole punch.
> >> +	 * Convert whole blocks in the range to unwritten, then call iomap
> >> +	 * via xfs_zero_range to zero the range.  iomap will skip holes and
> >> +	 * unwritten extents, and just zero the edges if needed.  If conversion
> >> +	 * fails, iomap will simply write zeros to the whole range.
> >> +	 * nb: always_cow doesn't support unwritten extents.
> >>  	 */
> >> -	error = xfs_free_file_space(ip, offset, len);
> >> -	if (error || xfs_is_always_cow_inode(ip))
> >> -		return error;
> >> +	if (!xfs_is_always_cow_inode(ip))
> >> +		xfs_alloc_file_space(ip, round_up(offset, blksize),
> >> +				     round_down(offset + len, blksize) -
> >> +				     round_up(offset, blksize),
> >> +				     XFS_BMAPI_PREALLOC|XFS_BMAPI_CONVERT);
> > 
> > If this fails with, say, corruption we should abort with an error,
> > not ignore it. I think we can only safely ignore ENOSPC and maybe
> > EDQUOT here...
> 
> Yes, I suppose so, though if this encounters corruption I'd guess
> xfs_zero_range probably would as well but that's just handwaving.

<nod>

> >> -	return xfs_alloc_file_space(ip, round_down(offset, blksize),
> >> -				     round_up(offset + len, blksize) -
> >> -				     round_down(offset, blksize),
> >> -				     XFS_BMAPI_PREALLOC);
> >> +	error = xfs_zero_range(ip, offset, len);
> > 
> > What prevents xfs_zero_range() from changing the file size if
> > offset + len is beyond EOF and there are allocated extents (from
> > delalloc conversion) beyond EOF? (i.e. FALLOC_FL_KEEP_SIZE is set by
> > the caller).
> 
> nothing, but AFAIK it does the same today... even w/o extents past
> EOF:
> 
> $ xfs_io -f -c "truncate 0" -c "fzero 0 1m" testfile

fzero -k ?

--D

> 
> $ ls -lh testfile
> -rw-------. 1 sandeen sandeen 1.0M Jun 24 21:48 testfile
> 
> $ xfs_bmap -vvp testfile
> testfile:
>  EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET            TOTAL FLAGS
>    0: [0..2047]:       183206064..183208111  2 (48988336..48990383)  2048 10000
>  FLAG Values:
>     010000 Unwritten preallocated extent
>     001000 Doesn't begin on stripe unit
>     000100 Doesn't end   on stripe unit
>     000010 Doesn't begin on stripe width
>     000001 Doesn't end   on stripe width
> 
> At the end of the day it's just one allocation behavior over another,
> it's not a correctness issue, so if there are concerns I don't have
> to push it...
> 
> -Eric
>  
> > Cheers,
> > 
> > Dave.
> >

Eric Sandeen June 25, 2019, 3:05 a.m. UTC | #4

On 6/24/19 10:00 PM, Darrick J. Wong wrote:
> On Mon, Jun 24, 2019 at 09:52:03PM -0500, Eric Sandeen wrote:
>> On 6/24/19 9:39 PM, Dave Chinner wrote:
>>> On Mon, Jun 24, 2019 at 07:48:11PM -0500, Eric Sandeen wrote:
>>>> Rather than completely removing and re-allocating a range
>>>> during ZERO_RANGE fallocate calls, convert whole blocks in the
>>>> range using xfs_alloc_file_space(XFS_BMAPI_PREALLOC|XFS_BMAPI_CONVERT)
>>>> and then zero the edges with xfs_zero_range()
>>>
>>> That's what I originally used to implement ZERO_RANGE and that
>>> had problems with zeroing the partial blocks either side and
>>> unexpected inode size changes. See commit:
>>>
>>> 5d11fb4b9a1d xfs: rework zero range to prevent invalid i_size updates
>>
>> Yep I did see that.  It had a lot of hand-rolled partial block stuff
>> that seems more complex than this, no?  That commit didn't indicate
>> what the root cause of the failure actually was, AFAICT.
>>
>> (funny though that I skimmed that commit just to see why we had
>> what we have, but didn't really intentionally re-implement it...
>> even though I guess I almost did...)
> 
> FWIW the complaint I had about the fragmentary behavior really only
> applied to fun and games when one fallocated an ext4 image and then ran
> mkfs.ext4 which uses zero range which fragmented the image...
> 
>>> I also remember discussion about zero range being inefficient on
>>> sparse files and fragmented files - the current implementation
>>> effectively defragments such files, whilst using XFS_BMAPI_CONVERT
>>> just leaves all the fragments behind.
>>
>> That's true - and it fragments unfragmented files.  Is ZERO_RANGE
>> supposed to be a defragmenter?
> 
> ...so please remember, the key point we were talking about when we
> discussed this a year ago was that if the /entire/ zero range maps to a
> single extent within eof then maybe we ought to just convert it to
> unwritten.

I remember you mentioning that, but honestly it seems like
overcomplication to say "ZERO_RANGE will also defragment extents
in the process, if it can..."

> Note also that for pmem there's a slightly different optimization --
> if the entire range is mapped by written extents (not necessarily
> contiguous, just no holes/cow/delalloc/unwritten bits) then we can use
> blkdev_issue_zeroout to zero memory and clear hwpoison cheaply.
> 
>>>> (Note that this changes the rounding direction of the
>>>> xfs_alloc_file_space range, because we only want to hit whole
>>>> blocks within the range.)
>>>>
>>>> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
>>>> ---
>>>>
>>>> <currently running fsx ad infinitum, so far so good>
>>
>> <still running, so far so good (4k blocks)>
>>
>>>> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
>>>> index 0a96c4d1718e..eae202bfe134 100644
>>>> --- a/fs/xfs/xfs_bmap_util.c
>>>> +++ b/fs/xfs/xfs_bmap_util.c
>>>> @@ -1164,23 +1164,25 @@ xfs_zero_file_space(
>>>>  
>>>>  	blksize = 1 << mp->m_sb.sb_blocklog;
>>>>  
>>>> +	error = xfs_flush_unmap_range(ip, offset, len);
>>>> +	if (error)
>>>> +		return error;
>>>>  	/*
>>>> -	 * Punch a hole and prealloc the range. We use hole punch rather than
>>>> -	 * unwritten extent conversion for two reasons:
>>>> -	 *
>>>> -	 * 1.) Hole punch handles partial block zeroing for us.
>>>> -	 *
>>>> -	 * 2.) If prealloc returns ENOSPC, the file range is still zero-valued
>>>> -	 * by virtue of the hole punch.
>>>> +	 * Convert whole blocks in the range to unwritten, then call iomap
>>>> +	 * via xfs_zero_range to zero the range.  iomap will skip holes and
>>>> +	 * unwritten extents, and just zero the edges if needed.  If conversion
>>>> +	 * fails, iomap will simply write zeros to the whole range.
>>>> +	 * nb: always_cow doesn't support unwritten extents.
>>>>  	 */
>>>> -	error = xfs_free_file_space(ip, offset, len);
>>>> -	if (error || xfs_is_always_cow_inode(ip))
>>>> -		return error;
>>>> +	if (!xfs_is_always_cow_inode(ip))
>>>> +		xfs_alloc_file_space(ip, round_up(offset, blksize),
>>>> +				     round_down(offset + len, blksize) -
>>>> +				     round_up(offset, blksize),
>>>> +				     XFS_BMAPI_PREALLOC|XFS_BMAPI_CONVERT);
>>>
>>> If this fails with, say, corruption we should abort with an error,
>>> not ignore it. I think we can only safely ignore ENOSPC and maybe
>>> EDQUOT here...
>>
>> Yes, I suppose so, though if this encounters corruption I'd guess
>> xfs_zero_range probably would as well but that's just handwaving.
> 
> <nod>
> 
>>>> -	return xfs_alloc_file_space(ip, round_down(offset, blksize),
>>>> -				     round_up(offset + len, blksize) -
>>>> -				     round_down(offset, blksize),
>>>> -				     XFS_BMAPI_PREALLOC);
>>>> +	error = xfs_zero_range(ip, offset, len);
>>>
>>> What prevents xfs_zero_range() from changing the file size if
>>> offset + len is beyond EOF and there are allocated extents (from
>>> delalloc conversion) beyond EOF? (i.e. FALLOC_FL_KEEP_SIZE is set by
>>> the caller).
>>
>> nothing, but AFAIK it does the same today... even w/o extents past
>> EOF:
>>
>> $ xfs_io -f -c "truncate 0" -c "fzero 0 1m" testfile
> 
> fzero -k ?

$ xfs_io -f -c "truncate 0" -c "fzero -k 0 1m" testfile

$ ls -lh testfile
-rw-------. 1 sandeen sandeen 0 Jun 24 22:02 testfile

with or without my patches.

(with or without it also seems to allocate 1M past EOF...)
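
For anyone who wants to repeat the check without xfs_io, something like the following C should be roughly equivalent to the two fzero invocations in this thread. It is a sketch only: zero_range_size() is a made-up helper, and it returns -1 on filesystems where fallocate(2) does not support FALLOC_FL_ZERO_RANGE.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for fallocate() */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create/truncate path to size 0, zero [off, off + len) and return the
 * resulting file size, or -1 with errno set on failure.  keep_size
 * selects FALLOC_FL_KEEP_SIZE, i.e. the "fzero -k" case. */
static off_t zero_range_size(const char *path, off_t off, off_t len,
			     int keep_size)
{
	int mode = FALLOC_FL_ZERO_RANGE;
	struct stat st;
	off_t size = -1;
	int fd, saved;

	assert(path != NULL);

	if (keep_size)
		mode |= FALLOC_FL_KEEP_SIZE;

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	if (fallocate(fd, mode, off, len) == 0 && fstat(fd, &st) == 0)
		size = st.st_size;

	saved = errno;		/* don't let close() clobber errno */
	close(fd);
	errno = saved;
	return size;
}
```

If the results above hold, zero_range_size("testfile", 0, 1 << 20, 1) should return 0, and the same call with keep_size set to 0 should return 1048576.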

-Eric

Darrick J. Wong June 25, 2019, 3:11 a.m. UTC | #5

On Mon, Jun 24, 2019 at 10:05:51PM -0500, Eric Sandeen wrote:
> On 6/24/19 10:00 PM, Darrick J. Wong wrote:
> > On Mon, Jun 24, 2019 at 09:52:03PM -0500, Eric Sandeen wrote:
> >> On 6/24/19 9:39 PM, Dave Chinner wrote:
> >>> On Mon, Jun 24, 2019 at 07:48:11PM -0500, Eric Sandeen wrote:
> >>>> Rather than completely removing and re-allocating a range
> >>>> during ZERO_RANGE fallocate calls, convert whole blocks in the
> >>>> range using xfs_alloc_file_space(XFS_BMAPI_PREALLOC|XFS_BMAPI_CONVERT)
> >>>> and then zero the edges with xfs_zero_range()
> >>>
> >>> That's what I originally used to implement ZERO_RANGE and that
> >>> had problems with zeroing the partial blocks either side and
> >>> unexpected inode size changes. See commit:
> >>>
> >>> 5d11fb4b9a1d xfs: rework zero range to prevent invalid i_size updates
> >>
> >> Yep I did see that.  It had a lot of hand-rolled partial block stuff
> >> that seems more complex than this, no?  That commit didn't indicate
> >> what the root cause of the failure actually was, AFAICT.
> >>
> >> (funny though that I skimmed that commit just to see why we had
> >> what we have, but didn't really intentionally re-implement it...
> >> even though I guess I almost did...)
> > 
> > FWIW the complaint I had about the fragmentary behavior really only
> > applied to fun and games when one fallocated an ext4 image and then ran
> > mkfs.ext4 which uses zero range which fragmented the image...
> > 
> >>> I also remember discussion about zero range being inefficient on
> >>> sparse files and fragmented files - the current implementation
> >>> effectively defragments such files, whilst using XFS_BMAPI_CONVERT
> >>> just leaves all the fragments behind.
> >>
> >> That's true - and it fragments unfragmented files.  Is ZERO_RANGE
> >> supposed to be a defragmenter?
> > 
> > ...so please remember, the key point we were talking about when we
> > discussed this a year ago was that if the /entire/ zero range maps to a
> > single extent within eof then maybe we ought to just convert it to
> > unwritten.
> 
> I remember you mentioning that, but honestly it seems like
> overcomplication to say "ZERO_RANGE will also defragment extents
> in the process, if it can..."

Well we could just do what we usually do and not write anything down
anywhere so 2022 Eric can argue with 2022 Dave and 2022 me about WTF
zero range is supposed to do.

Really, zero range doesn't specify the effects on the physical mapping.
All it says is that subsequent reads will return zeroes; that holes
will be filled with preallocations; and that preferably it converts to
unwritten extents.

It's that last part where it seems weird that we'd go out of our way to
punch and reallocate for a simple corner case where we could just
convert.
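
The one guarantee everybody agrees on, that subsequent reads return zeroes, is easy to check from userspace whatever the physical mapping ends up looking like. A minimal sketch follows; range_reads_zero() and make_sparse_file() are hypothetical helper names, and the temp file path is arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* True if every byte in [off, off + len) reads back as zero.  A read
 * error or hitting EOF inside the range counts as failure. */
static bool range_reads_zero(int fd, off_t off, off_t len)
{
	unsigned char buf[4096];

	assert(fd >= 0);

	while (len > 0) {
		size_t want = len < (off_t)sizeof(buf) ? (size_t)len
						       : sizeof(buf);
		ssize_t got = pread(fd, buf, want, off);

		if (got <= 0)
			return false;
		for (ssize_t i = 0; i < got; i++) {
			if (buf[i] != 0)
				return false;
		}
		off += got;
		len -= got;
	}
	return true;
}

/* Helper for trying it out: an unlinked temp file of the given size,
 * all holes, which therefore reads as zeroes throughout. */
static int make_sparse_file(off_t size)
{
	char path[] = "/tmp/zero-check-XXXXXX";
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);
	if (ftruncate(fd, size) != 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

A check like this passes identically for a punched-and-reallocated range and for one converted in place; the difference being argued about is visible only in the extent map.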

> > Note also that for pmem there's a slightly different optimization --
> > if the entire range is mapped by written extents (not necessarily
> > contiguous, just no holes/cow/delalloc/unwritten bits) then we can use
> > blkdev_issue_zeroout to zero memory and clear hwpoison cheaply.
> > 
> >>>> (Note that this changes the rounding direction of the
> >>>> xfs_alloc_file_space range, because we only want to hit whole
> >>>> blocks within the range.)
> >>>>
> >>>> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> >>>> ---
> >>>>
> >>>> <currently running fsx ad infinitum, so far so good>
> >>
> >> <still running, so far so good (4k blocks)>
> >>
> >>>> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> >>>> index 0a96c4d1718e..eae202bfe134 100644
> >>>> --- a/fs/xfs/xfs_bmap_util.c
> >>>> +++ b/fs/xfs/xfs_bmap_util.c
> >>>> @@ -1164,23 +1164,25 @@ xfs_zero_file_space(
> >>>>  
> >>>>  	blksize = 1 << mp->m_sb.sb_blocklog;
> >>>>  
> >>>> +	error = xfs_flush_unmap_range(ip, offset, len);
> >>>> +	if (error)
> >>>> +		return error;
> >>>>  	/*
> >>>> -	 * Punch a hole and prealloc the range. We use hole punch rather than
> >>>> -	 * unwritten extent conversion for two reasons:
> >>>> -	 *
> >>>> -	 * 1.) Hole punch handles partial block zeroing for us.
> >>>> -	 *
> >>>> -	 * 2.) If prealloc returns ENOSPC, the file range is still zero-valued
> >>>> -	 * by virtue of the hole punch.
> >>>> +	 * Convert whole blocks in the range to unwritten, then call iomap
> >>>> +	 * via xfs_zero_range to zero the range.  iomap will skip holes and
> >>>> +	 * unwritten extents, and just zero the edges if needed.  If conversion
> >>>> +	 * fails, iomap will simply write zeros to the whole range.
> >>>> +	 * nb: always_cow doesn't support unwritten extents.
> >>>>  	 */
> >>>> -	error = xfs_free_file_space(ip, offset, len);
> >>>> -	if (error || xfs_is_always_cow_inode(ip))
> >>>> -		return error;
> >>>> +	if (!xfs_is_always_cow_inode(ip))
> >>>> +		xfs_alloc_file_space(ip, round_up(offset, blksize),
> >>>> +				     round_down(offset + len, blksize) -
> >>>> +				     round_up(offset, blksize),
> >>>> +				     XFS_BMAPI_PREALLOC|XFS_BMAPI_CONVERT);
> >>>
> >>> If this fails with, say, corruption we should abort with an error,
> >>> not ignore it. I think we can only safely ignore ENOSPC and maybe
> >>> EDQUOT here...
> >>
> >> Yes, I suppose so, though if this encounters corruption I'd guess
> >> xfs_zero_range probably would as well but that's just handwaving.
> > 
> > <nod>
> > 
> >>>> -	return xfs_alloc_file_space(ip, round_down(offset, blksize),
> >>>> -				     round_up(offset + len, blksize) -
> >>>> -				     round_down(offset, blksize),
> >>>> -				     XFS_BMAPI_PREALLOC);
> >>>> +	error = xfs_zero_range(ip, offset, len);
> >>>
> >>> What prevents xfs_zero_range() from changing the file size if
> >>> offset + len is beyond EOF and there are allocated extents (from
> >>> delalloc conversion) beyond EOF? (i.e. FALLOC_FL_KEEP_SIZE is set by
> >>> the caller).
> >>
> >> nothing, but AFAIK it does the same today... even w/o extents past
> >> EOF:
> >>
> >> $ xfs_io -f -c "truncate 0" -c "fzero 0 1m" testfile
> > 
> > fzero -k ?
> 
> $ xfs_io -f -c "truncate 0" -c "fzero -k 0 1m" testfile
> 
> $ ls -lh testfile
> -rw-------. 1 sandeen sandeen 0 Jun 24 22:02 testfile
> 
> with or without my patches.
> 
> (with or without it also seems to allocate 1M past EOF...)

ok cool.

--D

> -Eric
>

Dave Chinner June 25, 2019, 3:54 a.m. UTC | #6

On Mon, Jun 24, 2019 at 10:05:51PM -0500, Eric Sandeen wrote:
> On 6/24/19 10:00 PM, Darrick J. Wong wrote:
> > On Mon, Jun 24, 2019 at 09:52:03PM -0500, Eric Sandeen wrote:
> >> On 6/24/19 9:39 PM, Dave Chinner wrote:
> >>> On Mon, Jun 24, 2019 at 07:48:11PM -0500, Eric Sandeen wrote:
> >>>> Rather than completely removing and re-allocating a range
> >>>> during ZERO_RANGE fallocate calls, convert whole blocks in the
> >>>> range using xfs_alloc_file_space(XFS_BMAPI_PREALLOC|XFS_BMAPI_CONVERT)
> >>>> and then zero the edges with xfs_zero_range()
> >>>
> >>> That's what I originally used to implement ZERO_RANGE and that
> >>> had problems with zeroing the partial blocks either side and
> >>> unexpected inode size changes. See commit:
> >>>
> >>> 5d11fb4b9a1d xfs: rework zero range to prevent invalid i_size updates
> >>
> >> Yep I did see that.  It had a lot of hand-rolled partial block stuff
> >> that seems more complex than this, no?  That commit didn't indicate
> >> what the root cause of the failure actually was, AFAICT.
> >>
> >> (funny thought that I skimmed that commit just to see why we had
> >> what we have, but didn't really intentionally re-implement it...
> >> even though I guess I almost did...)
> > 
> > FWIW the complaint I had about the fragmentary behavior really only
> > applied to fun and games when one fallocated an ext4 image and then ran
> > mkfs.ext4 which uses zero range which fragmented the image...
> > 
> >>> I also remember discussion about zero range being inefficient on
> >>> sparse files and fragmented files - the current implementation
> >>> effectively defragments such files, whilst using XFS_BMAPI_CONVERT
> >>> just leaves all the fragments behind.
> >>
> >> That's true - and it fragments unfragmented files.  Is ZERO_RANGE
> >> supposed to be a defragmenter?
> > 
> > ...so please remember, the key point we were talking about when we
> > discussed this a year ago was that if the /entire/ zero range maps to a
> > single extent within eof then maybe we ought to just convert it to
> > unwritten.
> 
> I remember you mentioning that, but honestly it seems like
> overcomplication to say "ZERO_RANGE will also defragment extents
> in the process, if it can..."

Keep in mind that my original implementation of ZERO_RANGE was
for someone who explicitly requested zeroing of preallocated VM
image files without reallocating them. Hence the XFS_BMAPI_CONVERT
implementation. They'd been careful about initial allocation, and
they wanted to reuse image files for new VMs without perturbing
their initial careful preallocation patterns.

I think the punch+reallocate is a more generally useful behaviour
because people are far less careful about image file layout (e.g.
might be using extent size hints or discard within the VM) and so
we're more likely to see somewhat fragmented files for zeroing than
we are fully intact.

So, yeah, I can see arguments for both cases, and situations where
one behaviour would be preferred over the other.

Random thought: KEEP_SIZE == I know what I'm doing, just convert in
place because the layout is as I want it. !KEEP_SIZE = punch and
preallocate because we are likely changing the file size anyway?
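
The random thought above amounts to a one-line policy keyed on the fallocate mode flags. A sketch of that policy, assuming nothing beyond the FALLOC_FL_* constants from <linux/falloc.h>; the enum and function names are hypothetical and this is not what the patch or XFS implements:

```c
#include <linux/falloc.h>

/* Hypothetical names for the two behaviours discussed in this thread. */
enum zero_strategy {
	ZERO_CONVERT_IN_PLACE,		/* keep the layout, convert to unwritten */
	ZERO_PUNCH_AND_PREALLOC,	/* punch the range, then preallocate it */
};

/*
 * KEEP_SIZE: the caller is assumed to be happy with the existing
 * layout, so convert in place. Without KEEP_SIZE the file size is
 * likely changing anyway, so punch and preallocate.
 */
static enum zero_strategy pick_zero_strategy(int mode)
{
	if (mode & FALLOC_FL_KEEP_SIZE)
		return ZERO_CONVERT_IN_PLACE;
	return ZERO_PUNCH_AND_PREALLOC;
}
```

The appeal of keying on KEEP_SIZE is that it needs no new flag; the cost is that it overloads a flag whose documented meaning is only about the file size.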

> >>> What prevents xfs_zero_range() from changing the file size if
> >>> offset + len is beyond EOF and there are allocated extents (from
> >>> delalloc conversion) beyond EOF? (i.e. FALLOC_FL_KEEP_SIZE is set by
> >>> the caller).
> >>
> >> nothing, but AFAIK it does the same today... even w/o extents past
> >> EOF:
> >>
> >> $ xfs_io -f -c "truncate 0" -c "fzero 0 1m" testfile
> > 
> > fzero -k ?
> 
> $ xfs_io -f -c "truncate 0" -c "fzero -k 0 1m" testfile
> 
> $ ls -lh testfile
> -rw-------. 1 sandeen sandeen 0 Jun 24 22:02 testfile
> 
> with or without my patches.

My concern was about files with pre-existing extents beyond EOF.
i.e. something like this (on a vanilla kernel):

$ xfs_io -f -c "truncate 0"  -c "pwrite 0 16m" -c "fsync" -c "bmap -vp" -c "fzero -k 0 32m" -c "bmap -vp" -c "stat" testfile
wrote 16777216/16777216 bytes at offset 0
16 MiB, 4096 ops; 0.0000 sec (700.556 MiB/sec and 179342.3530 ops/sec)
testfile:
 EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET            TOTAL FLAGS
   0: [0..65407]:      1145960728..1146026135 10 (47331608..47397015) 65408 000000
testfile:
 EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET            TOTAL FLAGS
   0: [0..65535]:      1146026136..1146091671 10 (47397016..47462551) 65536 010000
fd.path = "testfile"
fd.flags = non-sync,non-direct,read-write
stat.ino = 1342366656
stat.type = regular file
stat.size = 16777216
stat.blocks = 65536
fsxattr.xflags = 0x2 [-p--------------]
fsxattr.projid = 0
fsxattr.extsize = 0
fsxattr.cowextsize = 0
fsxattr.nextents = 1
fsxattr.naextents = 0
dioattr.mem = 0x200
dioattr.miniosz = 512
dioattr.maxiosz = 2147483136
$

So you can see the file is 16MiB in size but has 32MiB of blocks
allocated to it, and they are different blocks from the ones the
delalloc write allocated, because we punched the range and
reallocated it as an unwritten extent.

That's where I'm concerned - that range beyond EOF is no longer
punched away by this new code, and it's not unwritten so the
zero_range is going to iterate and zero it, right?
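
For anyone who wants to check the size and block accounting from a program rather than xfs_io, the "fzero -k" step above corresponds to fallocate(2) with FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE. A minimal sketch follows; struct zero_report and zero_range_keep_size() are hypothetical names, and the caller is assumed to have already written and synced the file as in the xfs_io command:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for fallocate() and FALLOC_FL_* */
#endif
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* File size and 512-byte block count before and after the zeroing. */
struct zero_report {
	off_t size_before, size_after;
	blkcnt_t blocks_before, blocks_after;
};

/*
 * Zero [offset, offset + len) without changing the file size, and
 * record st_size/st_blocks on either side of the call. Returns 0 on
 * success or a negative errno (e.g. -EOPNOTSUPP when the filesystem
 * does not implement ZERO_RANGE).
 */
static int zero_range_keep_size(int fd, off_t offset, off_t len,
				struct zero_report *r)
{
	struct stat st;

	if (fstat(fd, &st) < 0)
		return -errno;
	r->size_before = st.st_size;
	r->blocks_before = st.st_blocks;

	if (fallocate(fd, FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE,
		      offset, len) < 0)
		return -errno;

	if (fstat(fd, &st) < 0)
		return -errno;
	r->size_after = st.st_size;
	r->blocks_after = st.st_blocks;
	return 0;
}
```

On success size_after should equal size_before, while blocks_after shows how much space is allocated, including anything beyond EOF. The sketch does not show whether the extents were converted in place or punched and reallocated; that still needs "bmap -vp" or FIEMAP.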

Cheers,

Dave.
