diff mbox series

[v6,24/28] btrfs: enable relocation in HMZONED mode

Message ID 20191213040915.3502922-25-naohiro.aota@wdc.com (mailing list archive)
State New, archived
Headers show
Series btrfs: zoned block device support | expand

Commit Message

Naohiro Aota Dec. 13, 2019, 4:09 a.m. UTC
To serialize allocation and submit_bio, we introduced mutex around them. As
a result, preallocation must be completely disabled to avoid a deadlock.

Since current relocation process relies on preallocation to move file data
extents, it must be handled in another way. In HMZONED mode, we just
truncate the inode to the size that we wanted to pre-allocate. Then, we
flush dirty pages on the file before finishing relocation process.
run_delalloc_hmzoned() will handle all the allocation and submit IOs to
the underlying layers.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/relocation.c | 39 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 37 insertions(+), 2 deletions(-)

Comments

Josef Bacik Dec. 17, 2019, 9:32 p.m. UTC | #1
On 12/12/19 11:09 PM, Naohiro Aota wrote:
> To serialize allocation and submit_bio, we introduced mutex around them. As
> a result, preallocation must be completely disabled to avoid a deadlock.
> 
> Since current relocation process relies on preallocation to move file data
> extents, it must be handled in another way. In HMZONED mode, we just
> truncate the inode to the size that we wanted to pre-allocate. Then, we
> flush dirty pages on the file before finishing relocation process.
> run_delalloc_hmzoned() will handle all the allocation and submit IOs to
> the underlying layers.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>   fs/btrfs/relocation.c | 39 +++++++++++++++++++++++++++++++++++++--
>   1 file changed, 37 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index d897a8e5e430..2d17b7566df4 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -3159,6 +3159,34 @@ int prealloc_file_extent_cluster(struct inode *inode,
>   	if (ret)
>   		goto out;
>   
> +	/*
> +	 * In HMZONED, we cannot preallocate the file region. Instead,
> +	 * we dirty and fiemap_write the region.
> +	 */
> +
> +	if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED)) {
> +		struct btrfs_root *root = BTRFS_I(inode)->root;
> +		struct btrfs_trans_handle *trans;
> +
> +		end = cluster->end - offset + 1;
> +		trans = btrfs_start_transaction(root, 1);
> +		if (IS_ERR(trans))
> +			return PTR_ERR(trans);
> +
> +		inode->i_ctime = current_time(inode);
> +		i_size_write(inode, end);
> +		btrfs_ordered_update_i_size(inode, end, NULL);
> +		ret = btrfs_update_inode(trans, root, inode);
> +		if (ret) {
> +			btrfs_abort_transaction(trans, ret);
> +			btrfs_end_transaction(trans);
> +			return ret;
> +		}
> +		ret = btrfs_end_transaction(trans);
> +
> +		goto out;
> +	}
> +

Why are we arbitrarily extending the i_size here?  If we don't need prealloc we 
don't need to jack up the i_size either.

>   	cur_offset = prealloc_start;
>   	while (nr < cluster->nr) {
>   		start = cluster->boundary[nr] - offset;
> @@ -3346,6 +3374,10 @@ static int relocate_file_extent_cluster(struct inode *inode,
>   		btrfs_throttle(fs_info);
>   	}
>   	WARN_ON(nr != cluster->nr);
> +	if (btrfs_fs_incompat(fs_info, HMZONED) && !ret) {
> +		ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
> +		WARN_ON(ret);

Do not WARN_ON() when this could happen due to IO errors.  Thanks,

Josef
Naohiro Aota Dec. 18, 2019, 10:49 a.m. UTC | #2
On Tue, Dec 17, 2019 at 04:32:04PM -0500, Josef Bacik wrote:
>On 12/12/19 11:09 PM, Naohiro Aota wrote:
>>To serialize allocation and submit_bio, we introduced mutex around them. As
>>a result, preallocation must be completely disabled to avoid a deadlock.
>>
>>Since current relocation process relies on preallocation to move file data
>>extents, it must be handled in another way. In HMZONED mode, we just
>>truncate the inode to the size that we wanted to pre-allocate. Then, we
>>flush dirty pages on the file before finishing relocation process.
>>run_delalloc_hmzoned() will handle all the allocation and submit IOs to
>>the underlying layers.
>>
>>Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>---
>>  fs/btrfs/relocation.c | 39 +++++++++++++++++++++++++++++++++++++--
>>  1 file changed, 37 insertions(+), 2 deletions(-)
>>
>>diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
>>index d897a8e5e430..2d17b7566df4 100644
>>--- a/fs/btrfs/relocation.c
>>+++ b/fs/btrfs/relocation.c
>>@@ -3159,6 +3159,34 @@ int prealloc_file_extent_cluster(struct inode *inode,
>>  	if (ret)
>>  		goto out;
>>+	/*
>>+	 * In HMZONED, we cannot preallocate the file region. Instead,
>>+	 * we dirty and fiemap_write the region.
>>+	 */
>>+
>>+	if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED)) {
>>+		struct btrfs_root *root = BTRFS_I(inode)->root;
>>+		struct btrfs_trans_handle *trans;
>>+
>>+		end = cluster->end - offset + 1;
>>+		trans = btrfs_start_transaction(root, 1);
>>+		if (IS_ERR(trans))
>>+			return PTR_ERR(trans);
>>+
>>+		inode->i_ctime = current_time(inode);
>>+		i_size_write(inode, end);
>>+		btrfs_ordered_update_i_size(inode, end, NULL);
>>+		ret = btrfs_update_inode(trans, root, inode);
>>+		if (ret) {
>>+			btrfs_abort_transaction(trans, ret);
>>+			btrfs_end_transaction(trans);
>>+			return ret;
>>+		}
>>+		ret = btrfs_end_transaction(trans);
>>+
>>+		goto out;
>>+	}
>>+
>
>Why are we arbitrarily extending the i_size here?  If we don't need 
>prealloc we don't need to jack up the i_size either.

We need to extend i_size to read data from the relocating block
group. If not, btrfs_readpage() in relocate_file_extent_cluster()
always reads zero filled page because the read position is beyond the
file size.

>>  	cur_offset = prealloc_start;
>>  	while (nr < cluster->nr) {
>>  		start = cluster->boundary[nr] - offset;
>>@@ -3346,6 +3374,10 @@ static int relocate_file_extent_cluster(struct inode *inode,
>>  		btrfs_throttle(fs_info);
>>  	}
>>  	WARN_ON(nr != cluster->nr);
>>+	if (btrfs_fs_incompat(fs_info, HMZONED) && !ret) {
>>+		ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
>>+		WARN_ON(ret);
>
>Do not WARN_ON() when this could happen due to IO errors.  Thanks,
>
>Josef

Sure. We can just drop it.
Josef Bacik Dec. 18, 2019, 3:01 p.m. UTC | #3
On 12/18/19 5:49 AM, Naohiro Aota wrote:
> On Tue, Dec 17, 2019 at 04:32:04PM -0500, Josef Bacik wrote:
>> On 12/12/19 11:09 PM, Naohiro Aota wrote:
>>> To serialize allocation and submit_bio, we introduced mutex around them. As
>>> a result, preallocation must be completely disabled to avoid a deadlock.
>>>
>>> Since current relocation process relies on preallocation to move file data
>>> extents, it must be handled in another way. In HMZONED mode, we just
>>> truncate the inode to the size that we wanted to pre-allocate. Then, we
>>> flush dirty pages on the file before finishing relocation process.
>>> run_delalloc_hmzoned() will handle all the allocation and submit IOs to
>>> the underlying layers.
>>>
>>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>> ---
>>>  fs/btrfs/relocation.c | 39 +++++++++++++++++++++++++++++++++++++--
>>>  1 file changed, 37 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
>>> index d897a8e5e430..2d17b7566df4 100644
>>> --- a/fs/btrfs/relocation.c
>>> +++ b/fs/btrfs/relocation.c
>>> @@ -3159,6 +3159,34 @@ int prealloc_file_extent_cluster(struct inode *inode,
>>>      if (ret)
>>>          goto out;
>>> +    /*
>>> +     * In HMZONED, we cannot preallocate the file region. Instead,
>>> +     * we dirty and fiemap_write the region.
>>> +     */
>>> +
>>> +    if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED)) {
>>> +        struct btrfs_root *root = BTRFS_I(inode)->root;
>>> +        struct btrfs_trans_handle *trans;
>>> +
>>> +        end = cluster->end - offset + 1;
>>> +        trans = btrfs_start_transaction(root, 1);
>>> +        if (IS_ERR(trans))
>>> +            return PTR_ERR(trans);
>>> +
>>> +        inode->i_ctime = current_time(inode);
>>> +        i_size_write(inode, end);
>>> +        btrfs_ordered_update_i_size(inode, end, NULL);
>>> +        ret = btrfs_update_inode(trans, root, inode);
>>> +        if (ret) {
>>> +            btrfs_abort_transaction(trans, ret);
>>> +            btrfs_end_transaction(trans);
>>> +            return ret;
>>> +        }
>>> +        ret = btrfs_end_transaction(trans);
>>> +
>>> +        goto out;
>>> +    }
>>> +
>>
>> Why are we arbitrarily extending the i_size here?  If we don't need prealloc 
>> we don't need to jack up the i_size either.
> 
> We need to extend i_size to read data from the relocating block
> group. If not, btrfs_readpage() in relocate_file_extent_cluster()
> always reads zero filled page because the read position is beyond the
> file size.

Right but the finish_ordered_io stuff will do the btrfs_ordered_update_i_size() 
once the IO is complete.  So all you really need is the i_size_write and the 
btrfs_update_inode.  If this crashes you'll have an inode that has a i_size with 
no extents up to i_size.  This is fine for NO_HOLES but not fine for !NO_HOLES. 
Thanks,

Josef
diff mbox series

Patch

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index d897a8e5e430..2d17b7566df4 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3159,6 +3159,34 @@  int prealloc_file_extent_cluster(struct inode *inode,
 	if (ret)
 		goto out;
 
+	/*
+	 * In HMZONED, we cannot preallocate the file region. Instead,
+	 * we dirty and fiemap_write the region.
+	 */
+
+	if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED)) {
+		struct btrfs_root *root = BTRFS_I(inode)->root;
+		struct btrfs_trans_handle *trans;
+
+		end = cluster->end - offset + 1;
+		trans = btrfs_start_transaction(root, 1);
+		if (IS_ERR(trans))
+			return PTR_ERR(trans);
+
+		inode->i_ctime = current_time(inode);
+		i_size_write(inode, end);
+		btrfs_ordered_update_i_size(inode, end, NULL);
+		ret = btrfs_update_inode(trans, root, inode);
+		if (ret) {
+			btrfs_abort_transaction(trans, ret);
+			btrfs_end_transaction(trans);
+			return ret;
+		}
+		ret = btrfs_end_transaction(trans);
+
+		goto out;
+	}
+
 	cur_offset = prealloc_start;
 	while (nr < cluster->nr) {
 		start = cluster->boundary[nr] - offset;
@@ -3346,6 +3374,10 @@  static int relocate_file_extent_cluster(struct inode *inode,
 		btrfs_throttle(fs_info);
 	}
 	WARN_ON(nr != cluster->nr);
+	if (btrfs_fs_incompat(fs_info, HMZONED) && !ret) {
+		ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
+		WARN_ON(ret);
+	}
 out:
 	kfree(ra);
 	return ret;
@@ -4186,8 +4218,12 @@  static int __insert_orphan_inode(struct btrfs_trans_handle *trans,
 	struct btrfs_path *path;
 	struct btrfs_inode_item *item;
 	struct extent_buffer *leaf;
+	u64 flags = BTRFS_INODE_NOCOMPRESS | BTRFS_INODE_PREALLOC;
 	int ret;
 
+	if (btrfs_fs_incompat(trans->fs_info, HMZONED))
+		flags &= ~BTRFS_INODE_PREALLOC;
+
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
@@ -4202,8 +4238,7 @@  static int __insert_orphan_inode(struct btrfs_trans_handle *trans,
 	btrfs_set_inode_generation(leaf, item, 1);
 	btrfs_set_inode_size(leaf, item, 0);
 	btrfs_set_inode_mode(leaf, item, S_IFREG | 0600);
-	btrfs_set_inode_flags(leaf, item, BTRFS_INODE_NOCOMPRESS |
-					  BTRFS_INODE_PREALLOC);
+	btrfs_set_inode_flags(leaf, item, flags);
 	btrfs_mark_buffer_dirty(leaf);
 out:
 	btrfs_free_path(path);