btrfs: zoned: do not limit delalloc size to fs_info->max_extent_size

Message ID a2f4a2162fdc3457830fa997c70ffa7c231759ad.1686582572.git.naohiro.aota@wdc.com (mailing list archive)
State New, archived
Series btrfs: zoned: do not limit delalloc size to fs_info->max_extent_size

Commit Message

Naohiro Aota June 12, 2023, 3:10 p.m. UTC
There is an issue writing out huge buffered data with the compressed mount
option on zoned mode. For example, when you buffered-write 16GB of data and
run the "sync" command on a "mount -o compress" file-system on a zoned
device, it takes more than 3 minutes to finish the sync, triggering the
hung_task timeout.

Before the patch:

    + xfs_io -f -c 'pwrite -b 8M 0 16G' /mnt/test/foo
    wrote 17179869184/17179869184 bytes at offset 0
    16.000 GiB, 2048 ops; 0:00:23.74 (690.013 MiB/sec and 86.2516 ops/sec)

    real    0m23.770s
    user    0m0.005s
    sys     0m23.599s
    + sync

    real    3m55.921s
    user    0m0.000s
    sys     0m0.134s

After the patch:

    + xfs_io -f -c 'pwrite -b 8M 0 16G' /mnt/test/foo
    wrote 17179869184/17179869184 bytes at offset 0
    16.000 GiB, 2048 ops; 0:00:28.11 (582.648 MiB/sec and 72.8311 ops/sec)

    real    0m28.146s
    user    0m0.004s
    sys     0m27.951s
    + sync

    real    0m12.310s
    user    0m0.000s
    sys     0m0.048s

This slow "sync" comes from inefficient async chunk building due to a
needlessly limited delalloc size.

find_lock_delalloc_range() looks for delayed-allocation pages over a range
of at most fs_info->max_extent_size. For the non-compressed write case,
that range directly corresponds to num_bytes at cow_file_range() (= the
size of the allocation). So, limiting the size to max_extent_size does no
harm, as we would split the extent anyway even if a larger allocation were
possible.

However, things are different in the compression case.
run_delalloc_compressed() divides the delalloc range into 512 KB sized
chunks and queues a worker for each chunk. The device in the above test
case has a 672 KB zone_append_max_bytes, which is equal to
fs_info->max_extent_size. Thus, one run_delalloc_compressed() call can
build only two chunks at most, far fewer than BTRFS_MAX_EXTENT_SIZE /
512 KB = 256, making it inefficient.

Also, as 672 KB is not aligned to 512 KB, we end up creating ceil(16G /
672K) * 2 = 49934 async chunks for the above case. OTOH, with
BTRFS_MAX_EXTENT_SIZE of delalloc, we create 16G / 512K = 32768 chunks, a
34% reduction.
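
For reference, here is a stand-alone user-space sketch of the chunk-count
arithmetic above (the sizes come from the test setup in this report; nothing
below is btrfs code):

#include <stdio.h>

int main(void)
{
	unsigned long long total = 16ULL << 30;		/* 16 GiB buffered write */
	unsigned long long max_extent = 672ULL << 10;	/* zone_append_max_bytes */
	unsigned long long chunk = 512ULL << 10;	/* async chunk size */

	/* Each <= 672K delalloc range is cut into two 512K async chunks. */
	unsigned long long before = (total + max_extent - 1) / max_extent * 2;
	/* BTRFS_MAX_EXTENT_SIZE (128M) delalloc ranges are 512K-aligned. */
	unsigned long long after = total / chunk;

	printf("before: %llu chunks, after: %llu chunks\n", before, after);
	return 0;
}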

This patch reverts the delalloc size to BTRFS_MAX_EXTENT_SIZE, as it does
not directly correspond to the size of one extent. Instead, this patch
limits the allocation size at cow_file_range() for the zoned mode.

As shown above, with this patch applied, the "sync" time is reduced from
almost 4 minutes to 12 seconds.

Fixes: f7b12a62f008 ("btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size")
CC: stable@vger.kernel.org # 6.1+
---
 fs/btrfs/extent-tree.c |  3 +++
 fs/btrfs/extent_io.c   | 11 ++++++++---
 fs/btrfs/inode.c       |  5 ++++-
 3 files changed, 15 insertions(+), 4 deletions(-)

Comments

David Sterba June 12, 2023, 10:22 p.m. UTC | #1
On Tue, Jun 13, 2023 at 12:10:29AM +0900, Naohiro Aota wrote:
> [...]
> 
> 
> Fixes: f7b12a62f008 ("btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size")
> CC: stable@vger.kernel.org # 6.1+

Missing S-O-B, I've used the one with @wdc.com address. Added to
misc-next, thanks.
Naohiro Aota June 13, 2023, 3:35 a.m. UTC | #2
On Tue, Jun 13, 2023 at 12:22:49AM +0200, David Sterba wrote:
> On Tue, Jun 13, 2023 at 12:10:29AM +0900, Naohiro Aota wrote:
> > [...]
> 
> Missing S-O-B, I've used the one with @wdc.com address. Added to
> misc-next, thanks.

Ah, I forgot it. Thank you.

Just in case:

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Christoph Hellwig June 13, 2023, 5:17 a.m. UTC | #3
On Tue, Jun 13, 2023 at 12:10:29AM +0900, Naohiro Aota wrote:
> This patch reverts the delalloc size to BTRFS_MAX_EXTENT_SIZE, as it does
> not directly correspond to the size of one extent. Instead, this patch
> limits the allocation size at cow_file_range() for the zoned mode.

Maybe I'm missing something, but that limitation also seems wrong or at
least suboptimal.  There is absolutely no problem in creating a large
allocation in cow_file_range.  btrfs_submit_bio will split it into max
append size chunks for I/O, and depending on whether they got reordered or
not we might even be able to record the entire big allocation as a
single extent on disk.
Naohiro Aota June 13, 2023, 5:53 a.m. UTC | #4
On Mon, Jun 12, 2023 at 10:17:23PM -0700, Christoph Hellwig wrote:
> On Tue, Jun 13, 2023 at 12:10:29AM +0900, Naohiro Aota wrote:
> > This patch reverts the delalloc size to BTRFS_MAX_EXTENT_SIZE, as it does
> > not directly correspond to the size of one extent. Instead, this patch
> > limits the allocation size at cow_file_range() for the zoned mode.
> 
> Maybe I'm missing something, but that limitation also seems wrong or at
> least suboptimal.  There is absolutely no problem in creating a large
> allocation in cow_file_range.  btrfs_submit_bio will split it into max
> append size chunks for I/O, and depending on whether they got reordered or
> not we might even be able to record the entire big allocation as a
> single extent on disk.
> 

The issue relates to the per-inode metadata reservation pool. For each
outstanding extent, it reserves 16 * nodesize to insert the extent item,
considering the worst case.

If we allocate one large extent, it releases the now-unnecessary bytes from
the pool, as it thinks it will do only one insertion. Then, that extent is
split again, and several extents are inserted. For those insertions, btrfs
consumes the reserved bytes from the per-inode pool, which is now sized for
only one extent. So, with a big filesystem and a large extent write out, we
can exhaust that pool and hit a WARN.

And, re-charging the pool at split time is impossible, I think... But,
things might change now that we have moved the split point.

Please check the original commit f7b12a62f008 ("btrfs: replace
BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size") and
btrfs_calculate_inode_block_rsv_size() for details.
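
For illustration, a simplified sketch of that worst-case math (the helper
below is made up; the real calculation lives in
btrfs_calculate_inode_block_rsv_size() and btrfs_calc_insert_metadata_size()):

#include <stdint.h>

/*
 * Sketch only: worst-case metadata reservation for N outstanding extents.
 * Inserting one extent item may COW a full tree path, i.e. up to
 * 2 * BTRFS_MAX_LEVEL (= 2 * 8 = 16) tree blocks of nodesize bytes each.
 */
static uint64_t inode_rsv_for_extents(uint64_t nodesize, uint32_t outstanding)
{
	return (uint64_t)outstanding * 16 * nodesize;
}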
Naohiro Aota June 14, 2023, 1:59 a.m. UTC | #5
On Tue, Jun 13, 2023 at 02:53:04PM +0900, Naohiro Aota wrote:
> On Mon, Jun 12, 2023 at 10:17:23PM -0700, Christoph Hellwig wrote:
> > On Tue, Jun 13, 2023 at 12:10:29AM +0900, Naohiro Aota wrote:
> > > This patch reverts the delalloc size to BTRFS_MAX_EXTENT_SIZE, as it does
> > > not directly correspond to the size of one extent. Instead, this patch
> > > limits the allocation size at cow_file_range() for the zoned mode.
> > 
> > Maybe I'm missing something, but that limitation also seems wrong or at
> > least suboptimal.  There is absolutely no problem in creating a large
> > allocation in cow_file_range.  btrfs_submit_bio will split it into max
> > append size chunks for I/O, and depending on whether they got reordered or
> > not we might even be able to record the entire big allocation as a
> > single extent on disk.
> > 
> 
> The issue relates to the per-inode metadata reservation pool. For each
> outstanding extent, it reserves 16 * nodesize to insert the extent item,
> considering the worst case.
> 
> If we allocate one large extent, it releases the now-unnecessary bytes from
> the pool, as it thinks it will do only one insertion. Then, that extent is
> split again, and several extents are inserted. For those insertions, btrfs
> consumes the reserved bytes from the per-inode pool, which is now sized for
> only one extent. So, with a big filesystem and a large extent write out, we
> can exhaust that pool and hit a WARN.
> 
> And, re-charging the pool at split time is impossible, I think... But,
> things might change now that we have moved the split point.

I'm considering this again. We need to take care of the reservation pool to
ensure metadata insertion will succeed.

But, if we can keep the reservation pool sized for N (= delalloc size /
fs_info->max_extent_size) extents even when the allocation is done as a
single extent, the reservation should be OK and we can allocate one large
extent.

So, I'm testing the patch below on top of current misc-next.

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e6944c4db18b..911908ea5f6f 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4452,9 +4452,6 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes,
 	bool for_treelog = (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID);
 	bool for_data_reloc = (btrfs_is_data_reloc_root(root) && is_data);
 
-	if (btrfs_is_zoned(fs_info))
-		ASSERT(num_bytes <= fs_info->max_extent_size);
-
 	flags = get_alloc_profile_by_root(root, is_data);
 again:
 	WARN_ON(num_bytes < fs_info->sectorsize);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5f98bc8d9196..b92c814cec97 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1494,10 +1494,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 	while (num_bytes > 0) {
 		struct btrfs_ordered_extent *ordered;
 
-		if (btrfs_is_zoned(fs_info))
-			cur_alloc_size = min_t(u64, num_bytes, fs_info->max_extent_size);
-		else
-			cur_alloc_size = num_bytes;
+		cur_alloc_size = num_bytes;
 		ret = btrfs_reserve_extent(root, cur_alloc_size, cur_alloc_size,
 					   min_alloc_size, 0, alloc_hint,
 					   &ins, 1, 1);
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index a629532283bc..2d12547e4d64 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -151,7 +151,9 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
 			u64 ram_bytes, u64 disk_bytenr, u64 disk_num_bytes,
 			u64 offset, unsigned long flags, int compress_type)
 {
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct btrfs_ordered_extent *entry;
+	int nr_extents = 1;
 	int ret;
 
 	if (flags &
@@ -193,13 +195,21 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
 	INIT_LIST_HEAD(&entry->work_list);
 	init_completion(&entry->completion);
 
+	/*
+	 * We may split the extent due to a device's max_zone_append_size.
+	 * Keep the reservation for the worst case, even if we are allocating
+	 * only one ordered extent.
+	 */
+	if (btrfs_is_zoned(fs_info))
+		nr_extents = count_max_extents(fs_info, num_bytes);
+
 	/*
 	 * We don't need the count_max_extents here, we can assume that all of
 	 * that work has been done at higher layers, so this is truly the
 	 * smallest the extent is going to get.
 	 */
 	spin_lock(&inode->lock);
-	btrfs_mod_outstanding_extents(inode, 1);
+	btrfs_mod_outstanding_extents(inode, nr_extents);
 	spin_unlock(&inode->lock);
 
 	return entry;
@@ -580,6 +590,7 @@ void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode,
 	struct rb_node *node;
 	bool pending;
 	bool freespace_inode;
+	int nr_extents = 1;
 
 	/*
 	 * If this is a free space inode the thread has not acquired the ordered
@@ -588,9 +599,12 @@ void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode,
 	freespace_inode = btrfs_is_free_space_inode(btrfs_inode);
 
 	btrfs_lockdep_acquire(fs_info, btrfs_trans_pending_ordered);
+
 	/* This is paired with btrfs_alloc_ordered_extent. */
+	if (btrfs_is_zoned(fs_info))
+		nr_extents = count_max_extents(fs_info, entry->num_bytes);
 	spin_lock(&btrfs_inode->lock);
-	btrfs_mod_outstanding_extents(btrfs_inode, -1);
+	btrfs_mod_outstanding_extents(btrfs_inode, -nr_extents);
 	spin_unlock(&btrfs_inode->lock);
 	if (root != fs_info->tree_root) {
 		u64 release;
Naohiro Aota June 14, 2023, 2:05 a.m. UTC | #6
On Tue, Jun 13, 2023 at 12:10:29AM +0900, Naohiro Aota wrote:
> [...]
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 911908ea5f6f..e6944c4db18b 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4452,6 +4452,9 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes,
>  	bool for_treelog = (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID);
>  	bool for_data_reloc = (btrfs_is_data_reloc_root(root) && is_data);
>  
> +	if (btrfs_is_zoned(fs_info))
> +		ASSERT(num_bytes <= fs_info->max_extent_size);
> +

BTW, this ASSERT is problematic when it is called from the direct IO
context. So, we need to rework this patch anyway. Sorry for the confusion.
Christoph Hellwig June 14, 2023, 5:44 a.m. UTC | #7
On Wed, Jun 14, 2023 at 01:59:35AM +0000, Naohiro Aota wrote:
> On Tue, Jun 13, 2023 at 02:53:04PM +0900, Naohiro Aota wrote:
> > [...]
> 
> I'm considering this again. We need to take care of the reservation pool to
> ensure metadata insertion will succeed.
> 
> But, if we can keep the reservation pool sized for N (= delalloc size /
> fs_info->max_extent_size) extents even when the allocation is done as a
> single extent, the reservation should be OK and we can allocate one large
> extent.
> 
> So, I'm testing the patch below on top of current misc-next.

I like the idea, but we need to be very careful to not run into rounding
errors for the nr reserved extents calculation.  Maybe we should store
the number of reserved extents in the ordered_extent, and then steal
one for each split and just return the ones accounted for the
ordered_extent when removing it?
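
For what it's worth, a rough user-space sketch of that bookkeeping (every
name below is hypothetical, not an existing btrfs structure or helper):

#include <stdint.h>

/*
 * Hypothetical sketch of the suggestion: the ordered extent remembers how
 * many extents were reserved for it, every split steals exactly one, and
 * removal returns whatever is left -- no recalculation, so no rounding.
 */
struct oe_sketch {
	uint64_t num_bytes;
	int res_extents;	/* reservations still held by this extent */
};

static void oe_sketch_split(struct oe_sketch *ordered, struct oe_sketch *piece)
{
	ordered->res_extents--;		/* steal one for the split-out piece */
	piece->res_extents = 1;
}

static int oe_sketch_remove(struct oe_sketch *ordered)
{
	return ordered->res_extents;	/* give back the remainder */
}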
David Sterba June 14, 2023, 10:08 p.m. UTC | #8
On Wed, Jun 14, 2023 at 02:05:48AM +0000, Naohiro Aota wrote:
> On Tue, Jun 13, 2023 at 12:10:29AM +0900, Naohiro Aota wrote:
> > [...]
> 
> BTW, this ASSERT is problematic when it is called from the direct IO
> context. So, we need to rework this patch anyway. Sorry for the confusion.

Ok, patch removed from misc-next.
Naohiro Aota June 19, 2023, 4:35 a.m. UTC | #9
On Tue, Jun 13, 2023 at 10:44:27PM -0700, hch@infradead.org wrote:
> On Wed, Jun 14, 2023 at 01:59:35AM +0000, Naohiro Aota wrote:
> > [...]
> 
> I like the idea, but we need to be very careful to not run into rounding
> errors for the nr reserved extents calculation.  Maybe we should store
> the number of reserved extents in the ordered_extent, and then steal
> one for each split and just return the ones accounted for the
> ordered_extent when removing it?

Actually, I think we don't need to do that. We can do it as in the
following patch, which adds the extent count for the split-out extent (by
alloc_ordered_extent()), decreases the original extent count, and adds back
the count for the remaining part.

Technically, when the split size is not max_extent_size, the metadata pool
will be short for the outstanding_extents. But, that is still acceptable
because even on regular btrfs we can exceed count_max_extents() depending
on the free space state, and the pool is prepared for the max-depth tree.
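
To see that shortfall concretely, here is a small user-space example
mirroring the count_max_extents() arithmetic (= ceil(size /
max_extent_size)), using the 672 KB device from the patch description:

#include <stdio.h>

/* Mirrors count_max_extents(): how many pieces a range may split into. */
static unsigned long count_max_extents(unsigned long size, unsigned long max)
{
	return (size + max - 1) / max;
}

int main(void)
{
	unsigned long max = 672UL << 10;	/* max_extent_size */
	unsigned long orig = 1344UL << 10;	/* ordered extent before split */
	unsigned long split = 512UL << 10;	/* split-out piece */

	unsigned long before = count_max_extents(orig, max);	/* 2 */
	unsigned long after = count_max_extents(split, max) +
			      count_max_extents(orig - split, max);	/* 1 + 2 */

	/* The pool was reserved for "before" extents, so it is short by one. */
	printf("before=%lu after=%lu\n", before, after);
	return 0;
}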

I'm testing the patch below, but I still see outstanding_extents left
unreleased when an inode is deleted.

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index a629532283bc..b140c05e7770 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -1155,6 +1169,7 @@ struct btrfs_ordered_extent *btrfs_split_ordered_extent(
 	struct btrfs_ordered_extent *new;
 	struct rb_node *node;
 	u64 offset = 0;
+	int nr_extents;
 
 	trace_btrfs_ordered_extent_split(inode, ordered);
 
@@ -1181,6 +1196,9 @@ struct btrfs_ordered_extent *btrfs_split_ordered_extent(
 	if (IS_ERR(new))
 		return new;
 
+	nr_extents = count_max_extents(fs_info, ordered->num_bytes);
+	btrfs_mod_outstanding_extents(inode, -nr_extents);
+
 	/* One ref for the tree. */
 	refcount_inc(&new->refs);
 
@@ -1198,6 +1216,9 @@ struct btrfs_ordered_extent *btrfs_split_ordered_extent(
 	ordered->num_bytes -= len;
 	ordered->disk_num_bytes -= len;
 
+	nr_extents = count_max_extents(fs_info, ordered->num_bytes);
+	btrfs_mod_outstanding_extents(inode, nr_extents);
+
 	if (test_bit(BTRFS_ORDERED_IO_DONE, &ordered->flags)) {
 		ASSERT(ordered->bytes_left == 0);
 		new->bytes_left = 0;

Patch

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 911908ea5f6f..e6944c4db18b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4452,6 +4452,9 @@  int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes,
 	bool for_treelog = (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID);
 	bool for_data_reloc = (btrfs_is_data_reloc_root(root) && is_data);
 
+	if (btrfs_is_zoned(fs_info))
+		ASSERT(num_bytes <= fs_info->max_extent_size);
+
 	flags = get_alloc_profile_by_root(root, is_data);
 again:
 	WARN_ON(num_bytes < fs_info->sectorsize);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a91d5ad27984..bd86d9fc2b31 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -371,12 +371,17 @@  noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
 				    struct page *locked_page, u64 *start,
 				    u64 *end)
 {
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
 	const u64 orig_start = *start;
 	const u64 orig_end = *end;
-	/* The sanity tests may not set a valid fs_info. */
-	u64 max_bytes = fs_info ? fs_info->max_extent_size : BTRFS_MAX_EXTENT_SIZE;
+	/*
+	 * We don't use fs_info->max_extent_size here. The delalloc range does
+	 * not directly correspond to the size of an extent. The allocation
+	 * size will be capped by either cow_file_range() (only for zoned) or
+	 * run_delalloc_compressed(), so we can give a large enough size to
+	 * collect delalloc pages.
+	 */
+	u64 max_bytes = BTRFS_MAX_EXTENT_SIZE;
 	u64 delalloc_start;
 	u64 delalloc_end;
 	bool found;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 27bb1298f73c..3823ecc836c2 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1494,7 +1494,10 @@  static noinline int cow_file_range(struct btrfs_inode *inode,
 	while (num_bytes > 0) {
 		struct btrfs_ordered_extent *ordered;
 
-		cur_alloc_size = num_bytes;
+		if (btrfs_is_zoned(fs_info))
+			cur_alloc_size = min_t(u64, num_bytes, fs_info->max_extent_size);
+		else
+			cur_alloc_size = num_bytes;
 		ret = btrfs_reserve_extent(root, cur_alloc_size, cur_alloc_size,
 					   min_alloc_size, 0, alloc_hint,
 					   &ins, 1, 1);