Message ID | a2f4a2162fdc3457830fa997c70ffa7c231759ad.1686582572.git.naohiro.aota@wdc.com |
---|---|
State | New, archived |
Series | btrfs: zoned: do not limit delalloc size to fs_info->max_extent_size |
On Tue, Jun 13, 2023 at 12:10:29AM +0900, Naohiro Aota wrote:
> There is an issue writing out huge buffered data with the compressed mount
> option on zoned mode. For example, when you buffered write 16GB of data and
> call the "sync" command on a "mount -o compress" file-system on a zoned
> device, it takes more than 3 minutes to finish the sync, invoking the
> hung_task timeout.
>
> Before the patch:
>
> + xfs_io -f -c 'pwrite -b 8M 0 16G' /mnt/test/foo
> wrote 17179869184/17179869184 bytes at offset 0
> 16.000 GiB, 2048 ops; 0:00:23.74 (690.013 MiB/sec and 86.2516 ops/sec)
>
> real 0m23.770s
> user 0m0.005s
> sys  0m23.599s
> + sync
>
> real 3m55.921s
> user 0m0.000s
> sys  0m0.134s
>
> After the patch:
>
> + xfs_io -f -c 'pwrite -b 8M 0 16G' /mnt/test/foo
> wrote 17179869184/17179869184 bytes at offset 0
> 16.000 GiB, 2048 ops; 0:00:28.11 (582.648 MiB/sec and 72.8311 ops/sec)
>
> real 0m28.146s
> user 0m0.004s
> sys  0m27.951s
> + sync
>
> real 0m12.310s
> user 0m0.000s
> sys  0m0.048s
>
> This slow "sync" comes from inefficient async chunk building due to a
> needlessly limited delalloc size.
>
> find_lock_delalloc_range() looks for pages for the delayed allocation in a
> range of at most fs_info->max_extent_size. For the non-compressed write
> case, that range directly corresponds to num_bytes at cow_file_range()
> (= the size of the allocation). So, limiting the size to max_extent_size
> does no harm, as we will split the extent even if we could make a larger
> allocation.
>
> However, things are different in the compression case.
> run_delalloc_compressed() divides the delalloc range into 512 KB sized
> chunks and queues a worker for each chunk. The device in the above test
> case has a 672 KB zone_append_max_bytes, which is equal to
> fs_info->max_extent_size. Thus, one run_delalloc_compressed() call can
> build only two chunks at most, far fewer than BTRFS_MAX_EXTENT_SIZE /
> 512KB = 256, making it inefficient.
>
> Also, as 672 KB is not aligned to 512 KB, it creates ceil(16G / 672K) * 2
> = 49934 async chunks for the above case. OTOH, with BTRFS_MAX_EXTENT_SIZE
> delalloced, we create 32768 chunks, a 34% reduction.
>
> This patch reverts the delalloc size to BTRFS_MAX_EXTENT_SIZE, as it does
> not directly correspond to the size of one extent. Instead, this patch
> limits the allocation size at cow_file_range() for the zoned mode.
>
> As shown above, with this patch applied, the "sync" time is reduced from
> almost 4 minutes to 12 seconds.
>
> Fixes: f7b12a62f008 ("btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size")
> CC: stable@vger.kernel.org # 6.1+

Missing S-O-B, I've used the one with the @wdc.com address. Added to
misc-next, thanks.
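The chunk counts quoted above can be reproduced with a short back-of-the-envelope calculation. The following standalone C sketch (illustrative arithmetic only, not btrfs code) derives both figures from the sizes given in the message:

	/*
	 * Check of the async chunk counts above. All sizes come from the
	 * message: 16 GiB write, 512 KiB compression chunks, 672 KiB
	 * max_extent_size. Illustrative userspace C, not kernel code.
	 */
	#include <inttypes.h>
	#include <stdio.h>

	int main(void)
	{
		const uint64_t total = 16ULL << 30;	/* 16 GiB buffered write */
		const uint64_t chunk = 512 << 10;	/* run_delalloc_compressed() chunk */
		const uint64_t max_ext = 672 << 10;	/* fs_info->max_extent_size */

		/* Delalloc capped at 672 KiB: each range builds ceil(672K/512K) = 2 chunks. */
		uint64_t ranges = (total + max_ext - 1) / max_ext;
		uint64_t capped = ranges * ((max_ext + chunk - 1) / chunk);

		/* Delalloc at BTRFS_MAX_EXTENT_SIZE: 512 KiB chunks tile 16 GiB exactly. */
		uint64_t full = total / chunk;

		printf("capped: %" PRIu64 ", full: %" PRIu64 "\n", capped, full);
		/* Prints "capped: 49934, full: 32768", matching the commit message. */
		return 0;
	}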
On Tue, Jun 13, 2023 at 12:22:49AM +0200, David Sterba wrote:
> On Tue, Jun 13, 2023 at 12:10:29AM +0900, Naohiro Aota wrote:
> > [... patch description trimmed ...]
>
> Missing S-O-B, I've used the one with the @wdc.com address. Added to
> misc-next, thanks.

Ah, I forgot it. Thank you. Just in case:

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
On Tue, Jun 13, 2023 at 12:10:29AM +0900, Naohiro Aota wrote:
> This patch reverts the delalloc size to BTRFS_MAX_EXTENT_SIZE, as it does
> not directly correspond to the size of one extent. Instead, this patch
> limits the allocation size at cow_file_range() for the zoned mode.

Maybe I'm missing something, but that limitation also seems wrong, or at
least suboptimal. There is absolutely no problem in creating a large
allocation in cow_file_range. btrfs_submit_bio will split it into max
append size chunks for I/O, and depending on whether they got reordered
or not, we might even be able to record the entire big allocation as a
single extent on disk.
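As a rough illustration of the splitting Christoph describes, here is a minimal userspace sketch of the idea; the names and structure are hypothetical, the real logic lives in btrfs' bio submission path:

	/*
	 * Minimal sketch (not the real btrfs_submit_bio) of splitting one
	 * large allocation into zone-append-sized I/Os at submission time,
	 * so the allocator itself need not be capped. Names are illustrative.
	 */
	#include <stdint.h>
	#include <stdio.h>

	static void submit_one(uint64_t start, uint64_t len)
	{
		printf("  append bio: start=%llu len=%llu\n",
		       (unsigned long long)start, (unsigned long long)len);
	}

	static void submit_split(uint64_t start, uint64_t len, uint64_t max_append)
	{
		while (len > 0) {
			uint64_t cur = len < max_append ? len : max_append;

			submit_one(start, cur);
			start += cur;
			len -= cur;
		}
	}

	int main(void)
	{
		/* 2 MiB allocation, 672 KiB max zone append size (from this thread). */
		submit_split(0, 2 << 20, 672 << 10);
		return 0;
	}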
On Mon, Jun 12, 2023 at 10:17:23PM -0700, Christoph Hellwig wrote:
> On Tue, Jun 13, 2023 at 12:10:29AM +0900, Naohiro Aota wrote:
> > This patch reverts the delalloc size to BTRFS_MAX_EXTENT_SIZE, as it does
> > not directly correspond to the size of one extent. Instead, this patch
> > limits the allocation size at cow_file_range() for the zoned mode.
>
> Maybe I'm missing something, but that limitation also seems wrong, or at
> least suboptimal. There is absolutely no problem in creating a large
> allocation in cow_file_range. btrfs_submit_bio will split it into max
> append size chunks for I/O, and depending on whether they got reordered
> or not, we might even be able to record the entire big allocation as a
> single extent on disk.

The issue relates to the per-inode metadata reservation pool. For each
outstanding extent, it reserves 16 * node_size to insert the extent item,
considering the worst case.

If we allocate one large extent, it releases the unnecessary bytes from
the pool, as it expects to do only one insertion. Then, that extent is
split again, and it inserts several extents. For those insertions, btrfs
consumes the reserved bytes from the per-inode pool, which is now sized
for only one extent. So, with a big filesystem and a large extent
write-out, we can exhaust that pool and hit a WARN.

And re-charging the pool at split time is impossible, I think... But
things might change as we moved the split time.

Please check the original commit f7b12a62f008 ("btrfs: replace
BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size") and
btrfs_calculate_inode_block_rsv_size() for detail.
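A toy model of the failure mode described above, assuming the 16 * node_size worst-case figure from the message (illustrative C, not the kernel's reservation code):

	/*
	 * Toy model of the per-inode reservation pool problem: reserve for
	 * one large extent, then split it into pieces that each need a
	 * worst-case insertion. Numbers follow the message; everything here
	 * is illustrative, not kernel code.
	 */
	#include <stdint.h>
	#include <stdio.h>

	#define NODESIZE	16384ULL		/* a common btrfs nodesize */
	#define PER_EXTENT	(16 * NODESIZE)		/* worst case per outstanding extent */

	int main(void)
	{
		uint64_t pool = 4 * PER_EXTENT;	/* reserved while 4 extents were expected */

		/* One big allocation succeeds: the "extra" reservation is released. */
		pool = 1 * PER_EXTENT;

		/* Zone append later splits the extent into 4 insertions anyway. */
		uint64_t needed = 4 * PER_EXTENT;

		if (needed > pool)
			printf("pool short by %llu bytes -> WARN in the kernel\n",
			       (unsigned long long)(needed - pool));
		return 0;
	}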
On Tue, Jun 13, 2023 at 02:53:04PM +0900, Naohiro Aota wrote:
> On Mon, Jun 12, 2023 at 10:17:23PM -0700, Christoph Hellwig wrote:
> > [...]
> > Maybe I'm missing something, but that limitation also seems wrong, or at
> > least suboptimal. There is absolutely no problem in creating a large
> > allocation in cow_file_range.
>
> The issue relates to the per-inode metadata reservation pool. For each
> outstanding extent, it reserves 16 * node_size to insert the extent item,
> considering the worst case.
> [...]
> And re-charging the pool at split time is impossible, I think... But
> things might change as we moved the split time.

I'm considering this again. We need to take care of the reservation pool
to ensure metadata insertion will succeed.

But, if we can keep the reservation pool for N (= delalloc size /
fs_info->max_extent_size) extents even when the allocation is done as a
single extent, the reservation should be OK and we can allocate one large
extent.

So, I'm testing the patch below on top of the current misc-next.
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e6944c4db18b..911908ea5f6f 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4452,9 +4452,6 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes,
 	bool for_treelog = (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID);
 	bool for_data_reloc = (btrfs_is_data_reloc_root(root) && is_data);
 
-	if (btrfs_is_zoned(fs_info))
-		ASSERT(num_bytes <= fs_info->max_extent_size);
-
 	flags = get_alloc_profile_by_root(root, is_data);
 again:
 	WARN_ON(num_bytes < fs_info->sectorsize);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5f98bc8d9196..b92c814cec97 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1494,10 +1494,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 	while (num_bytes > 0) {
 		struct btrfs_ordered_extent *ordered;
 
-		if (btrfs_is_zoned(fs_info))
-			cur_alloc_size = min_t(u64, num_bytes, fs_info->max_extent_size);
-		else
-			cur_alloc_size = num_bytes;
+		cur_alloc_size = num_bytes;
 		ret = btrfs_reserve_extent(root, cur_alloc_size, cur_alloc_size,
 					   min_alloc_size, 0, alloc_hint, &ins, 1, 1);
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index a629532283bc..2d12547e4d64 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -151,7 +151,9 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
 		u64 ram_bytes, u64 disk_bytenr, u64 disk_num_bytes,
 		u64 offset, unsigned long flags, int compress_type)
 {
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct btrfs_ordered_extent *entry;
+	int nr_extents = 1;
 	int ret;
 
 	if (flags &
@@ -193,13 +195,21 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
 	INIT_LIST_HEAD(&entry->work_list);
 	init_completion(&entry->completion);
 
+	/*
+	 * We may split the extent due to a device's max_zone_append_size.
+	 * Keep the reservation for the worst case, even we are allocating
+	 * only one ordered extent.
+	 */
+	if (btrfs_is_zoned(fs_info))
+		nr_extents = count_max_extents(fs_info, num_bytes);
+
 	/*
 	 * We don't need the count_max_extents here, we can assume that all of
 	 * that work has been done at higher layers, so this is truly the
 	 * smallest the extent is going to get.
 	 */
 	spin_lock(&inode->lock);
-	btrfs_mod_outstanding_extents(inode, 1);
+	btrfs_mod_outstanding_extents(inode, nr_extents);
 	spin_unlock(&inode->lock);
 
 	return entry;
@@ -580,6 +590,7 @@ void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode,
 	struct rb_node *node;
 	bool pending;
 	bool freespace_inode;
+	int nr_extents = 1;
 
 	/*
 	 * If this is a free space inode the thread has not acquired the ordered
@@ -588,9 +599,12 @@ void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode,
 	freespace_inode = btrfs_is_free_space_inode(btrfs_inode);
 
 	btrfs_lockdep_acquire(fs_info, btrfs_trans_pending_ordered);
+	/* This is paired with btrfs_alloc_ordered_extent. */
+	if (btrfs_is_zoned(fs_info))
+		nr_extents = count_max_extents(fs_info, entry->num_bytes);
 
 	spin_lock(&btrfs_inode->lock);
-	btrfs_mod_outstanding_extents(btrfs_inode, -1);
+	btrfs_mod_outstanding_extents(btrfs_inode, -nr_extents);
 	spin_unlock(&btrfs_inode->lock);
 	if (root != fs_info->tree_root) {
 		u64 release;
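The patch leans on count_max_extents(); for reference, that helper is essentially a round-up division by fs_info->max_extent_size. A hedged reconstruction (see fs/btrfs/fs.h for the real one, which takes fs_info):

	#include <stdint.h>

	/*
	 * Sketch of count_max_extents(): how many max_extent_size pieces a
	 * range of `size` bytes can split into, i.e. a round-up division.
	 * Reconstruction for illustration, not the kernel helper itself.
	 */
	static inline uint32_t count_max_extents(uint64_t max_extent_size, uint64_t size)
	{
		return (size + max_extent_size - 1) / max_extent_size;
	}

With the 672 KiB max_extent_size from this thread, a 128 MiB ordered extent would keep 196 outstanding extents reserved instead of 1.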
On Tue, Jun 13, 2023 at 12:10:29AM +0900, Naohiro Aota wrote:
> [... patch description trimmed ...]
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 911908ea5f6f..e6944c4db18b 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4452,6 +4452,9 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes,
>  	bool for_treelog = (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID);
>  	bool for_data_reloc = (btrfs_is_data_reloc_root(root) && is_data);
>  
> +	if (btrfs_is_zoned(fs_info))
> +		ASSERT(num_bytes <= fs_info->max_extent_size);
> +

BTW, this ASSERT is problematic when it is called from the direct IO
context. So, we need to rework this patch anyway. Sorry for the confusion.
On Wed, Jun 14, 2023 at 01:59:35AM +0000, Naohiro Aota wrote:
> On Tue, Jun 13, 2023 at 02:53:04PM +0900, Naohiro Aota wrote:
> > [...]
> > And re-charging the pool at split time is impossible, I think... But
> > things might change as we moved the split time.
>
> I'm considering this again. We need to take care of the reservation pool
> to ensure metadata insertion will succeed.
>
> But, if we can keep the reservation pool for N (= delalloc size /
> fs_info->max_extent_size) extents even when the allocation is done as a
> single extent, the reservation should be OK and we can allocate one large
> extent.
>
> So, I'm testing the patch below on top of the current misc-next.

I like the idea, but we need to be very careful not to run into rounding
errors in the number-of-reserved-extents calculation. Maybe we should
store the number of reserved extents in the ordered_extent, and then steal
one for each split and just return the ones accounted for the
ordered_extent when removing it?
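To make the rounding concern concrete: splitting a range can leave the pieces needing more reservations than the whole was charged for. A quick check with the thread's 672 KiB max_extent_size (illustrative userspace C, not kernel code):

	/*
	 * Demo of the rounding hazard: after a split, the two pieces can
	 * need more worst-case reservations than the original range held.
	 */
	#include <stdint.h>
	#include <stdio.h>

	static uint64_t count_max_extents(uint64_t max, uint64_t size)
	{
		return (size + max - 1) / max;	/* round-up division */
	}

	int main(void)
	{
		const uint64_t max = 672 << 10;	/* max_extent_size, 672 KiB */
		const uint64_t num = 672 << 10;	/* ordered extent size */
		const uint64_t len = 512 << 10;	/* split-off length */

		printf("before: %llu, after: %llu + %llu\n",
		       (unsigned long long)count_max_extents(max, num),
		       (unsigned long long)count_max_extents(max, len),
		       (unsigned long long)count_max_extents(max, num - len));
		/* before: 1, after: 1 + 1 -> one reservation short */
		return 0;
	}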
On Wed, Jun 14, 2023 at 02:05:48AM +0000, Naohiro Aota wrote:
> On Tue, Jun 13, 2023 at 12:10:29AM +0900, Naohiro Aota wrote:
> > [... patch description trimmed ...]
>
> > +	if (btrfs_is_zoned(fs_info))
> > +		ASSERT(num_bytes <= fs_info->max_extent_size);
> > +
>
> BTW, this ASSERT is problematic when it is called from the direct IO
> context. So, we need to rework this patch anyway. Sorry for the confusion.

Ok, patch removed from misc-next.
On Tue, Jun 13, 2023 at 10:44:27PM -0700, hch@infradead.org wrote:
> On Wed, Jun 14, 2023 at 01:59:35AM +0000, Naohiro Aota wrote:
> > [...]
> > But, if we can keep the reservation pool for N (= delalloc size /
> > fs_info->max_extent_size) extents even when the allocation is done as a
> > single extent, the reservation should be OK and we can allocate one
> > large extent.
> >
> > So, I'm testing the patch below on top of the current misc-next.
>
> I like the idea, but we need to be very careful not to run into rounding
> errors in the number-of-reserved-extents calculation. Maybe we should
> store the number of reserved extents in the ordered_extent, and then
> steal one for each split and just return the ones accounted for the
> ordered_extent when removing it?

Actually, I think we don't need to do that. We can do as in the following
patch, which adds the extent count for the split-out extent (by
alloc_ordered_extent()), decreases the original extent count, and adds
back the count for the left part.

Technically, when the split size is not max_extent_size, the metadata pool
will be short for the outstanding_extents. But that is still acceptable,
because even on regular btrfs we can exceed count_max_extents() depending
on the free space state, and the pool is prepared for the maximum-depth
tree.

I'm testing the patch below, but am still seeing outstanding_extents
unreleased when it deletes an inode.
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index a629532283bc..b140c05e7770 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -1155,6 +1169,7 @@ struct btrfs_ordered_extent *btrfs_split_ordered_extent(
 	struct btrfs_ordered_extent *new;
 	struct rb_node *node;
 	u64 offset = 0;
+	int nr_extents;
 
 	trace_btrfs_ordered_extent_split(inode, ordered);
 
@@ -1181,6 +1196,9 @@ struct btrfs_ordered_extent *btrfs_split_ordered_extent(
 	if (IS_ERR(new))
 		return new;
 
+	nr_extents = count_max_extents(fs_info, ordered->num_bytes);
+	btrfs_mod_outstanding_extents(inode, -nr_extents);
+
 	/* One ref for the tree. */
 	refcount_inc(&new->refs);
 
@@ -1198,6 +1216,9 @@ struct btrfs_ordered_extent *btrfs_split_ordered_extent(
 	ordered->num_bytes -= len;
 	ordered->disk_num_bytes -= len;
 
+	nr_extents = count_max_extents(fs_info, ordered->num_bytes);
+	btrfs_mod_outstanding_extents(inode, nr_extents);
+
 	if (test_bit(BTRFS_ORDERED_IO_DONE, &ordered->flags)) {
 		ASSERT(ordered->bytes_left == 0);
 		new->bytes_left = 0;
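Under the ceil-division behaviour sketched earlier, the net effect of these hunks plus the alloc_ordered_extent() change is zero for a split at a max_extent_size boundary. A quick check (illustrative C with local names, not kernel symbols):

	/*
	 * Net outstanding_extents bookkeeping implied by the two hunks above
	 * plus the alloc_ordered_extent() change from the earlier patch.
	 * Illustrative arithmetic only.
	 */
	#include <stdint.h>
	#include <stdio.h>

	static int64_t count_max_extents(uint64_t max, uint64_t size)
	{
		return (size + max - 1) / max;
	}

	int main(void)
	{
		const uint64_t max = 672 << 10;	/* max_extent_size */
		uint64_t num = 4096 << 10;	/* original ordered extent: 4 MiB */
		uint64_t len = 672 << 10;	/* split-off piece */

		int64_t delta = count_max_extents(max, len)	   /* new extent */
			      - count_max_extents(max, num)	   /* drop old count */
			      + count_max_extents(max, num - len); /* recount remainder */

		printf("net outstanding_extents change: %lld\n", (long long)delta);
		/* 1 - 7 + 6 = 0 for an aligned split; unaligned splits can go short. */
		return 0;
	}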
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 911908ea5f6f..e6944c4db18b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4452,6 +4452,9 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes,
 	bool for_treelog = (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID);
 	bool for_data_reloc = (btrfs_is_data_reloc_root(root) && is_data);
 
+	if (btrfs_is_zoned(fs_info))
+		ASSERT(num_bytes <= fs_info->max_extent_size);
+
 	flags = get_alloc_profile_by_root(root, is_data);
 again:
 	WARN_ON(num_bytes < fs_info->sectorsize);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a91d5ad27984..bd86d9fc2b31 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -371,12 +371,17 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
 				    struct page *locked_page, u64 *start,
 				    u64 *end)
 {
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
 	const u64 orig_start = *start;
 	const u64 orig_end = *end;
-	/* The sanity tests may not set a valid fs_info. */
-	u64 max_bytes = fs_info ? fs_info->max_extent_size : BTRFS_MAX_EXTENT_SIZE;
+	/*
+	 * We don't use fs_info->max_extent_size here. The delalloc range will
+	 * not directly corresponds to the size of an extent. The allocation
+	 * size will be capped by either cow_file_range() (only for zoned) or
+	 * run_delalloc_compressed(). We can give large enough size to collect
+	 * delalloc pages.
+	 */
+	u64 max_bytes = BTRFS_MAX_EXTENT_SIZE;
 	u64 delalloc_start;
 	u64 delalloc_end;
 	bool found;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 27bb1298f73c..3823ecc836c2 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1494,7 +1494,10 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 	while (num_bytes > 0) {
 		struct btrfs_ordered_extent *ordered;
 
-		cur_alloc_size = num_bytes;
+		if (btrfs_is_zoned(fs_info))
+			cur_alloc_size = min_t(u64, num_bytes, fs_info->max_extent_size);
+		else
+			cur_alloc_size = num_bytes;
 		ret = btrfs_reserve_extent(root, cur_alloc_size, cur_alloc_size,
 					   min_alloc_size, 0, alloc_hint, &ins, 1, 1);