Message ID: 20191213040915.3502922-16-naohiro.aota@wdc.com (mailing list archive)
State:      New, archived
Series:     btrfs: zoned block device support
On 12/12/19 11:09 PM, Naohiro Aota wrote:
> To preserve the sequential write pattern on the drives, we must serialize
> allocation and submit_bio. This commit adds a per-block group mutex
> "zone_io_lock", which find_free_extent_zoned() holds. The lock is kept
> even after returning from find_free_extent(). It is released once the IOs
> corresponding to the allocation have been submitted.
>
> Implementing such behavior under __extent_writepage_io() is almost
> impossible because once pages are unlocked we cannot tell when submitting
> IOs for an allocated region has finished. Instead, this commit adds
> run_delalloc_hmzoned() to write out non-compressed data IOs at once using
> extent_write_locked_range(). After the write, we can call
> btrfs_hmzoned_data_io_unlock() to unlock the block group for new
> allocation.
>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

Have you actually tested these patches with lock debugging on? The
submit_compressed_extents stuff is async, so the unlock owner will not be
the lock owner, and that will make all sorts of things blow up. This is
just straight up broken.

I would really rather see a hmzoned block scheduler that just doesn't
submit the bios until they are aligned with the WP, that way this
intelligence doesn't have to be dealt with at the file system layer. I get
allocating in line with the WP, but this whole forcing us to allocate and
submit the bio in lock step is just nuts, and broken in your subsequent
patches. This whole approach needs to be reworked. Thanks,

Josef
On Tue, Dec 17, 2019 at 02:49:44PM -0500, Josef Bacik wrote:
>On 12/12/19 11:09 PM, Naohiro Aota wrote:
>>To preserve the sequential write pattern on the drives, we must serialize
>>allocation and submit_bio. This commit adds a per-block group mutex
>>"zone_io_lock", which find_free_extent_zoned() holds. The lock is kept
>>even after returning from find_free_extent(). It is released once the IOs
>>corresponding to the allocation have been submitted.
>>
>>Implementing such behavior under __extent_writepage_io() is almost
>>impossible because once pages are unlocked we cannot tell when submitting
>>IOs for an allocated region has finished. Instead, this commit adds
>>run_delalloc_hmzoned() to write out non-compressed data IOs at once using
>>extent_write_locked_range(). After the write, we can call
>>btrfs_hmzoned_data_io_unlock() to unlock the block group for new
>>allocation.
>>
>>Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>
>Have you actually tested these patches with lock debugging on? The
>submit_compressed_extents stuff is async, so the unlock owner will
>not be the lock owner, and that will make all sorts of things blow up.
>This is just straight up broken.

Yes, I have run xfstests on this patch series with lockdep and
KASAN. There was no problem with that.

For non-compressed writes, both allocation and submit are done in
run_delalloc_zoned(). Allocation is done in cow_file_range() and
submit is done in extent_write_locked_range(), so both are in the same
context, and locking and unlocking are done by the same execution
context.

For compressed writes, again, allocation/lock is done under
cow_file_range(), submit is done in extent_write_locked_range(), and
unlock happens in submit_compressed_extents() (which is called after
compression), so they are all in the same context and the lock owner
does the unlock.

>I would really rather see a hmzoned block scheduler that just doesn't
>submit the bios until they are aligned with the WP, that way this
>intelligence doesn't have to be dealt with at the file system layer.
>I get allocating in line with the WP, but this whole forcing us to
>allocate and submit the bio in lock step is just nuts, and broken in
>your subsequent patches. This whole approach needs to be reworked.
>Thanks,
>
>Josef

We tried this approach by modifying mq-deadline to wait if the first
queued request is not aligned with the write pointer of a zone. However,
running btrfs without the allocate+submit lock on top of this modified IO
scheduler did not work well at all. With write intensive workloads, we
observed that a very long wait time was very often necessary to get a
fully sequential stream of requests starting at the write pointer of a
zone. The wait time we observed was sometimes larger than 60 seconds,
at which point we gave up.

While we did not extensively dig into the fundamental root cause, these
potentially long wait times can come from a large number of reasons: page
cache writeback behavior, kernel process scheduling, device IO congestion
and writeback throttling, sync, btrfs transaction commits, and cgroup use
could make everything even worse. In the worst case scenario, a number of
out-of-order requests could get stuck in the IO scheduler, preventing
forward progress in the case of memory reclaim writeback and causing the
OOM killer to start happily killing application processes. Furthermore,
IO error handling becomes a nightmare, as the block layer scheduler would
need to issue report zones commands to re-sync the zone write pointer in
case of a write error. And that is in addition to having to track other
zone commands that change a zone write pointer, such as reset zone and
finish zone.

Considering all this, handling the sequential write constraint at the
file system layer by ensuring that write BIOs are issued in the correct
order starting from a zone write pointer is far simpler and removes
dependencies on other features such as cgroups, congestion control and
other throttling mechanisms. The IO scheduler can always dispatch the
requests it receives to the device without any waiting time, ensuring
forward progress.

The mq-deadline IO scheduler supports not only regular block devices but
also zoned block devices, it is the default scheduler for them, and other
schedulers that are not zone compliant cannot be selected (one cannot
switch to kyber nor bfq). This ensures that the default system behavior
will be correct as long as the user (the FS) respects the sequential
write rule.

The previous approach I proposed, using a btrfs request reordering stage,
was indeed very invasive and, similarly to the block layer scheduler
changes, could cause problems with cgroups etc. The new approach of this
patch, using locking to make allocation + bio issuing atomic, results in
per-zone sequential write patterns no matter what happens around it. It
is less invasive and relies on the sequential allocation of blocks for
the ordering of write IOs, so there is no explicit reordering and no
additional overhead. The f2fs implementation has used a similar approach
since kernel 4.10 and has proven to be very solid.

In light of these arguments and explanations, do you still think the
allocate zone locking approach is not acceptable?
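To illustrate the lock lifetime discussed above, the call sequence for one
non-compressed delalloc range looks roughly like this. This is only an
illustration using the helpers this patch adds (error handling and the
compressed path are omitted), not the actual code:

/*
 * Illustration only: lifetime of zone_io_lock for one non-compressed
 * delalloc range.  The real code lives in find_free_extent_zoned()
 * and run_delalloc_hmzoned() in the patch below.
 */
static void zone_io_lock_lifetime_example(struct inode *inode,
                                          struct btrfs_block_group *cache,
                                          u64 start, u64 end, u64 logical)
{
        /*
         * 1) find_free_extent_zoned() takes the lock, carves the extent
         *    at cache->alloc_offset and keeps the lock held on success.
         */
        btrfs_hmzoned_data_io_lock(cache);

        /*
         * 2) run_delalloc_hmzoned() immediately writes back the whole
         *    range, so the bios are submitted while the lock is held.
         */
        extent_write_locked_range(inode, start, end, WB_SYNC_ALL);

        /* 3) Only now can another allocation in this block group proceed. */
        btrfs_hmzoned_data_io_unlock_logical(btrfs_sb(inode->i_sb), logical);
}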
On 12/19/19 1:54 AM, Naohiro Aota wrote:
> On Tue, Dec 17, 2019 at 02:49:44PM -0500, Josef Bacik wrote:
>> On 12/12/19 11:09 PM, Naohiro Aota wrote:
>>> To preserve the sequential write pattern on the drives, we must serialize
>>> allocation and submit_bio. This commit adds a per-block group mutex
>>> "zone_io_lock", which find_free_extent_zoned() holds. The lock is kept
>>> even after returning from find_free_extent(). It is released once the IOs
>>> corresponding to the allocation have been submitted.
>>>
>>> Implementing such behavior under __extent_writepage_io() is almost
>>> impossible because once pages are unlocked we cannot tell when submitting
>>> IOs for an allocated region has finished. Instead, this commit adds
>>> run_delalloc_hmzoned() to write out non-compressed data IOs at once using
>>> extent_write_locked_range(). After the write, we can call
>>> btrfs_hmzoned_data_io_unlock() to unlock the block group for new
>>> allocation.
>>>
>>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>
>> Have you actually tested these patches with lock debugging on? The
>> submit_compressed_extents stuff is async, so the unlock owner will not be
>> the lock owner, and that will make all sorts of things blow up. This is
>> just straight up broken.
>
> Yes, I have run xfstests on this patch series with lockdep and
> KASAN. There was no problem with that.
>
> For non-compressed writes, both allocation and submit are done in
> run_delalloc_zoned(). Allocation is done in cow_file_range() and
> submit is done in extent_write_locked_range(), so both are in the same
> context, and locking and unlocking are done by the same execution
> context.
>
> For compressed writes, again, allocation/lock is done under
> cow_file_range(), submit is done in extent_write_locked_range(), and
> unlock happens in submit_compressed_extents() (which is called after
> compression), so they are all in the same context and the lock owner
> does the unlock.
>
>> I would really rather see a hmzoned block scheduler that just doesn't
>> submit the bios until they are aligned with the WP, that way this
>> intelligence doesn't have to be dealt with at the file system layer. I get
>> allocating in line with the WP, but this whole forcing us to allocate and
>> submit the bio in lock step is just nuts, and broken in your subsequent
>> patches. This whole approach needs to be reworked. Thanks,
>>
>> Josef
>
> We tried this approach by modifying mq-deadline to wait if the first
> queued request is not aligned with the write pointer of a zone. However,
> running btrfs without the allocate+submit lock on top of this modified IO
> scheduler did not work well at all. With write intensive workloads, we
> observed that a very long wait time was very often necessary to get a
> fully sequential stream of requests starting at the write pointer of a
> zone. The wait time we observed was sometimes larger than 60 seconds,
> at which point we gave up.

This is because we will only write out the pages we've been handed but
do cow_file_range() for a possibly larger delalloc range, so as you say
there can be a large gap in time between writing one part of the range
and writing the next part.

You actually solve this with your patch, by doing the cow_file_range()
and then following it up with the extent_write_locked_range() for the
range you just cow'ed. There is no need for the locking in this case:
you could simply do that and then have a modified block scheduler that
keeps the bios in the correct order. I imagine if you just did this with
your original block layer approach it would work fine (a rough sketch of
what I mean follows below). Thanks,

Josef
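Roughly what I mean, sketched from your run_delalloc_hmzoned() with the
zone_io_lock calls (and the extent_map lookup that only exists to find the
logical address for the unlock) dropped. Purely illustrative, not a tested
implementation:

static noinline int run_delalloc_zoned_nolock(struct inode *inode,
                                              struct page *locked_page,
                                              u64 start, u64 end,
                                              int *page_started,
                                              unsigned long *nr_written)
{
        int ret;

        /* Allocate extents for the whole delalloc range... */
        ret = cow_file_range(inode, locked_page, start, end,
                             page_started, nr_written, 0);
        if (ret || *page_started)
                return ret;

        /*
         * ...and write it back right away, so the bios for the range are
         * issued close together.  Ordering them at the zone write pointer
         * would then be left to a zone-aware IO scheduler instead of a
         * block group mutex.
         */
        __set_page_dirty_nobuffers(locked_page);
        account_page_redirty(locked_page);
        extent_write_locked_range(inode, start, end, WB_SYNC_ALL);
        *page_started = 1;

        return 0;
}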
On Thu, Dec 19, 2019 at 09:01:35AM -0500, Josef Bacik wrote:
>On 12/19/19 1:54 AM, Naohiro Aota wrote:
>>On Tue, Dec 17, 2019 at 02:49:44PM -0500, Josef Bacik wrote:
>>>On 12/12/19 11:09 PM, Naohiro Aota wrote:
>>>>To preserve the sequential write pattern on the drives, we must serialize
>>>>allocation and submit_bio. This commit adds a per-block group mutex
>>>>"zone_io_lock", which find_free_extent_zoned() holds. The lock is kept
>>>>even after returning from find_free_extent(). It is released once the IOs
>>>>corresponding to the allocation have been submitted.
>>>>
>>>>Implementing such behavior under __extent_writepage_io() is almost
>>>>impossible because once pages are unlocked we cannot tell when submitting
>>>>IOs for an allocated region has finished. Instead, this commit adds
>>>>run_delalloc_hmzoned() to write out non-compressed data IOs at once using
>>>>extent_write_locked_range(). After the write, we can call
>>>>btrfs_hmzoned_data_io_unlock() to unlock the block group for new
>>>>allocation.
>>>>
>>>>Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>>
>>>Have you actually tested these patches with lock debugging on?
>>>The submit_compressed_extents stuff is async, so the unlock
>>>owner will not be the lock owner, and that will make all sorts of
>>>things blow up. This is just straight up broken.
>>
>>Yes, I have run xfstests on this patch series with lockdep and
>>KASAN. There was no problem with that.
>>
>>For non-compressed writes, both allocation and submit are done in
>>run_delalloc_zoned(). Allocation is done in cow_file_range() and
>>submit is done in extent_write_locked_range(), so both are in the same
>>context, and locking and unlocking are done by the same execution
>>context.
>>
>>For compressed writes, again, allocation/lock is done under
>>cow_file_range(), submit is done in extent_write_locked_range(), and
>>unlock happens in submit_compressed_extents() (which is called after
>>compression), so they are all in the same context and the lock owner
>>does the unlock.
>>
>>>I would really rather see a hmzoned block scheduler that just
>>>doesn't submit the bios until they are aligned with the WP, that
>>>way this intelligence doesn't have to be dealt with at the file
>>>system layer. I get allocating in line with the WP, but this whole
>>>forcing us to allocate and submit the bio in lock step is just
>>>nuts, and broken in your subsequent patches. This whole approach
>>>needs to be reworked. Thanks,
>>>
>>>Josef
>>
>>We tried this approach by modifying mq-deadline to wait if the first
>>queued request is not aligned with the write pointer of a zone. However,
>>running btrfs without the allocate+submit lock on top of this modified IO
>>scheduler did not work well at all. With write intensive workloads, we
>>observed that a very long wait time was very often necessary to get a
>>fully sequential stream of requests starting at the write pointer of a
>>zone. The wait time we observed was sometimes larger than 60 seconds,
>>at which point we gave up.
>
>This is because we will only write out the pages we've been handed but
>do cow_file_range() for a possibly larger delalloc range, so as you
>say there can be a large gap in time between writing one part of the
>range and writing the next part.
>
>You actually solve this with your patch, by doing the cow_file_range()
>and then following it up with the extent_write_locked_range() for the
>range you just cow'ed.
>
>There is no need for the locking in this case, you could simply do
>that and then have a modified block scheduler that keeps the bios in
>the correct order. I imagine if you just did this with your original
>block layer approach it would work fine. Thanks,
>
>Josef

We have once again tried the btrfs SMR (zoned block device) support
series without the locking around extent allocation and bio issuing,
with a modified version of mq-deadline as the scheduler for the block
layer. As you already know, mq-deadline orders read and write requests
separately in increasing sector order, which is essential for SMR
sequential writing. However, mq-deadline does not provide any guarantee
regarding the completeness of a sequential write stream. If there are
missing requests ("holes" in the write stream), mq-deadline will still
dispatch the next write request in order, leading to write errors on SMR
drives.

The modification we added to mq-deadline is the addition of a wait time
when a hole in a sequential write stream is discovered. This is somewhat
reminiscent of the old anticipatory scheduler. The wait time is limited,
so if a hole is not filled by newly inserted requests before the timeout
elapses, write requests are issued as is (and errors happen on SMR). The
timeout we used initially was set to the value of
"/sys/block/<dev>/queue/iosched/write_expire", which is 5 seconds.

With this, tests show that unaligned write errors happen with a simple
workload of 48 threads simultaneously doing write() to their dedicated
file and fdatasync() (the code of the application doing this is attached
to this email). Despite the wait time of 5 seconds, the holes in a
zone's sequential write stream are not filled by issued BIOs because of
a "buffer bloat": first, a bio whose LBA is not aligned with the write
pointer reaches the IO scheduler (call it bio#1). To proceed with bio#1,
the IO scheduler must wait for a hole-filling bio aligned with the write
pointer (call it bio#0). If bio#1 is large, the scheduler needs to split
it into a large number of requests. Each request must first obtain a
scheduler tag to be inserted into the scheduler queue. Since the number
of scheduler tags is limited and tags are freed only on completion of
queued and in-flight requests, the requests of bio#1 can use up all the
tags. This is not a problem if forward progress is made (i.e., requests
are dispatched to the disk), but if all tag-holding requests in the
scheduler come from bio#1 and subsequent writes in the sequence, they
are all waiting for bio#0 to be issued. We thus end up with a soft
deadlock on request issuing and no possibility of progress. That results
in the timeout triggering, no matter how large we set it, and in
unaligned write errors. Large bios needing lots of requests for
processing will trigger this problem all the time.

In addition to the unaligned write errors, we also observed hung_task
timeouts with a larger timeout value. The reason is the same as above:
writing threads get stuck in blk_mq_get_tag() trying to acquire a
scheduler tag. We hit hung_task even more often than unaligned writes
when increasing the timeout.

Jan 07 11:17:11 naota-devel kernel: INFO: task multi-proc-writ:2202 blocked for more than 122 seconds.
Jan 07 11:17:11 naota-devel kernel: Not tainted 5.4.0-rc8-BTRFS-ZNS+ #165
Jan 07 11:17:11 naota-devel kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 07 11:17:11 naota-devel kernel: multi-proc-writ D 0 2202 2168 0x00004000
Jan 07 11:17:11 naota-devel kernel: Call Trace:
Jan 07 11:17:11 naota-devel kernel: __schedule+0x8ab/0x1db0
Jan 07 11:17:11 naota-devel kernel: ? pci_mmcfg_check_reserved+0x130/0x130
Jan 07 11:17:11 naota-devel kernel: ? blk_insert_cloned_request+0x3e0/0x3e0
Jan 07 11:17:11 naota-devel kernel: schedule+0xdb/0x260
Jan 07 11:17:11 naota-devel kernel: io_schedule+0x21/0x70
Jan 07 11:17:11 naota-devel kernel: blk_mq_get_tag+0x3b6/0x940
Jan 07 11:17:11 naota-devel kernel: ? __blk_mq_tag_idle+0x80/0x80
Jan 07 11:17:11 naota-devel kernel: ? finish_wait+0x270/0x270
Jan 07 11:17:11 naota-devel kernel: blk_mq_get_request+0x340/0x1750
Jan 07 11:17:11 naota-devel kernel: blk_mq_make_request+0x339/0x1bd0
Jan 07 11:17:11 naota-devel kernel: ? blk_queue_enter+0x8a4/0xa30
Jan 07 11:17:11 naota-devel kernel: ? blk_mq_try_issue_directly+0x150/0x150
Jan 07 11:17:11 naota-devel kernel: generic_make_request+0x20c/0xa70
Jan 07 11:17:11 naota-devel kernel: ? blk_queue_enter+0xa30/0xa30
Jan 07 11:17:11 naota-devel kernel: ? find_held_lock+0x35/0x130
Jan 07 11:17:11 naota-devel kernel: ? __kasan_check_read+0x11/0x20
Jan 07 11:17:11 naota-devel kernel: submit_bio+0xd5/0x3c0
Jan 07 11:17:11 naota-devel kernel: ? submit_bio+0xd5/0x3c0
Jan 07 11:17:11 naota-devel kernel: ? generic_make_request+0xa70/0xa70
Jan 07 11:17:11 naota-devel kernel: btrfs_map_bio+0x5f5/0xfb0 [btrfs]
Jan 07 11:17:11 naota-devel kernel: ? btrfs_rmap_block+0x820/0x820 [btrfs]
Jan 07 11:17:11 naota-devel kernel: ? unlock_page+0x9f/0x110
Jan 07 11:17:11 naota-devel kernel: ? __extent_writepage+0x5aa/0x800 [btrfs]
Jan 07 11:17:11 naota-devel kernel: ? lock_downgrade+0x770/0x770
Jan 07 11:17:11 naota-devel kernel: btrfs_submit_bio_hook+0x336/0x600 [btrfs]
Jan 07 11:17:11 naota-devel kernel: ? btrfs_fiemap+0x50/0x50 [btrfs]
Jan 07 11:17:11 naota-devel kernel: submit_one_bio+0xba/0x130 [btrfs]
Jan 07 11:17:11 naota-devel kernel: extent_write_locked_range+0x2f9/0x3e0 [btrfs]
Jan 07 11:17:11 naota-devel kernel: ? extent_write_full_page+0x1f0/0x1f0 [btrfs]
Jan 07 11:17:11 naota-devel kernel: ? lock_downgrade+0x770/0x770
Jan 07 11:17:11 naota-devel kernel: ? account_page_redirty+0x2bb/0x490
Jan 07 11:17:11 naota-devel kernel: run_delalloc_zoned+0x108/0x2f0 [btrfs]
Jan 07 11:17:11 naota-devel kernel: btrfs_run_delalloc_range+0xc4b/0x1170 [btrfs]
Jan 07 11:17:11 naota-devel kernel: ? test_range_bit+0x360/0x360 [btrfs]
Jan 07 11:17:11 naota-devel kernel: ? find_get_pages_range_tag+0x6f8/0x9d0
Jan 07 11:17:11 naota-devel kernel: ? sched_clock_cpu+0x1b/0x170
Jan 07 11:17:11 naota-devel kernel: ? mark_lock+0xc0/0x1160
Jan 07 11:17:11 naota-devel kernel: writepage_delalloc+0x11e/0x270 [btrfs]
Jan 07 11:17:11 naota-devel kernel: ? find_lock_delalloc_range+0x400/0x400 [btrfs]
Jan 07 11:17:11 naota-devel kernel: ? rcu_read_lock_sched_held+0xa1/0xd0
Jan 07 11:17:11 naota-devel kernel: ? rcu_read_lock_bh_held+0xb0/0xb0
Jan 07 11:17:11 naota-devel kernel: __extent_writepage+0x3a2/0x800 [btrfs]
Jan 07 11:17:11 naota-devel kernel: ? lock_downgrade+0x770/0x770
Jan 07 11:17:11 naota-devel kernel: ? __do_readpage+0x13a0/0x13a0 [btrfs]
Jan 07 11:17:11 naota-devel kernel: ? clear_page_dirty_for_io+0x32a/0x6e0
Jan 07 11:17:11 naota-devel kernel: ? __kasan_check_read+0x11/0x20
Jan 07 11:17:11 naota-devel kernel: extent_write_cache_pages+0x61c/0xaf0 [btrfs]
Jan 07 11:17:11 naota-devel kernel: ? __extent_writepage+0x800/0x800 [btrfs]
Jan 07 11:17:11 naota-devel kernel: ? __kasan_check_read+0x11/0x20
Jan 07 11:17:11 naota-devel kernel: ? mark_lock+0xc0/0x1160
Jan 07 11:17:11 naota-devel kernel: ? sched_clock_cpu+0x1b/0x170
Jan 07 11:17:11 naota-devel kernel: ? __kasan_check_read+0x11/0x20
Jan 07 11:17:11 naota-devel kernel: extent_writepages+0xf8/0x1a0 [btrfs]
Jan 07 11:17:11 naota-devel kernel: ? __kasan_check_read+0x11/0x20
Jan 07 11:17:11 naota-devel kernel: ? extent_write_locked_range+0x3e0/0x3e0 [btrfs]
Jan 07 11:17:12 naota-devel kernel: ? find_held_lock+0x35/0x130
Jan 07 11:17:12 naota-devel kernel: ? __kasan_check_read+0x11/0x20
Jan 07 11:17:12 naota-devel kernel: btrfs_writepages+0xe/0x10 [btrfs]
Jan 07 11:17:12 naota-devel kernel: do_writepages+0xe0/0x270
Jan 07 11:17:12 naota-devel kernel: ? lock_downgrade+0x770/0x770
Jan 07 11:17:12 naota-devel kernel: ? page_writeback_cpu_online+0x20/0x20
Jan 07 11:17:12 naota-devel kernel: ? __kasan_check_read+0x11/0x20
Jan 07 11:17:12 naota-devel kernel: ? do_raw_spin_unlock+0x59/0x250
Jan 07 11:17:12 naota-devel kernel: ? _raw_spin_unlock+0x28/0x40
Jan 07 11:17:12 naota-devel kernel: ? wbc_attach_and_unlock_inode+0x432/0x840
Jan 07 11:17:12 naota-devel kernel: __filemap_fdatawrite_range+0x264/0x340
Jan 07 11:17:12 naota-devel kernel: ? tty_ldisc_deref+0x35/0x40
Jan 07 11:17:12 naota-devel kernel: ? delete_from_page_cache_batch+0xab0/0xab0
Jan 07 11:17:12 naota-devel kernel: filemap_fdatawrite_range+0x13/0x20
Jan 07 11:17:12 naota-devel kernel: btrfs_fdatawrite_range+0x4d/0xf0 [btrfs]
Jan 07 11:17:12 naota-devel kernel: btrfs_sync_file+0x235/0xb30 [btrfs]
Jan 07 11:17:12 naota-devel kernel: ? rcu_read_lock_sched_held+0xd0/0xd0
Jan 07 11:17:12 naota-devel kernel: ? btrfs_file_write_iter+0x1430/0x1430 [btrfs]
Jan 07 11:17:12 naota-devel kernel: ? do_dup2+0x440/0x440
Jan 07 11:17:12 naota-devel kernel: ? __x64_sys_futex+0x29b/0x3f0
Jan 07 11:17:12 naota-devel kernel: ? ksys_write+0x1c3/0x220
Jan 07 11:17:12 naota-devel kernel: ? btrfs_file_write_iter+0x1430/0x1430 [btrfs]
Jan 07 11:17:12 naota-devel kernel: vfs_fsync_range+0xf6/0x220
Jan 07 11:17:12 naota-devel kernel: ? __fget_light+0x184/0x1f0
Jan 07 11:17:12 naota-devel kernel: do_fsync+0x3d/0x70
Jan 07 11:17:12 naota-devel kernel: ? trace_hardirqs_on+0x28/0x190
Jan 07 11:17:12 naota-devel kernel: __x64_sys_fdatasync+0x36/0x50
Jan 07 11:17:12 naota-devel kernel: do_syscall_64+0xa4/0x4b0
Jan 07 11:17:12 naota-devel kernel: entry_SYSCALL_64_after_hwframe+0x49/0xbe
Jan 07 11:17:12 naota-devel kernel: RIP: 0033:0x7f7ba395f9bf
Jan 07 11:17:12 naota-devel kernel: Code: Bad RIP value.
Jan 07 11:17:12 naota-devel kernel: RSP: 002b:00007f7ba385de80 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
Jan 07 11:17:12 naota-devel kernel: RAX: ffffffffffffffda RBX: 0000000000100000 RCX: 00007f7ba395f9bf
Jan 07 11:17:12 naota-devel kernel: RDX: 0000000000000001 RSI: 0000000000000081 RDI: 0000000000000003
Jan 07 11:17:12 naota-devel kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000404198
Jan 07 11:17:12 naota-devel kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000100000
Jan 07 11:17:12 naota-devel kernel: R13: 0000000000000000 R14: 00007f7ba2f5d010 R15: 00000000008592a0

Considering the above cases, I do not think it is possible to implement
such a "waiting IO scheduler" that would allow removing the mutex around
block allocation and bio issuing.
Such a method would require an intermediate bio reordering layer, either
using a device mapper, or as was implemented initially, directly in btrfs
(but that is now a layering violation, so we do not want that). Entirely
relying on the block layer to achieve a perfect sequential write request
sequence is fragile.

The current block layer interface semantics for zoned block devices are:
"If BIOs are issued sequentially, they will be dispatched to the drive in
the same order, sequentially." That directly reflects the drive
constraint, so this is compatible with other regular block devices in the
sense that no intelligence is added to try to create sequential streams
of requests when the issuer is not issuing the said requests in perfect
order. Trying to change this interface to something like "OK, I can
accept some out-of-order writes, but you must quickly fill the holes in
the stream" cannot be implemented directly in the block layer. A device
mapper should be used for that, but if we do so, then one could argue
that all SMR support can simply rely on dm-zoned, which is really
sub-optimal from a performance perspective. We can do much better than
dm-zoned with direct support in btrfs, but that support requires
guarantees of sequential write BIO issuing. The current implementation
relies on a mutex for that, which, considering the complexity of
dm-zoned, is a *very* simple and clean solution.
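For reference, the workload mentioned above had roughly this shape: 48
writer threads, each doing write() followed by fdatasync() on its own
file. This is only a minimal illustration of the pattern, not the
reproducer attached earlier; file names, buffer size and write count are
arbitrary:

/* build with: gcc -O2 -pthread multi-proc-write.c -o multi-proc-write */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NR_THREADS 48
#define BUF_SIZE   (1024 * 1024)     /* 1 MiB per write() */
#define NR_WRITES  256               /* writes per thread */

static void *writer(void *arg)
{
        char path[64], *buf;
        int i, fd;

        /* each thread writes to its own dedicated file */
        snprintf(path, sizeof(path), "file-%ld", (long)arg);
        fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
                perror("open");
                return NULL;
        }
        buf = malloc(BUF_SIZE);
        memset(buf, 0xab, BUF_SIZE);

        for (i = 0; i < NR_WRITES; i++) {
                if (write(fd, buf, BUF_SIZE) != BUF_SIZE)
                        perror("write");
                if (fdatasync(fd))
                        perror("fdatasync");
        }
        free(buf);
        close(fd);
        return NULL;
}

int main(void)
{
        pthread_t th[NR_THREADS];
        long i;

        for (i = 0; i < NR_THREADS; i++)
                pthread_create(&th[i], NULL, writer, (void *)i);
        for (i = 0; i < NR_THREADS; i++)
                pthread_join(th[i], NULL);
        return 0;
}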
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index e78d34a4fb56..6f7d29171adf 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1642,6 +1642,7 @@ static struct btrfs_block_group *btrfs_create_block_group_cache(
 	btrfs_init_free_space_ctl(cache);
 	atomic_set(&cache->trimming, 0);
 	mutex_init(&cache->free_space_lock);
+	mutex_init(&cache->zone_io_lock);
 	btrfs_init_full_stripe_locks_tree(&cache->full_stripe_locks_root);
 
 	return cache;
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 347605654021..57c8d6f4b3d1 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -165,6 +165,7 @@ struct btrfs_block_group {
 	 * enabled.
 	 */
 	u64 alloc_offset;
+	struct mutex zone_io_lock;
 };
 
 #ifdef CONFIG_BTRFS_DEBUG
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e61f69eef4a8..d1f326b6c4d4 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3699,6 +3699,7 @@ static int find_free_extent_zoned(struct btrfs_block_group *cache,
 
 	ASSERT(btrfs_fs_incompat(cache->fs_info, HMZONED));
 
+	btrfs_hmzoned_data_io_lock(cache);
 	spin_lock(&space_info->lock);
 	spin_lock(&cache->lock);
 
@@ -3729,6 +3730,9 @@ static int find_free_extent_zoned(struct btrfs_block_group *cache,
 out:
 	spin_unlock(&cache->lock);
 	spin_unlock(&space_info->lock);
+	/* if succeeds, unlock after submit_bio */
+	if (ret)
+		btrfs_hmzoned_data_io_unlock(cache);
 	return ret;
 }
 
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index ddec6aed7283..f6682ead575b 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -12,6 +12,7 @@
 #include <linux/blkdev.h>
 #include "volumes.h"
 #include "disk-io.h"
+#include "block-group.h"
 
 struct btrfs_zoned_device_info {
 	/*
@@ -48,6 +49,7 @@ int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
 void btrfs_redirty_list_add(struct btrfs_transaction *trans,
 			    struct extent_buffer *eb);
 void btrfs_free_redirty_list(struct btrfs_transaction *trans);
+void btrfs_hmzoned_data_io_unlock_at(struct inode *inode, u64 start, u64 len);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -116,6 +118,8 @@ static inline int btrfs_reset_device_zone(struct btrfs_device *device,
 static inline void btrfs_redirty_list_add(struct btrfs_transaction *trans,
 					  struct extent_buffer *eb) { }
 static inline void btrfs_free_redirty_list(struct btrfs_transaction *trans) { }
+static inline void btrfs_hmzoned_data_io_unlock_at(struct inode *inode,
+						   u64 start, u64 len) { }
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
@@ -218,4 +222,36 @@ static inline bool btrfs_can_zone_reset(struct btrfs_device *device,
 	return true;
 }
 
+static inline void btrfs_hmzoned_data_io_lock(
+	struct btrfs_block_group *cache)
+{
+	/* No need to lock metadata BGs or non-sequential BGs */
+	if (!btrfs_fs_incompat(cache->fs_info, HMZONED) ||
+	    !(cache->flags & BTRFS_BLOCK_GROUP_DATA))
+		return;
+	mutex_lock(&cache->zone_io_lock);
+}
+
+static inline void btrfs_hmzoned_data_io_unlock(
+	struct btrfs_block_group *cache)
+{
+	if (!btrfs_fs_incompat(cache->fs_info, HMZONED) ||
+	    !(cache->flags & BTRFS_BLOCK_GROUP_DATA))
+		return;
+	mutex_unlock(&cache->zone_io_lock);
+}
+
+static inline void btrfs_hmzoned_data_io_unlock_logical(
+	struct btrfs_fs_info *fs_info, u64 logical)
+{
+	struct btrfs_block_group *cache;
+
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return;
+
+	cache = btrfs_lookup_block_group(fs_info, logical);
+	btrfs_hmzoned_data_io_unlock(cache);
+	btrfs_put_block_group(cache);
+}
+
 #endif
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 56032c518b26..3677c36999d8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -49,6 +49,7 @@
 #include "qgroup.h"
 #include "delalloc-space.h"
 #include "block-group.h"
+#include "hmzoned.h"
 
 struct btrfs_iget_args {
 	struct btrfs_key *location;
@@ -1325,6 +1326,39 @@ static int cow_file_range_async(struct inode *inode,
 	return 0;
 }
 
+static noinline int run_delalloc_hmzoned(struct inode *inode,
+					 struct page *locked_page, u64 start,
+					 u64 end, int *page_started,
+					 unsigned long *nr_written)
+{
+	struct extent_map *em;
+	u64 logical;
+	int ret;
+
+	ret = cow_file_range(inode, locked_page, start, end,
+			     page_started, nr_written, 0);
+	if (ret)
+		return ret;
+
+	if (*page_started)
+		return 0;
+
+	em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start, end - start + 1,
+			      0);
+	ASSERT(em != NULL && em->block_start < EXTENT_MAP_LAST_BYTE);
+	logical = em->block_start;
+	free_extent_map(em);
+
+	__set_page_dirty_nobuffers(locked_page);
+	account_page_redirty(locked_page);
+	extent_write_locked_range(inode, start, end, WB_SYNC_ALL);
+	*page_started = 1;
+
+	btrfs_hmzoned_data_io_unlock_logical(btrfs_sb(inode->i_sb), logical);
+
+	return 0;
+}
+
 static noinline int csum_exist_in_range(struct btrfs_fs_info *fs_info,
 					u64 bytenr, u64 num_bytes)
 {
@@ -1737,17 +1771,24 @@ int btrfs_run_delalloc_range(struct inode *inode, struct page *locked_page,
 {
 	int ret;
 	int force_cow = need_force_cow(inode, start, end);
+	int do_compress = inode_can_compress(inode) &&
+		inode_need_compress(inode, start, end);
+	int hmzoned = btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED);
 
 	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW && !force_cow) {
+		ASSERT(!hmzoned);
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
 					 page_started, 1, nr_written);
 	} else if (BTRFS_I(inode)->flags & BTRFS_INODE_PREALLOC && !force_cow) {
+		ASSERT(!hmzoned);
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
 					 page_started, 0, nr_written);
-	} else if (!inode_can_compress(inode) ||
-		   !inode_need_compress(inode, start, end)) {
+	} else if (!do_compress && !hmzoned) {
 		ret = cow_file_range(inode, locked_page, start, end,
 				     page_started, nr_written, 1);
+	} else if (!do_compress && hmzoned) {
+		ret = run_delalloc_hmzoned(inode, locked_page, start, end,
+					   page_started, nr_written);
 	} else {
 		set_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
 			&BTRFS_I(inode)->runtime_flags);
To preserve the sequential write pattern on the drives, we must serialize
allocation and submit_bio. This commit adds a per-block group mutex
"zone_io_lock", which find_free_extent_zoned() holds. The lock is kept
even after returning from find_free_extent(). It is released once the IOs
corresponding to the allocation have been submitted.

Implementing such behavior under __extent_writepage_io() is almost
impossible because once pages are unlocked we cannot tell when submitting
IOs for an allocated region has finished. Instead, this commit adds
run_delalloc_hmzoned() to write out non-compressed data IOs at once using
extent_write_locked_range(). After the write, we can call
btrfs_hmzoned_data_io_unlock() to unlock the block group for new
allocation.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.c |  1 +
 fs/btrfs/block-group.h |  1 +
 fs/btrfs/extent-tree.c |  4 ++++
 fs/btrfs/hmzoned.h     | 36 +++++++++++++++++++++++++++++++++
 fs/btrfs/inode.c       | 45 ++++++++++++++++++++++++++++++++++++++++--
 5 files changed, 85 insertions(+), 2 deletions(-)