
[v6,15/28] btrfs: serialize data allocation and submit IOs

Message ID 20191213040915.3502922-16-naohiro.aota@wdc.com (mailing list archive)
State New, archived
Series btrfs: zoned block device support

Commit Message

Naohiro Aota Dec. 13, 2019, 4:09 a.m. UTC
To preserve the sequential write pattern on the drives, we must serialize
allocation and submit_bio. This commit adds a per-block group mutex
"zone_io_lock"; find_free_extent_zoned() holds the lock. The lock is kept
even after returning from find_free_extent(). It is released once the
IOs corresponding to the allocation have been submitted.

Implementing such behavior under __extent_writepage_io() is almost
impossible because, once pages are unlocked, we cannot tell whether the
submission of IOs for an allocated region has finished. Instead, this
commit adds run_delalloc_hmzoned() to write out non-compressed data IOs
at once using extent_write_locked_range(). After the write, we can call
btrfs_hmzoned_data_io_unlock() to unlock the block group for new
allocations.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.c |  1 +
 fs/btrfs/block-group.h |  1 +
 fs/btrfs/extent-tree.c |  4 ++++
 fs/btrfs/hmzoned.h     | 36 +++++++++++++++++++++++++++++++++
 fs/btrfs/inode.c       | 45 ++++++++++++++++++++++++++++++++++++++++--
 5 files changed, 85 insertions(+), 2 deletions(-)
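
For reference, a self-contained user-space analogue of the locking
pattern this patch introduces (a per-block group mutex held from extent
allocation until the corresponding IO has been submitted) could look
like the sketch below. The names are illustrative only; the actual
implementation is in the patch at the bottom of this page.

/*
 * Minimal pthreads sketch: the same execution context that allocates
 * space at the zone write pointer also submits the IO and only then
 * drops the per-block-group lock, so writes reach the device in
 * allocation order. This is not the btrfs code.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

struct block_group {
	pthread_mutex_t zone_io_lock;	/* corresponds to cache->zone_io_lock */
	uint64_t write_pointer;		/* next sequential offset in the zone */
};

/* analogue of find_free_extent_zoned(): allocate at the WP, keep the lock */
static uint64_t alloc_at_write_pointer(struct block_group *bg, uint64_t len)
{
	uint64_t start;

	pthread_mutex_lock(&bg->zone_io_lock);
	start = bg->write_pointer;
	bg->write_pointer += len;
	return start;			/* zone_io_lock is still held here */
}

/* analogue of extent_write_locked_range() + btrfs_hmzoned_data_io_unlock() */
static void submit_and_unlock(struct block_group *bg, uint64_t start, uint64_t len)
{
	printf("submit_bio for [%llu, %llu)\n",
	       (unsigned long long)start, (unsigned long long)(start + len));
	pthread_mutex_unlock(&bg->zone_io_lock);
}

int main(void)
{
	struct block_group bg = {
		.zone_io_lock = PTHREAD_MUTEX_INITIALIZER,
		.write_pointer = 0,
	};
	uint64_t start = alloc_at_write_pointer(&bg, 4096);

	submit_and_unlock(&bg, start, 4096);
	return 0;
}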

Comments

Josef Bacik Dec. 17, 2019, 7:49 p.m. UTC | #1
On 12/12/19 11:09 PM, Naohiro Aota wrote:
> To preserve the sequential write pattern on the drives, we must serialize
> allocation and submit_bio. This commit adds a per-block group mutex
> "zone_io_lock"; find_free_extent_zoned() holds the lock. The lock is kept
> even after returning from find_free_extent(). It is released once the
> IOs corresponding to the allocation have been submitted.
> 
> Implementing such behavior under __extent_writepage_io() is almost
> impossible because, once pages are unlocked, we cannot tell whether the
> submission of IOs for an allocated region has finished. Instead, this
> commit adds run_delalloc_hmzoned() to write out non-compressed data IOs
> at once using extent_write_locked_range(). After the write, we can call
> btrfs_hmzoned_data_io_unlock() to unlock the block group for new
> allocations.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

Have you actually tested these patches with lock debugging on?  The 
submit_compressed_extents stuff is async, so the unlock owner will not be the 
lock owner, and that'll make all sorts of things blow up.  This is just straight 
up broken.

I would really rather see a hmzoned block scheduler that just doesn't submit the 
bios until they are aligned with the WP, that way this intelligence doesn't 
have to be dealt with at the file system layer.  I get allocating in line with 
the WP, but this whole forcing us to allocate and submit the bio in lock step is 
just nuts, and broken in your subsequent patches.  This whole approach needs to 
be reworked.  Thanks,

Josef
Naohiro Aota Dec. 19, 2019, 6:54 a.m. UTC | #2
On Tue, Dec 17, 2019 at 02:49:44PM -0500, Josef Bacik wrote:
>On 12/12/19 11:09 PM, Naohiro Aota wrote:
>>To preserve the sequential write pattern on the drives, we must serialize
>>allocation and submit_bio. This commit adds a per-block group mutex
>>"zone_io_lock"; find_free_extent_zoned() holds the lock. The lock is kept
>>even after returning from find_free_extent(). It is released once the
>>IOs corresponding to the allocation have been submitted.
>>
>>Implementing such behavior under __extent_writepage_io() is almost
>>impossible because, once pages are unlocked, we cannot tell whether the
>>submission of IOs for an allocated region has finished. Instead, this
>>commit adds run_delalloc_hmzoned() to write out non-compressed data IOs
>>at once using extent_write_locked_range(). After the write, we can call
>>btrfs_hmzoned_data_io_unlock() to unlock the block group for new
>>allocations.
>>
>>Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>
>Have you actually tested these patches with lock debugging on?  The 
>submit_compressed_extents stuff is async, so the unlock owner will 
>not be the lock owner, and that'll make all sorts of things blow up.  
>This is just straight up broken.

Yes, I have run xfstests on this patch series with lockdep and
KASAN. There was no problem with that.

For non-compressed writes, both allocation and submission are done in
run_delalloc_zoned(). Allocation is done in cow_file_range() and
submission in extent_write_locked_range(), so both are in the same
context and both locking and unlocking are done by the same execution
context.

For compressed writes, again, allocation/locking is done under
cow_file_range(), submission is done in extent_write_locked_range(),
and the unlock follows, all within submit_compressed_extents() (which
is called after compression), so they are all in the same context and
the lock owner does the unlock.

>I would really rather see a hmzoned block scheduler that just doesn't 
>submit the bios until they are aligned with the WP, that way this 
>intelligence doesn't have to be dealt with at the file system layer.  
>I get allocating in line with the WP, but this whole forcing us to 
>allocate and submit the bio in lock step is just nuts, and broken in 
>your subsequent patches.  This whole approach needs to be reworked.  
>Thanks,
>
>Josef

We tried this approach by modifying mq-deadline to wait if the first
queued request is not aligned at the write pointer of a zone. However,
running btrfs without the allocate+submit lock with this modified IO
scheduler did not work well at all. With write intensive workloads, we
observed that a very long wait time was very often necessary to get a
fully sequential stream of requests starting at the write pointer of a
zone. The wait time we observed was sometimes larger than 60 seconds,
at which point we gave up.

While we did not extensively dig into the fundamental root cause,
these potentially long wait times can come from a large number of
causes: page cache writeback behavior, kernel process scheduling,
device IO congestion and writeback throttling, sync, and transaction
commits of btrfs; cgroup use could make everything even worse. In the
worst case scenario, a number of out-of-order requests could get
stuck in the IO scheduler, preventing forward progress in the case of
memory reclaim writeback, causing the OOM killer to start happily
killing application processes. Furthermore, IO error handling becomes
a nightmare, as the block layer scheduler would need to issue report
zones commands to re-sync the zone wp in case of a write error. And
that is in addition to having to track other zone commands that change
a zone wp, such as reset zone and finish zone.

Considering all this, handling the sequential write constraint at the
file system layer by ensuring that write BIOs are issued in the correct
order starting from a zone WP is far simpler and removes dependencies on
other features such as cgroup, congestion control and other throttling
mechanisms. The IO scheduler can always dispatch the requests it
receives to the device without any waiting time, ensuring forward progress.

The mq-deadline IO scheduler supports not only regular block devices but
also zoned block devices, and it is the default scheduler for them; other
schedulers that are not zone compliant cannot be selected (one cannot
switch to kyber or bfq). This ensures that the default system behavior
will be correct as long as the user (the FS) respects the sequential
write rule.

The previous approach I proposed, using a btrfs request reordering stage,
was indeed very invasive, and similarly the block layer scheduler changes
could cause problems with cgroups etc. The new approach of this patch,
using locking to make allocation and bio issuing atomic, results in
per-zone sequential write patterns no matter what happens around it. It
is less invasive and relies on the sequential allocation of blocks for
the ordering of write IOs, so there is no explicit reordering and no
additional overhead. The f2fs implementation has used a similar approach
since kernel 4.10 and has proven to be very solid.

In light of these arguments and explanations, do you still think the
zone allocation locking approach is not acceptable?
Josef Bacik Dec. 19, 2019, 2:01 p.m. UTC | #3
On 12/19/19 1:54 AM, Naohiro Aota wrote:
> On Tue, Dec 17, 2019 at 02:49:44PM -0500, Josef Bacik wrote:
>> On 12/12/19 11:09 PM, Naohiro Aota wrote:
>>> To preserve the sequential write pattern on the drives, we must serialize
>>> allocation and submit_bio. This commit adds a per-block group mutex
>>> "zone_io_lock"; find_free_extent_zoned() holds the lock. The lock is kept
>>> even after returning from find_free_extent(). It is released once the
>>> IOs corresponding to the allocation have been submitted.
>>>
>>> Implementing such behavior under __extent_writepage_io() is almost
>>> impossible because, once pages are unlocked, we cannot tell whether the
>>> submission of IOs for an allocated region has finished. Instead, this
>>> commit adds run_delalloc_hmzoned() to write out non-compressed data IOs
>>> at once using extent_write_locked_range(). After the write, we can call
>>> btrfs_hmzoned_data_io_unlock() to unlock the block group for new
>>> allocations.
>>>
>>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>
>> Have you actually tested these patches with lock debugging on?  The 
>> submit_compressed_extents stuff is async, so the unlock owner will not be 
>> the lock owner, and that'll make all sorts of things blow up. This is just 
>> straight up broken.
> 
> Yes, I have run xfstests on this patch series with lockdep and
> KASAN. There was no problem with that.
> 
> For non-compressed writes, both allocation and submission are done in
> run_delalloc_zoned(). Allocation is done in cow_file_range() and
> submission in extent_write_locked_range(), so both are in the same
> context and both locking and unlocking are done by the same execution
> context.
> 
> For compressed writes, again, allocation/locking is done under
> cow_file_range(), submission is done in extent_write_locked_range(),
> and the unlock follows, all within submit_compressed_extents() (which
> is called after compression), so they are all in the same context and
> the lock owner does the unlock.
> 
>> I would really rather see a hmzoned block scheduler that just doesn't submit 
>> the bio's until they are aligned with the WP, that way this intellligence 
>> doesn't have to be dealt with at the file system layer. I get allocating in 
>> line with the WP, but this whole forcing us to allocate and submit the bio in 
>> lock step is just nuts, and broken in your subsequent patches.  This whole 
>> approach needs to be reworked. Thanks,
>>
>> Josef
> 
> We tried this approach by modifying mq-deadline to wait if the first
> queued request is not aligned at the write pointer of a zone. However,
> running btrfs without the allocate+submit lock with this modified IO
> scheduler did not work well at all. With write intensive workloads, we
> observed that a very long wait time was very often necessary to get a
> fully sequential stream of requests starting at the write pointer of a
> zone. The wait time we observed was sometimes larger than 60 seconds,
> at which point we gave up.

This is because we will only write out the pages we've been handed but do 
cow_file_range() for a possibly larger delalloc range, so as you say there can 
be a large gap in time between writing one part of the range and writing the 
next part.

You actually solve this with your patch, by doing the cow_file_range and then 
following it up with the extent_write_locked_range() for the range you just cow'ed.

There is no need for the locking in this case, you could simply do that and then 
have a modified block scheduler that keeps the bios in the correct order.  I 
imagine if you just did this with your original block layer approach it would 
work fine.  Thanks,

Josef
Naohiro Aota Jan. 21, 2020, 6:54 a.m. UTC | #4
On Thu, Dec 19, 2019 at 09:01:35AM -0500, Josef Bacik wrote:
>On 12/19/19 1:54 AM, Naohiro Aota wrote:
>>On Tue, Dec 17, 2019 at 02:49:44PM -0500, Josef Bacik wrote:
>>>On 12/12/19 11:09 PM, Naohiro Aota wrote:
>>>>To preserve the sequential write pattern on the drives, we must serialize
>>>>allocation and submit_bio. This commit adds a per-block group mutex
>>>>"zone_io_lock"; find_free_extent_zoned() holds the lock. The lock is kept
>>>>even after returning from find_free_extent(). It is released once the
>>>>IOs corresponding to the allocation have been submitted.
>>>>
>>>>Implementing such behavior under __extent_writepage_io() is almost
>>>>impossible because, once pages are unlocked, we cannot tell whether the
>>>>submission of IOs for an allocated region has finished. Instead, this
>>>>commit adds run_delalloc_hmzoned() to write out non-compressed data IOs
>>>>at once using extent_write_locked_range(). After the write, we can call
>>>>btrfs_hmzoned_data_io_unlock() to unlock the block group for new
>>>>allocations.
>>>>
>>>>Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>>
>>>Have you actually tested these patches with lock debugging on?  
>>>The submit_compressed_extents stuff is async, so the unlock 
>>>owner will not be the lock owner, and that'll make all sorts of 
>>>things blow up. This is just straight up broken.
>>
>>Yes, I have run xfstests on this patch series with lockdep and
>>KASAN. There was no problem with that.
>>
>>For non-compressed writes, both allocation and submission are done in
>>run_delalloc_zoned(). Allocation is done in cow_file_range() and
>>submission in extent_write_locked_range(), so both are in the same
>>context and both locking and unlocking are done by the same execution
>>context.
>>
>>For compressed writes, again, allocation/locking is done under
>>cow_file_range(), submission is done in extent_write_locked_range(),
>>and the unlock follows, all within submit_compressed_extents() (which
>>is called after compression), so they are all in the same context and
>>the lock owner does the unlock.
>>
>>>I would really rather see a hmzoned block scheduler that just 
>>>doesn't submit the bios until they are aligned with the WP, that 
>>>way this intelligence doesn't have to be dealt with at the file 
>>>system layer. I get allocating in line with the WP, but this whole 
>>>forcing us to allocate and submit the bio in lock step is just 
>>>nuts, and broken in your subsequent patches.  This whole approach 
>>>needs to be reworked. Thanks,
>>>
>>>Josef
>>
>>We tried this approach by modifying mq-deadline to wait if the first
>>queued request is not aligned at the write pointer of a zone. However,
>>running btrfs without the allocate+submit lock with this modified IO
>>scheduler did not work well at all. With write intensive workloads, we
>>observed that a very long wait time was very often necessary to get a
>>fully sequential stream of requests starting at the write pointer of a
>>zone. The wait time we observed was sometimes larger than 60 seconds,
>>at which point we gave up.
>
>This is because we will only write out the pages we've been handed but 
>do cow_file_range() for a possibly larger delalloc range, so as you 
>say there can be a large gap in time between writing one part of the 
>range and writing the next part.
>
>You actually solve this with your patch, by doing the cow_file_range 
>and then following it up with the extent_write_locked_range() for the 
>range you just cow'ed.
>
>There is no need for the locking in this case, you could simply do 
>that and then have a modified block scheduler that keeps the bios in 
>the correct order.  I imagine if you just did this with your original 
>block layer approach it would work fine.  Thanks,
>
>Josef

We have once again tried the btrfs SMR (Zoned Block Device) support
series without the locking around extent allocation and bio issuing,
with a modified version of mq-deadline as the scheduler for the block
layer. As you already know, mq-deadline will order read and write
requests separately in increasing sector order, which is essential for
SMR sequential writing. However, mq-deadline does not provide
guarantees regarding the completeness of a sequential write stream. If
there are missing requests ("holes" in the write stream), mq-deadline
will still dispatch the next write request in order, leading to write
errors on SMR drives.

The modification we added to mq-deadline is the addition of a wait
time when a hole in a sequential write stream is discovered, somewhat
reminiscent of the old anticipatory scheduler. The wait time is
limited: if a hole is not filled up by newly inserted requests before
a timeout elapses, write requests are issued as is (and errors happen
on SMR). The timeout we used initially was set to the value of
"/sys/block/<dev>/queue/iosched/write_expire", which is 5
seconds.
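
To make the rule concrete, here is a rough, self-contained sketch of
the dispatch decision we added for the experiment. It is illustrative
only (the names are made up and this is not the actual mq-deadline
code):

#include <stdint.h>

enum zoned_decision {
	DISPATCH,		/* request starts at the zone write pointer */
	WAIT_FOR_HOLE,		/* hold back, hoping the gap gets filled */
	GIVE_UP_AND_DISPATCH,	/* timeout expired: dispatch and let it fail */
};

/*
 * Decide what to do with the next queued write for a zone.
 * write_expire_ms plays the role of
 * /sys/block/<dev>/queue/iosched/write_expire (5 seconds in our runs).
 */
enum zoned_decision next_write_decision(uint64_t req_sector,
					uint64_t zone_wp_sector,
					uint64_t waited_ms,
					uint64_t write_expire_ms)
{
	if (req_sector == zone_wp_sector)
		return DISPATCH;
	if (waited_ms < write_expire_ms)
		return WAIT_FOR_HOLE;
	return GIVE_UP_AND_DISPATCH;
}

The GIVE_UP_AND_DISPATCH branch is what turns into the unaligned write
errors described below once a hole is never filled.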

With this, tests show that unaligned write errors happen with a simple
workload of 48 threads simultaneously doing write() to their dedicated
files followed by fdatasync() (the code of the application doing this
is attached to this email).

Despite the wait time of 5 seconds, the holes in a zone's sequential
write stream are not filled up by issued BIOs because of a "buffer
bloat." First, a bio whose LBA is not aligned with the write pointer
reaches the IO scheduler (call it bio#1). To proceed with bio#1, the
IO scheduler must wait for a hole-filling bio aligned with the write
pointer (call it bio#0). If bio#1 is large, the scheduler needs to
split it into a large number of requests. Each request must first
obtain a scheduler tag to be inserted into the scheduler queue. Since
the number of scheduler tags is limited and tags are freed only on
completion of queued and in-flight requests, the requests from bio#1
can use up all the tags. This is not a problem if forward progress is
made (i.e., requests are dispatched to the disk), but if all the
tag-holding requests in the scheduler belong to bio#1 and subsequent
writes in the sequence, they are all waiting for bio#0 to be issued.
We thus end up with a soft deadlock in request issuing and no
possibility of progress. That causes the timeout to trigger, no matter
how large we set it, and results in unaligned write errors. Large bios
needing lots of requests for processing will trigger this problem all
the time.
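
A hypothetical back-of-the-envelope illustration of the tag exhaustion
(the numbers below are made up, not measured):

#include <stdio.h>

int main(void)
{
	/* hypothetical scheduler queue depth (number of scheduler tags) */
	unsigned long long sched_tags = 256;
	/* a large bio#1 that is not aligned with the write pointer */
	unsigned long long bio1_bytes = 128ULL << 20;
	/* hypothetical maximum request size after splitting */
	unsigned long long max_req_bytes = 512ULL << 10;
	unsigned long long reqs_for_bio1 = bio1_bytes / max_req_bytes;

	printf("bio#1 splits into %llu requests for %llu available tags\n",
	       reqs_for_bio1, sched_tags);
	if (reqs_for_bio1 >= sched_tags)
		printf("all tags end up held by bio#1; bio#0 cannot get a tag "
		       "-> soft deadlock\n");
	return 0;
}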

In addition to unaligned write errors, we also observed hung_task
timeouts with a larger wait timeout. The reason is the same as above:
writing threads get stuck in blk_mq_get_tag() trying to acquire their
scheduler tags. We hit hung_task more often than unaligned write
errors when increasing the timeout.

Jan 07 11:17:11 naota-devel kernel: INFO: task multi-proc-writ:2202 blocked for more than 122 seconds.
Jan 07 11:17:11 naota-devel kernel:       Not tainted 5.4.0-rc8-BTRFS-ZNS+ #165
Jan 07 11:17:11 naota-devel kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 07 11:17:11 naota-devel kernel: multi-proc-writ D    0  2202   2168 0x00004000
Jan 07 11:17:11 naota-devel kernel: Call Trace:
Jan 07 11:17:11 naota-devel kernel:  __schedule+0x8ab/0x1db0
Jan 07 11:17:11 naota-devel kernel:  ? pci_mmcfg_check_reserved+0x130/0x130
Jan 07 11:17:11 naota-devel kernel:  ? blk_insert_cloned_request+0x3e0/0x3e0
Jan 07 11:17:11 naota-devel kernel:  schedule+0xdb/0x260
Jan 07 11:17:11 naota-devel kernel:  io_schedule+0x21/0x70
Jan 07 11:17:11 naota-devel kernel:  blk_mq_get_tag+0x3b6/0x940
Jan 07 11:17:11 naota-devel kernel:  ? __blk_mq_tag_idle+0x80/0x80
Jan 07 11:17:11 naota-devel kernel:  ? finish_wait+0x270/0x270
Jan 07 11:17:11 naota-devel kernel:  blk_mq_get_request+0x340/0x1750
Jan 07 11:17:11 naota-devel kernel:  blk_mq_make_request+0x339/0x1bd0
Jan 07 11:17:11 naota-devel kernel:  ? blk_queue_enter+0x8a4/0xa30
Jan 07 11:17:11 naota-devel kernel:  ? blk_mq_try_issue_directly+0x150/0x150
Jan 07 11:17:11 naota-devel kernel:  generic_make_request+0x20c/0xa70
Jan 07 11:17:11 naota-devel kernel:  ? blk_queue_enter+0xa30/0xa30
Jan 07 11:17:11 naota-devel kernel:  ? find_held_lock+0x35/0x130
Jan 07 11:17:11 naota-devel kernel:  ? __kasan_check_read+0x11/0x20
Jan 07 11:17:11 naota-devel kernel:  submit_bio+0xd5/0x3c0
Jan 07 11:17:11 naota-devel kernel:  ? submit_bio+0xd5/0x3c0
Jan 07 11:17:11 naota-devel kernel:  ? generic_make_request+0xa70/0xa70
Jan 07 11:17:11 naota-devel kernel:  btrfs_map_bio+0x5f5/0xfb0 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  ? btrfs_rmap_block+0x820/0x820 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  ? unlock_page+0x9f/0x110
Jan 07 11:17:11 naota-devel kernel:  ? __extent_writepage+0x5aa/0x800 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  ? lock_downgrade+0x770/0x770
Jan 07 11:17:11 naota-devel kernel:  btrfs_submit_bio_hook+0x336/0x600 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  ? btrfs_fiemap+0x50/0x50 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  submit_one_bio+0xba/0x130 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  extent_write_locked_range+0x2f9/0x3e0 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  ? extent_write_full_page+0x1f0/0x1f0 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  ? lock_downgrade+0x770/0x770
Jan 07 11:17:11 naota-devel kernel:  ? account_page_redirty+0x2bb/0x490
Jan 07 11:17:11 naota-devel kernel:  run_delalloc_zoned+0x108/0x2f0 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  btrfs_run_delalloc_range+0xc4b/0x1170 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  ? test_range_bit+0x360/0x360 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  ? find_get_pages_range_tag+0x6f8/0x9d0
Jan 07 11:17:11 naota-devel kernel:  ? sched_clock_cpu+0x1b/0x170
Jan 07 11:17:11 naota-devel kernel:  ? mark_lock+0xc0/0x1160
Jan 07 11:17:11 naota-devel kernel:  writepage_delalloc+0x11e/0x270 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  ? find_lock_delalloc_range+0x400/0x400 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  ? rcu_read_lock_sched_held+0xa1/0xd0
Jan 07 11:17:11 naota-devel kernel:  ? rcu_read_lock_bh_held+0xb0/0xb0
Jan 07 11:17:11 naota-devel kernel:  __extent_writepage+0x3a2/0x800 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  ? lock_downgrade+0x770/0x770
Jan 07 11:17:11 naota-devel kernel:  ? __do_readpage+0x13a0/0x13a0 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  ? clear_page_dirty_for_io+0x32a/0x6e0
Jan 07 11:17:11 naota-devel kernel:  ? __kasan_check_read+0x11/0x20
Jan 07 11:17:11 naota-devel kernel:  extent_write_cache_pages+0x61c/0xaf0 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  ? __extent_writepage+0x800/0x800 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  ? __kasan_check_read+0x11/0x20
Jan 07 11:17:11 naota-devel kernel:  ? mark_lock+0xc0/0x1160
Jan 07 11:17:11 naota-devel kernel:  ? sched_clock_cpu+0x1b/0x170
Jan 07 11:17:11 naota-devel kernel:  ? __kasan_check_read+0x11/0x20
Jan 07 11:17:11 naota-devel kernel:  extent_writepages+0xf8/0x1a0 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  ? __kasan_check_read+0x11/0x20
Jan 07 11:17:11 naota-devel kernel:  ? extent_write_locked_range+0x3e0/0x3e0 [btrfs]
Jan 07 11:17:12 naota-devel kernel:  ? find_held_lock+0x35/0x130
Jan 07 11:17:12 naota-devel kernel:  ? __kasan_check_read+0x11/0x20
Jan 07 11:17:12 naota-devel kernel:  btrfs_writepages+0xe/0x10 [btrfs]
Jan 07 11:17:12 naota-devel kernel:  do_writepages+0xe0/0x270
Jan 07 11:17:12 naota-devel kernel:  ? lock_downgrade+0x770/0x770
Jan 07 11:17:12 naota-devel kernel:  ? page_writeback_cpu_online+0x20/0x20
Jan 07 11:17:12 naota-devel kernel:  ? __kasan_check_read+0x11/0x20
Jan 07 11:17:12 naota-devel kernel:  ? do_raw_spin_unlock+0x59/0x250
Jan 07 11:17:12 naota-devel kernel:  ? _raw_spin_unlock+0x28/0x40
Jan 07 11:17:12 naota-devel kernel:  ? wbc_attach_and_unlock_inode+0x432/0x840
Jan 07 11:17:12 naota-devel kernel:  __filemap_fdatawrite_range+0x264/0x340
Jan 07 11:17:12 naota-devel kernel:  ? tty_ldisc_deref+0x35/0x40
Jan 07 11:17:12 naota-devel kernel:  ? delete_from_page_cache_batch+0xab0/0xab0
Jan 07 11:17:12 naota-devel kernel:  filemap_fdatawrite_range+0x13/0x20
Jan 07 11:17:12 naota-devel kernel:  btrfs_fdatawrite_range+0x4d/0xf0 [btrfs]
Jan 07 11:17:12 naota-devel kernel:  btrfs_sync_file+0x235/0xb30 [btrfs]
Jan 07 11:17:12 naota-devel kernel:  ? rcu_read_lock_sched_held+0xd0/0xd0
Jan 07 11:17:12 naota-devel kernel:  ? btrfs_file_write_iter+0x1430/0x1430 [btrfs]
Jan 07 11:17:12 naota-devel kernel:  ? do_dup2+0x440/0x440
Jan 07 11:17:12 naota-devel kernel:  ? __x64_sys_futex+0x29b/0x3f0
Jan 07 11:17:12 naota-devel kernel:  ? ksys_write+0x1c3/0x220
Jan 07 11:17:12 naota-devel kernel:  ? btrfs_file_write_iter+0x1430/0x1430 [btrfs]
Jan 07 11:17:12 naota-devel kernel:  vfs_fsync_range+0xf6/0x220
Jan 07 11:17:12 naota-devel kernel:  ? __fget_light+0x184/0x1f0
Jan 07 11:17:12 naota-devel kernel:  do_fsync+0x3d/0x70
Jan 07 11:17:12 naota-devel kernel:  ? trace_hardirqs_on+0x28/0x190
Jan 07 11:17:12 naota-devel kernel:  __x64_sys_fdatasync+0x36/0x50
Jan 07 11:17:12 naota-devel kernel:  do_syscall_64+0xa4/0x4b0
Jan 07 11:17:12 naota-devel kernel:  entry_SYSCALL_64_after_hwframe+0x49/0xbe
Jan 07 11:17:12 naota-devel kernel: RIP: 0033:0x7f7ba395f9bf
Jan 07 11:17:12 naota-devel kernel: Code: Bad RIP value.
Jan 07 11:17:12 naota-devel kernel: RSP: 002b:00007f7ba385de80 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
Jan 07 11:17:12 naota-devel kernel: RAX: ffffffffffffffda RBX: 0000000000100000 RCX: 00007f7ba395f9bf
Jan 07 11:17:12 naota-devel kernel: RDX: 0000000000000001 RSI: 0000000000000081 RDI: 0000000000000003
Jan 07 11:17:12 naota-devel kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000404198
Jan 07 11:17:12 naota-devel kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000100000
Jan 07 11:17:12 naota-devel kernel: R13: 0000000000000000 R14: 00007f7ba2f5d010 R15: 00000000008592a0

Considering the above cases, I do not think it is possible to
implement such a "waiting IO scheduler" that would allow removing the
mutex around block allocation and bio issuing. Such a method would
require an intermediate bio reordering layer, either using a device
mapper or, as was implemented initially, directly in btrfs (but that is
now a layering violation, so we do not want that).

Entirely relying on the block layer for achieving a perfect sequential
write request sequence is fragile. The current block layer interface
semantic for zoned block devices is: "If BIOs are issued sequentially,
they will be dispatched to the drive in the same order, sequentially."
That directly reflects the drive constraint, so this is compatible
with other regular block devices in the sense that no intelligence is
added for trying to create sequential streams of requests when the
issuer is not issuing said requests in perfect order. Trying to
change this interface to something like: "OK, I can accept some
out-of-order writes, but you must fill the holes in the stream
quickly" cannot be implemented directly in the block layer. Device
mapper should be used for that, but if we do so, then one could argue
that all SMR support can simply rely on dm-zoned, which is really
sub-optimal from a performance perspective. We can do much better than
dm-zoned with direct support in btrfs, but that support requires
guarantees of sequential write BIO issuing. The current implementation
relies on a mutex for that, which, considering the complexity of
dm-zoned, is a *very* simple and clean solution.

Patch

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index e78d34a4fb56..6f7d29171adf 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1642,6 +1642,7 @@  static struct btrfs_block_group *btrfs_create_block_group_cache(
 	btrfs_init_free_space_ctl(cache);
 	atomic_set(&cache->trimming, 0);
 	mutex_init(&cache->free_space_lock);
+	mutex_init(&cache->zone_io_lock);
 	btrfs_init_full_stripe_locks_tree(&cache->full_stripe_locks_root);
 
 	return cache;
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 347605654021..57c8d6f4b3d1 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -165,6 +165,7 @@  struct btrfs_block_group {
 	 * enabled.
 	 */
 	u64 alloc_offset;
+	struct mutex zone_io_lock;
 };
 
 #ifdef CONFIG_BTRFS_DEBUG
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e61f69eef4a8..d1f326b6c4d4 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3699,6 +3699,7 @@  static int find_free_extent_zoned(struct btrfs_block_group *cache,
 
 	ASSERT(btrfs_fs_incompat(cache->fs_info, HMZONED));
 
+	btrfs_hmzoned_data_io_lock(cache);
 	spin_lock(&space_info->lock);
 	spin_lock(&cache->lock);
 
@@ -3729,6 +3730,9 @@  static int find_free_extent_zoned(struct btrfs_block_group *cache,
 out:
 	spin_unlock(&cache->lock);
 	spin_unlock(&space_info->lock);
+	/* if succeeds, unlock after submit_bio */
+	if (ret)
+		btrfs_hmzoned_data_io_unlock(cache);
 	return ret;
 }
 
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index ddec6aed7283..f6682ead575b 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -12,6 +12,7 @@ 
 #include <linux/blkdev.h>
 #include "volumes.h"
 #include "disk-io.h"
+#include "block-group.h"
 
 struct btrfs_zoned_device_info {
 	/*
@@ -48,6 +49,7 @@  int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
 void btrfs_redirty_list_add(struct btrfs_transaction *trans,
 			    struct extent_buffer *eb);
 void btrfs_free_redirty_list(struct btrfs_transaction *trans);
+void btrfs_hmzoned_data_io_unlock_at(struct inode *inode, u64 start, u64 len);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -116,6 +118,8 @@  static inline int btrfs_reset_device_zone(struct btrfs_device *device,
 static inline void btrfs_redirty_list_add(struct btrfs_transaction *trans,
 					  struct extent_buffer *eb) { }
 static inline void btrfs_free_redirty_list(struct btrfs_transaction *trans) { }
+static inline void btrfs_hmzoned_data_io_unlock_at(struct inode *inode,
+						   u64 start, u64 len) { }
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
@@ -218,4 +222,36 @@  static inline bool btrfs_can_zone_reset(struct btrfs_device *device,
 	return true;
 }
 
+static inline void btrfs_hmzoned_data_io_lock(
+	struct btrfs_block_group *cache)
+{
+	/* No need to lock metadata BGs or non-sequential BGs */
+	if (!btrfs_fs_incompat(cache->fs_info, HMZONED) ||
+	    !(cache->flags & BTRFS_BLOCK_GROUP_DATA))
+		return;
+	mutex_lock(&cache->zone_io_lock);
+}
+
+static inline void btrfs_hmzoned_data_io_unlock(
+	struct btrfs_block_group *cache)
+{
+	if (!btrfs_fs_incompat(cache->fs_info, HMZONED) ||
+	    !(cache->flags & BTRFS_BLOCK_GROUP_DATA))
+		return;
+	mutex_unlock(&cache->zone_io_lock);
+}
+
+static inline void btrfs_hmzoned_data_io_unlock_logical(
+	struct btrfs_fs_info *fs_info, u64 logical)
+{
+	struct btrfs_block_group *cache;
+
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return;
+
+	cache = btrfs_lookup_block_group(fs_info, logical);
+	btrfs_hmzoned_data_io_unlock(cache);
+	btrfs_put_block_group(cache);
+}
+
 #endif
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 56032c518b26..3677c36999d8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -49,6 +49,7 @@ 
 #include "qgroup.h"
 #include "delalloc-space.h"
 #include "block-group.h"
+#include "hmzoned.h"
 
 struct btrfs_iget_args {
 	struct btrfs_key *location;
@@ -1325,6 +1326,39 @@  static int cow_file_range_async(struct inode *inode,
 	return 0;
 }
 
+static noinline int run_delalloc_hmzoned(struct inode *inode,
+					 struct page *locked_page, u64 start,
+					 u64 end, int *page_started,
+					 unsigned long *nr_written)
+{
+	struct extent_map *em;
+	u64 logical;
+	int ret;
+
+	ret = cow_file_range(inode, locked_page, start, end,
+			     page_started, nr_written, 0);
+	if (ret)
+		return ret;
+
+	if (*page_started)
+		return 0;
+
+	em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start, end - start + 1,
+			      0);
+	ASSERT(em != NULL && em->block_start < EXTENT_MAP_LAST_BYTE);
+	logical = em->block_start;
+	free_extent_map(em);
+
+	__set_page_dirty_nobuffers(locked_page);
+	account_page_redirty(locked_page);
+	extent_write_locked_range(inode, start, end, WB_SYNC_ALL);
+	*page_started = 1;
+
+	btrfs_hmzoned_data_io_unlock_logical(btrfs_sb(inode->i_sb), logical);
+
+	return 0;
+}
+
 static noinline int csum_exist_in_range(struct btrfs_fs_info *fs_info,
 					u64 bytenr, u64 num_bytes)
 {
@@ -1737,17 +1771,24 @@  int btrfs_run_delalloc_range(struct inode *inode, struct page *locked_page,
 {
 	int ret;
 	int force_cow = need_force_cow(inode, start, end);
+	int do_compress = inode_can_compress(inode) &&
+		inode_need_compress(inode, start, end);
+	int hmzoned = btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED);
 
 	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW && !force_cow) {
+		ASSERT(!hmzoned);
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
 					 page_started, 1, nr_written);
 	} else if (BTRFS_I(inode)->flags & BTRFS_INODE_PREALLOC && !force_cow) {
+		ASSERT(!hmzoned);
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
 					 page_started, 0, nr_written);
-	} else if (!inode_can_compress(inode) ||
-		   !inode_need_compress(inode, start, end)) {
+	} else if (!do_compress && !hmzoned) {
 		ret = cow_file_range(inode, locked_page, start, end,
 				      page_started, nr_written, 1);
+	} else if (!do_compress && hmzoned) {
+		ret = run_delalloc_hmzoned(inode, locked_page, start, end,
+					   page_started, nr_written);
 	} else {
 		set_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
 			&BTRFS_I(inode)->runtime_flags);