[11/19] btrfs: introduce submit buffer
diff mbox series

Message ID 20190607131025.31996-12-naohiro.aota@wdc.com
State New
Headers show
Series
  • btrfs zoned block device support
Related show

Commit Message

Naohiro Aota June 7, 2019, 1:10 p.m. UTC
Sequential allocation is not enough to maintain sequential delivery of
write IOs to the device. Various features (async compress, async checksum,
...) of btrfs affect ordering of the IOs. This patch introduces submit
buffer to sort WRITE bios belonging to a block group and sort them out
sequentially in increasing block address to achieve sequential write
sequences with __btrfs_map_bio().

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/ctree.h             |   3 +
 fs/btrfs/extent-tree.c       |   5 ++
 fs/btrfs/volumes.c           | 165 +++++++++++++++++++++++++++++++++--
 fs/btrfs/volumes.h           |   3 +
 include/trace/events/btrfs.h |  41 +++++++++
 5 files changed, 212 insertions(+), 5 deletions(-)

Comments

Josef Bacik June 13, 2019, 2:14 p.m. UTC | #1
On Fri, Jun 07, 2019 at 10:10:17PM +0900, Naohiro Aota wrote:
> Sequential allocation is not enough to maintain sequential delivery of
> write IOs to the device. Various features (async compress, async checksum,
> ...) of btrfs affect ordering of the IOs. This patch introduces submit
> buffer to sort WRITE bios belonging to a block group and sort them out
> sequentially in increasing block address to achieve sequential write
> sequences with __btrfs_map_bio().
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

I hate everything about this.  Can't we just use the plugging infrastructure for
this and then make sure it re-orders the bios before submitting them?  Also
what's to prevent the block layer scheduler from re-arranging these io's?
Thanks,

Josef
Damien Le Moal June 17, 2019, 3:16 a.m. UTC | #2
Josef,

On 2019/06/13 23:15, Josef Bacik wrote:
> On Fri, Jun 07, 2019 at 10:10:17PM +0900, Naohiro Aota wrote:
>> Sequential allocation is not enough to maintain sequential delivery of
>> write IOs to the device. Various features (async compress, async checksum,
>> ...) of btrfs affect ordering of the IOs. This patch introduces submit
>> buffer to sort WRITE bios belonging to a block group and sort them out
>> sequentially in increasing block address to achieve sequential write
>> sequences with __btrfs_map_bio().
>>
>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> 
> I hate everything about this.  Can't we just use the plugging infrastructure for
> this and then make sure it re-orders the bios before submitting them?  Also
> what's to prevent the block layer scheduler from re-arranging these io's?
> Thanks,

The block I/O scheduler reorders requests in LBA order, but that happens for a
newly inserted request against pending requests. If there are no pending
requests because all requests were already issued, no ordering happen, and even
worse, if the drive queue is not full yet (e.g. there are free tags), then the
newly inserted request will be dispatched almost immediately, preventing
reordering with subsequent incoming write requests to happen.

The other problem is that the mq-deadline scheduler does not track zone WP
position. Write request issuing is done regardless of the current WP value,
solely based on LBA ordering. This means that mq-deadline will not prevent
out-of-order, or rather, unaligned write requests. These will not be detected
and dispatched whenever possible. The reasons for this are that:
1) the disk user (the FS) has to manage zone WP positions anyway. So duplicating
that management at the block IO scheduler level is inefficient.
2) Adding zone WP management at the block IO scheduler level would also need a
write error processing path to resync the WP value in case of failed writes. But
the user/FS also needs that anyway. Again duplicated functionalities.
3) The block layer will need a timeout to force issue or cancel pending
unaligned write requests. This is necessary in case the drive user stops issuing
writes (for whatever reasons) or the scheduler is being switched. This would
unnecessarily cause write I/O errors or cause deadlocks if the request queue
quiesce mode is entered at the wrong time (and I do not see a good way to deal
with that).

blk-mq is already complicated enough. Adding this to the block IO scheduler will
unnecessarily complicate things further for no real benefits. I would like to
point out the dm-zoned device mapper and f2fs which are both already dealing
with write ordering and write error processing directly. Both are fairly
straightforward but completely different and each optimized for their own structure.

Naohiro changes to btrfs IO scheduler have the same intent, that is, efficiently
integrate and handle write ordering "a la btrfs". Would creating a different
"hmzoned" btrfs IO scheduler help address your concerns ?

Best regards.
David Sterba June 18, 2019, midnight UTC | #3
On Mon, Jun 17, 2019 at 03:16:05AM +0000, Damien Le Moal wrote:
> Josef,
> 
> On 2019/06/13 23:15, Josef Bacik wrote:
> > On Fri, Jun 07, 2019 at 10:10:17PM +0900, Naohiro Aota wrote:
> >> Sequential allocation is not enough to maintain sequential delivery of
> >> write IOs to the device. Various features (async compress, async checksum,
> >> ...) of btrfs affect ordering of the IOs. This patch introduces submit
> >> buffer to sort WRITE bios belonging to a block group and sort them out
> >> sequentially in increasing block address to achieve sequential write
> >> sequences with __btrfs_map_bio().
> >>
> >> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> > 
> > I hate everything about this.  Can't we just use the plugging infrastructure for
> > this and then make sure it re-orders the bios before submitting them?  Also
> > what's to prevent the block layer scheduler from re-arranging these io's?
> > Thanks,
> 
> The block I/O scheduler reorders requests in LBA order, but that happens for a
> newly inserted request against pending requests. If there are no pending
> requests because all requests were already issued, no ordering happen, and even
> worse, if the drive queue is not full yet (e.g. there are free tags), then the
> newly inserted request will be dispatched almost immediately, preventing
> reordering with subsequent incoming write requests to happen.

This would be good to add to the changelog.
> 
> The other problem is that the mq-deadline scheduler does not track zone WP
> position. Write request issuing is done regardless of the current WP value,
> solely based on LBA ordering. This means that mq-deadline will not prevent
> out-of-order, or rather, unaligned write requests.

This seems to be the key point.

> These will not be detected
> and dispatched whenever possible. The reasons for this are that:
> 1) the disk user (the FS) has to manage zone WP positions anyway. So duplicating
> that management at the block IO scheduler level is inefficient.
> 2) Adding zone WP management at the block IO scheduler level would also need a
> write error processing path to resync the WP value in case of failed writes. But
> the user/FS also needs that anyway. Again duplicated functionalities.
> 3) The block layer will need a timeout to force issue or cancel pending
> unaligned write requests. This is necessary in case the drive user stops issuing
> writes (for whatever reasons) or the scheduler is being switched. This would
> unnecessarily cause write I/O errors or cause deadlocks if the request queue
> quiesce mode is entered at the wrong time (and I do not see a good way to deal
> with that).
> 
> blk-mq is already complicated enough. Adding this to the block IO scheduler will
> unnecessarily complicate things further for no real benefits. I would like to
> point out the dm-zoned device mapper and f2fs which are both already dealing
> with write ordering and write error processing directly. Both are fairly
> straightforward but completely different and each optimized for their own structure.

So the question is where on which layer the decision logic is. The
filesystem(s) or dm-zoned have enough information about the zones and
the writes can be pre-sorted. This is what the patch proposes.

From your explanation I get that the io scheduler can throw the wrench
in the sequential ordering, for various reasons depending on state of
internal structures od device queues. This is my simplified
interpretation as I don't understand all the magic below filesystem
layer.

I assume there are some guarantees about the ordering, eg. within one
plug, that apply to all schedulers (maybe not the noop one). Something
like that should be the least common functionality that the filesystem
layer can rely on.
 
> Naohiro changes to btrfs IO scheduler have the same intent, that is, efficiently
> integrate and handle write ordering "a la btrfs". Would creating a different
> "hmzoned" btrfs IO scheduler help address your concerns ?

IMHO this sounds both the same, all we care about is the sequential
ordering, which in some sense is "scheduling", but I would not call it
that way due to the simplicity.

As implemented, it's a list of bios, but I'd suggest using rb-tree or
xarray, the insertion is fast and submission is start to end traversal.
I'm not sure that the loop in __btrfs_map_bio_zoned after label
send_bios: has reasonable complexity, looks like an O(N^2).
Damien Le Moal June 18, 2019, 4:04 a.m. UTC | #4
David,

On 2019/06/18 8:59, David Sterba wrote:
> On Mon, Jun 17, 2019 at 03:16:05AM +0000, Damien Le Moal wrote:
>> Josef,
>>
>> On 2019/06/13 23:15, Josef Bacik wrote:
>>> On Fri, Jun 07, 2019 at 10:10:17PM +0900, Naohiro Aota wrote:
>>>> Sequential allocation is not enough to maintain sequential delivery of
>>>> write IOs to the device. Various features (async compress, async checksum,
>>>> ...) of btrfs affect ordering of the IOs. This patch introduces submit
>>>> buffer to sort WRITE bios belonging to a block group and sort them out
>>>> sequentially in increasing block address to achieve sequential write
>>>> sequences with __btrfs_map_bio().
>>>>
>>>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>>
>>> I hate everything about this.  Can't we just use the plugging infrastructure for
>>> this and then make sure it re-orders the bios before submitting them?  Also
>>> what's to prevent the block layer scheduler from re-arranging these io's?
>>> Thanks,
>>
>> The block I/O scheduler reorders requests in LBA order, but that happens for a
>> newly inserted request against pending requests. If there are no pending
>> requests because all requests were already issued, no ordering happen, and even
>> worse, if the drive queue is not full yet (e.g. there are free tags), then the
>> newly inserted request will be dispatched almost immediately, preventing
>> reordering with subsequent incoming write requests to happen.
> 
> This would be good to add to the changelog.

Sure. No problem. We can add that explanation.

>> The other problem is that the mq-deadline scheduler does not track zone WP
>> position. Write request issuing is done regardless of the current WP value,
>> solely based on LBA ordering. This means that mq-deadline will not prevent
>> out-of-order, or rather, unaligned write requests.
> 
> This seems to be the key point.

Yes it is. We can also add this to the commit message explanation.

>> These will not be detected
>> and dispatched whenever possible. The reasons for this are that:
>> 1) the disk user (the FS) has to manage zone WP positions anyway. So duplicating
>> that management at the block IO scheduler level is inefficient.
>> 2) Adding zone WP management at the block IO scheduler level would also need a
>> write error processing path to resync the WP value in case of failed writes. But
>> the user/FS also needs that anyway. Again duplicated functionalities.
>> 3) The block layer will need a timeout to force issue or cancel pending
>> unaligned write requests. This is necessary in case the drive user stops issuing
>> writes (for whatever reasons) or the scheduler is being switched. This would
>> unnecessarily cause write I/O errors or cause deadlocks if the request queue
>> quiesce mode is entered at the wrong time (and I do not see a good way to deal
>> with that).
>>
>> blk-mq is already complicated enough. Adding this to the block IO scheduler will
>> unnecessarily complicate things further for no real benefits. I would like to
>> point out the dm-zoned device mapper and f2fs which are both already dealing
>> with write ordering and write error processing directly. Both are fairly
>> straightforward but completely different and each optimized for their own structure.
> 
> So the question is where on which layer the decision logic is. The
> filesystem(s) or dm-zoned have enough information about the zones and
> the writes can be pre-sorted. This is what the patch proposes.

Yes, exactly.

> From your explanation I get that the io scheduler can throw the wrench
> in the sequential ordering, for various reasons depending on state of
> internal structures od device queues. This is my simplified
> interpretation as I don't understand all the magic below filesystem
> layer.

Not exactly "throw the wrench". mq-deadline will guarantee per zone write order
to be exactly the order in which requests were inserted, that is, issued by the
FS. But mq-dealine will not "wait" if the write order is not purely sequential,
that is, there are holes/jumps in the LBA sequence for the zone. Order only is
guaranteed. The alignment to WP/contiguous sequential write issuing is the
responsibility of the issuer (FS or DM or application in the case of raw accesses).

> I assume there are some guarantees about the ordering, eg. within one
> plug, that apply to all schedulers (maybe not the noop one). Something
> like that should be the least common functionality that the filesystem
> layer can rely on.

The insertion side of the scheduler (upper level from FS to scheduler), which
include the per CPU software queues and plug control, will not reorder requests.
However, the dispatch side (lower level, from scheduler to HBA driver) can cause
reordering. This is what mq-deadline prevents using a per zone write lock to
avoid reordering of write requests per zone by allowing only a single write
request per zone to be dispatched to the device at any time. Overall order is
not guaranteed, nor is read request order. But per zone write requests will not
be reordered.

But again, this is only ordering. Nothing to do with trying to achieve a purely
sequential write stream per zone. This is the responsibility of the issuer to
deliver write request per zone without any gap, all requests sequential in LBA
within each zone. Overall, the stream of request does not have to be sequential,
e.g. if multiple zones are being written at the same time. But per zones, write
requests must be sequential.

>> Naohiro changes to btrfs IO scheduler have the same intent, that is, efficiently
>> integrate and handle write ordering "a la btrfs". Would creating a different
>> "hmzoned" btrfs IO scheduler help address your concerns ?
> 
> IMHO this sounds both the same, all we care about is the sequential
> ordering, which in some sense is "scheduling", but I would not call it
> that way due to the simplicity.

OK. And yes, it is only ordering of writes per zone. For all other requests,
e.g. reads, order does not matter. And the overall interleaving of write
requests to different zones can also be anything. No constraints there.

> As implemented, it's a list of bios, but I'd suggest using rb-tree or
> xarray, the insertion is fast and submission is start to end traversal.
> I'm not sure that the loop in __btrfs_map_bio_zoned after label
> send_bios: has reasonable complexity, looks like an O(N^2).

OK. We can change that. rbtree is simple enough to use. We can change the list
to that.

Thank you for your comments.

Best regards.
Josef Bacik June 18, 2019, 1:33 p.m. UTC | #5
On Mon, Jun 17, 2019 at 03:16:05AM +0000, Damien Le Moal wrote:
> Josef,
> 
> On 2019/06/13 23:15, Josef Bacik wrote:
> > On Fri, Jun 07, 2019 at 10:10:17PM +0900, Naohiro Aota wrote:
> >> Sequential allocation is not enough to maintain sequential delivery of
> >> write IOs to the device. Various features (async compress, async checksum,
> >> ...) of btrfs affect ordering of the IOs. This patch introduces submit
> >> buffer to sort WRITE bios belonging to a block group and sort them out
> >> sequentially in increasing block address to achieve sequential write
> >> sequences with __btrfs_map_bio().
> >>
> >> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> > 
> > I hate everything about this.  Can't we just use the plugging infrastructure for
> > this and then make sure it re-orders the bios before submitting them?  Also
> > what's to prevent the block layer scheduler from re-arranging these io's?
> > Thanks,
> 
> The block I/O scheduler reorders requests in LBA order, but that happens for a
> newly inserted request against pending requests. If there are no pending
> requests because all requests were already issued, no ordering happen, and even
> worse, if the drive queue is not full yet (e.g. there are free tags), then the
> newly inserted request will be dispatched almost immediately, preventing
> reordering with subsequent incoming write requests to happen.
> 

This sounds like we're depending on specific behavior from the ioscheduler,
which means we're going to have a sad day at some point in the future.

> The other problem is that the mq-deadline scheduler does not track zone WP
> position. Write request issuing is done regardless of the current WP value,
> solely based on LBA ordering. This means that mq-deadline will not prevent
> out-of-order, or rather, unaligned write requests. These will not be detected
> and dispatched whenever possible. The reasons for this are that:
> 1) the disk user (the FS) has to manage zone WP positions anyway. So duplicating
> that management at the block IO scheduler level is inefficient.

I'm not saying it has to manage the WP pointer, and in fact I'm not saying the
scheduler has to do anything at all.  We just need a more generic way to make
sure that bio's submitted in order are kept in order.  So perhaps a hmzoned
scheduler that does just that, and is pinned for these devices.

> 2) Adding zone WP management at the block IO scheduler level would also need a
> write error processing path to resync the WP value in case of failed writes. But
> the user/FS also needs that anyway. Again duplicated functionalities.

Again, no not really.  My point is I want as little block layer knowledge in
btrfs as possible.  I accept we should probably keep track of the WP, it just
makes it easier on everybody if we allocate sequentially.  I'll even allow that
we need to handle the write errors and adjust our WP stuff internally when
things go wrong.

What I'm having a hard time swallowing is having a io scheduler in btrfs proper.
We just ripped out the old one we had because it broke cgroups.  It just adds
extra complexity to an already complex mess.

> 3) The block layer will need a timeout to force issue or cancel pending
> unaligned write requests. This is necessary in case the drive user stops issuing
> writes (for whatever reasons) or the scheduler is being switched. This would
> unnecessarily cause write I/O errors or cause deadlocks if the request queue
> quiesce mode is entered at the wrong time (and I do not see a good way to deal
> with that).

Again we could just pin the hmzoned scheduler to those devices so you can't
switch them.  Or make a hmzoned blk plug and pin no scheduler to these devices.

> 
> blk-mq is already complicated enough. Adding this to the block IO scheduler will
> unnecessarily complicate things further for no real benefits. I would like to
> point out the dm-zoned device mapper and f2fs which are both already dealing
> with write ordering and write error processing directly. Both are fairly
> straightforward but completely different and each optimized for their own structure.
> 

So we're duplicating this effort in 2 places already and adding a 3rd place
seems like a solid plan?  Device-mapper it makes sense, we're sitting squarely
in the block layer so moving around bio's/requests is its very reason for
existing.  I'm not sold on the file system needing to take up this behavior.
This needs to be handled in a more generic way so that all file systems can
share the same mechanism.

I'd even go so far as to say that you could just require using a dm device with
these hmzoned block devices and then handle all of that logic in there if you
didn't feel like doing it generically.  We're already talking about esoteric
devices that require special care to use, adding the extra requirement of
needing to go through device-mapper to use it wouldn't be that big of a stretch.
Thanks,

Josef
Damien Le Moal June 19, 2019, 10:32 a.m. UTC | #6
On 2019/06/18 22:34, Josef Bacik wrote:
> On Mon, Jun 17, 2019 at 03:16:05AM +0000, Damien Le Moal wrote:
>> The block I/O scheduler reorders requests in LBA order, but that happens for a
>> newly inserted request against pending requests. If there are no pending
>> requests because all requests were already issued, no ordering happen, and even
>> worse, if the drive queue is not full yet (e.g. there are free tags), then the
>> newly inserted request will be dispatched almost immediately, preventing
>> reordering with subsequent incoming write requests to happen.
>>
> 
> This sounds like we're depending on specific behavior from the ioscheduler,
> which means we're going to have a sad day at some point in the future.

In a sense yes, we are. But my team and I always make sure that such sad day do
not come. We are always making sure that HM-zoned drives can be used and work as
expected (all RCs and stable versions are tested weekly). For now, getting
guarantees on write requests order mandates the use of the mq-deadline scheduler
as it is currently the only one providing these guarantees. I just sent a patch
to ensure that this scheduler is always available with CONFIG_BLK_DEV_ZONED
enabled (see commit b9aef63aca77 "block: force select mq-deadline for zoned
block devices") and automatically configuring it for HM zoned devices is simply
a matter of adding an udev rule to the system (mq-deadline is the default
scheduler for spinning rust anyway).

>> The other problem is that the mq-deadline scheduler does not track zone WP
>> position. Write request issuing is done regardless of the current WP value,
>> solely based on LBA ordering. This means that mq-deadline will not prevent
>> out-of-order, or rather, unaligned write requests. These will not be detected
>> and dispatched whenever possible. The reasons for this are that:
>> 1) the disk user (the FS) has to manage zone WP positions anyway. So duplicating
>> that management at the block IO scheduler level is inefficient.
> 
> I'm not saying it has to manage the WP pointer, and in fact I'm not saying the
> scheduler has to do anything at all.  We just need a more generic way to make
> sure that bio's submitted in order are kept in order.  So perhaps a hmzoned
> scheduler that does just that, and is pinned for these devices.

This is exactly what mq-deadline does for HM devices: it guarantees that write
bio order submission is kept as is for request dispatching to the disk. The only
missing part is "pinned for these devices". This is not possible now. A user can
still change the scheduler to say BFQ. But in that case, unaligned write errors
will show up very quickly. So this is easy to debug. Not ideal I agree, but that
can be fixed independently of BtrFS support for hmzoned disks.

>> 2) Adding zone WP management at the block IO scheduler level would also need a
>> write error processing path to resync the WP value in case of failed writes. But
>> the user/FS also needs that anyway. Again duplicated functionalities.
> 
> Again, no not really.  My point is I want as little block layer knowledge in
> btrfs as possible.  I accept we should probably keep track of the WP, it just
> makes it easier on everybody if we allocate sequentially.  I'll even allow that
> we need to handle the write errors and adjust our WP stuff internally when
> things go wrong.
> 
> What I'm having a hard time swallowing is having a io scheduler in btrfs proper.
> We just ripped out the old one we had because it broke cgroups.  It just adds
> extra complexity to an already complex mess.

I understand your point. It makes perfect sense. The "IO scheduler" added for
hmzoned case is only the method proposed to implement sequential write issuing
guarantees. The sequential allocation was relatively easy to achieve, but what
is really needed is an atomic "sequential alloc blocks + issue write BIO for
these blocks" so that the block IO schedulker sees sequential write streams per
zone. If only the sequential allocation is achieved, write bios serving these
blocks may be reordered at the FS level and result in write failures since the
block layer scheduler only guarantees preserving the order without any
reordering guarantees for unaligned writes.

>> 3) The block layer will need a timeout to force issue or cancel pending
>> unaligned write requests. This is necessary in case the drive user stops issuing
>> writes (for whatever reasons) or the scheduler is being switched. This would
>> unnecessarily cause write I/O errors or cause deadlocks if the request queue
>> quiesce mode is entered at the wrong time (and I do not see a good way to deal
>> with that).
> 
> Again we could just pin the hmzoned scheduler to those devices so you can't
> switch them.  Or make a hmzoned blk plug and pin no scheduler to these devices.

That is not enough. Pinning the schedulers or using the plugs cannot guarantee
that write requests issued out of order will always be correctly reordered. Even
worse, we cannot implement this. For multiple reason as I stated before.

One example that may illustrates this more easily is this: imagine a user doing
buffered I/Os to an hm disk (e.g. dd if=/dev/zero of=/dev/sdX). The first part
of this execution, that is, allocate a free page, copy the user data and add the
page to the page cache as dirty, is in fact equivalent to an FS sequential block
allocation (the dirty pages are allocated in offset order and added to  the page
cache in that same order).

Most of the time, this will work just fine because the page cache dirty page
writeback code is mostly sequential. Dirty pages for an inode are found in
offset order, packed into write bios and issued sequentially. But start putting
memory pressure on the system, or executing "sync" or other applications in
parallel, and you will start seeing unaligned write errors because the page
cache atomicity is per page so different contexts may end up grabbing dirty
pages in order (as expected) but issuing interleaved write bios out of order.
And this type of problem *cannot* be handled in the block layer (plug or
scheduler) because stopping execution of a bio expecting that another bio will
come is very dangerous as there are no guarantees that such bio will ever be
issued. In the case of the page cache flush, this is actually a real eventuality
as memory allocation needed for issuing a bio may depend on the completion of
already issued bios, and if we cannot dispatch those, then we can deadlock.

This is an extreme example. This is unlikely but still a real possibility.
Similarly to your position, that is, the FS should not know anything about the
block layer, the block layer position is that it cannot rely on a specific
behavior from the upper layers. Essentially, all bios are independent and
treated as such.

For HM devices, we needed sequential write guarantees, but could not break the
independence of write requests. So what we did is simply guarantee that the
dispatch order is preserved from the issuing order, nothing else. There is no
"buffering" possible and no checks regarding the sequentiality of writes.

As a result, the sequential write constraint of the disks is directly exposed to
the disk user (FS or DM).

>> blk-mq is already complicated enough. Adding this to the block IO scheduler will
>> unnecessarily complicate things further for no real benefits. I would like to
>> point out the dm-zoned device mapper and f2fs which are both already dealing
>> with write ordering and write error processing directly. Both are fairly
>> straightforward but completely different and each optimized for their own structure.
>>
> 
> So we're duplicating this effort in 2 places already and adding a 3rd place
> seems like a solid plan?  Device-mapper it makes sense, we're sitting squarely
> in the block layer so moving around bio's/requests is its very reason for
> existing.  I'm not sold on the file system needing to take up this behavior.
> This needs to be handled in a more generic way so that all file systems can
> share the same mechanism.

I understand your point. But I am afraid it is not easily possible. The reason
is that for an FS, to achieve sequential write streams in zones, one need an
atomic (or serialized) execution of "block allocation + wrtite bio issuing".
Both combined achieve a sequential write stream that mq-deadline will preserve
and everything will work as intended. This is obviously not easily possible in a
generic manner for all FSes. In f2fs, this was rather easy to do without
changing a lot of code by simply using a mutex to have the 2 operations
atomically executed without any noticeable performance impact. A similar method
in BtrFS is not possible because of async checksum and async compression which
can result in btrfs_map_bio() execution in an order that is different from the
extent allocation order.

> 
> I'd even go so far as to say that you could just require using a dm device with
> these hmzoned block devices and then handle all of that logic in there if you
> didn't feel like doing it generically.  We're already talking about esoteric
> devices that require special care to use, adding the extra requirement of
> needing to go through device-mapper to use it wouldn't be that big of a stretch.

HM drives are not so "esoteric" anymore. Entire data centers are starting
running on them. And getting BtrFS to work natively on HM drives would be a huge
step toward facilitating their use, and remove this "esoteric" label :)

Back to your point, using a dm to do the reordering is possible, but requires
temporary persistent backup of the out-of-order BIOs due to the reasons pointed
out above (dependency of memory allocation failure/success on bio completion).
This is basically what dm-zoned does, using conventional zones to store
out-of-order writes in conventional zones. Such generic DM is enough to run any
file system (ext4 or XFS run perfectly fine on dm-zoned), but come at the cost
of needing garbage collection with a huge impact on performance. The simple
addition of Naohiro's write bio ordering feature in BtrFS avoids all this and
preserves performance. I really understand your desire to reduce complexity. But
in the end, this is only a "sorted list" that is well controlled within btrfs
itself and avoids dependency on the behavior of other components beside the
block IO scheduler.

We could envision to make such feature generic, implementing it as a block layer
object. But its use would still be needed in btrfs. Since f2fs and dm-zoned do
not require it, btrfs would be the sole user though, so for now at least, this
generic implementation has I think little value. We can work on trying to
isolate this bio reordering code more so  that it is easier to remove and use a
future generic implementation. Would that help in addressing your concerns ?

Thank you for your comments.

Best regards.

Patch
diff mbox series

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f4bcd2a6ec12..ade6d8243962 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -718,6 +718,9 @@  struct btrfs_block_group_cache {
 	 */
 	enum btrfs_alloc_type alloc_type;
 	u64 alloc_offset;
+	struct mutex submit_lock;
+	u64 submit_offset;
+	struct bio_list submit_buffer;
 };
 
 /* delayed seq elem */
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ae2c895d08c4..ebdc7a6dbe01 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -124,6 +124,7 @@  void btrfs_put_block_group(struct btrfs_block_group_cache *cache)
 	if (atomic_dec_and_test(&cache->count)) {
 		WARN_ON(cache->pinned > 0);
 		WARN_ON(cache->reserved > 0);
+		WARN_ON(!bio_list_empty(&cache->submit_buffer));
 
 		/*
 		 * If not empty, someone is still holding mutex of
@@ -10511,6 +10512,8 @@  btrfs_get_block_group_alloc_offset(struct btrfs_block_group_cache *cache)
 		goto out;
 	}
 
+	cache->submit_offset = logical + cache->alloc_offset;
+
 out:
 	cache->alloc_type = alloc_type;
 	kfree(alloc_offsets);
@@ -10547,6 +10550,7 @@  btrfs_create_block_group_cache(struct btrfs_fs_info *fs_info,
 
 	atomic_set(&cache->count, 1);
 	spin_lock_init(&cache->lock);
+	mutex_init(&cache->submit_lock);
 	init_rwsem(&cache->data_rwsem);
 	INIT_LIST_HEAD(&cache->list);
 	INIT_LIST_HEAD(&cache->cluster_list);
@@ -10554,6 +10558,7 @@  btrfs_create_block_group_cache(struct btrfs_fs_info *fs_info,
 	INIT_LIST_HEAD(&cache->ro_list);
 	INIT_LIST_HEAD(&cache->dirty_list);
 	INIT_LIST_HEAD(&cache->io_list);
+	bio_list_init(&cache->submit_buffer);
 	btrfs_init_free_space_ctl(cache);
 	atomic_set(&cache->trimming, 0);
 	mutex_init(&cache->free_space_lock);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 52d0d458c0fd..26a64a53032f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -29,6 +29,11 @@ 
 #include "sysfs.h"
 #include "tree-checker.h"
 
+struct map_bio_data {
+	void *orig_bi_private;
+	int mirror_num;
+};
+
 const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 	[BTRFS_RAID_RAID10] = {
 		.sub_stripes	= 2,
@@ -523,6 +528,7 @@  static void requeue_list(struct btrfs_pending_bios *pending_bios,
 		pending_bios->tail = tail;
 }
 
+
 /*
  * we try to collect pending bios for a device so we don't get a large
  * number of procs sending bios down to the same device.  This greatly
@@ -606,6 +612,8 @@  static noinline void run_scheduled_bios(struct btrfs_device *device)
 	spin_unlock(&device->io_lock);
 
 	while (pending) {
+		struct btrfs_bio *bbio;
+		struct completion *sent = NULL;
 
 		rmb();
 		/* we want to work on both lists, but do more bios on the
@@ -643,7 +651,12 @@  static noinline void run_scheduled_bios(struct btrfs_device *device)
 			sync_pending = 0;
 		}
 
+		bbio = cur->bi_private;
+		if (bbio)
+			sent = bbio->sent;
 		btrfsic_submit_bio(cur);
+		if (sent)
+			complete(sent);
 		num_run++;
 		batch_run++;
 
@@ -5916,6 +5929,7 @@  static struct btrfs_bio *alloc_btrfs_bio(int total_stripes, int real_stripes)
 
 	atomic_set(&bbio->error, 0);
 	refcount_set(&bbio->refs, 1);
+	INIT_LIST_HEAD(&bbio->list);
 
 	return bbio;
 }
@@ -6730,7 +6744,7 @@  static void btrfs_end_bio(struct bio *bio)
  * the work struct is scheduled.
  */
 static noinline void btrfs_schedule_bio(struct btrfs_device *device,
-					struct bio *bio)
+					struct bio *bio, int need_seqwrite)
 {
 	struct btrfs_fs_info *fs_info = device->fs_info;
 	int should_queue = 1;
@@ -6738,7 +6752,12 @@  static noinline void btrfs_schedule_bio(struct btrfs_device *device,
 
 	/* don't bother with additional async steps for reads, right now */
 	if (bio_op(bio) == REQ_OP_READ) {
+		struct btrfs_bio *bbio = bio->bi_private;
+		struct completion *sent = bbio->sent;
+
 		btrfsic_submit_bio(bio);
+		if (sent)
+			complete(sent);
 		return;
 	}
 
@@ -6746,7 +6765,7 @@  static noinline void btrfs_schedule_bio(struct btrfs_device *device,
 	bio->bi_next = NULL;
 
 	spin_lock(&device->io_lock);
-	if (op_is_sync(bio->bi_opf))
+	if (op_is_sync(bio->bi_opf) && need_seqwrite == 0)
 		pending_bios = &device->pending_sync_bios;
 	else
 		pending_bios = &device->pending_bios;
@@ -6785,8 +6804,21 @@  static void submit_stripe_bio(struct btrfs_bio *bbio, struct bio *bio,
 
 	btrfs_bio_counter_inc_noblocked(fs_info);
 
+	/* queue all bios into scheduler if sequential write is required */
+	if (bbio->need_seqwrite) {
+		if (!async) {
+			DECLARE_COMPLETION_ONSTACK(sent);
+
+			bbio->sent = &sent;
+			btrfs_schedule_bio(dev, bio, bbio->need_seqwrite);
+			wait_for_completion_io(&sent);
+		} else {
+			btrfs_schedule_bio(dev, bio, bbio->need_seqwrite);
+		}
+		return;
+	}
 	if (async)
-		btrfs_schedule_bio(dev, bio);
+		btrfs_schedule_bio(dev, bio, bbio->need_seqwrite);
 	else
 		btrfsic_submit_bio(bio);
 }
@@ -6808,9 +6840,10 @@  static void bbio_error(struct btrfs_bio *bbio, struct bio *bio, u64 logical)
 	}
 }
 
+
 static blk_status_t __btrfs_map_bio(struct btrfs_fs_info *fs_info,
 				    struct bio *bio, int mirror_num,
-				    int async_submit)
+				    int async_submit, int need_seqwrite)
 {
 	struct btrfs_device *dev;
 	struct bio *first_bio = bio;
@@ -6838,6 +6871,7 @@  static blk_status_t __btrfs_map_bio(struct btrfs_fs_info *fs_info,
 	bbio->private = first_bio->bi_private;
 	bbio->end_io = first_bio->bi_end_io;
 	bbio->fs_info = fs_info;
+	bbio->need_seqwrite = need_seqwrite;
 	atomic_set(&bbio->stripes_pending, bbio->num_stripes);
 
 	if ((bbio->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK) &&
@@ -6885,10 +6919,131 @@  static blk_status_t __btrfs_map_bio(struct btrfs_fs_info *fs_info,
 	return BLK_STS_OK;
 }
 
+static blk_status_t __btrfs_map_bio_zoned(struct btrfs_fs_info *fs_info,
+					  struct bio *cur_bio, int mirror_num,
+					  int async_submit)
+{
+	u64 logical = (u64)cur_bio->bi_iter.bi_sector << SECTOR_SHIFT;
+	u64 length = cur_bio->bi_iter.bi_size;
+	struct bio *bio;
+	struct bio *next;
+	struct bio_list submit_list;
+	struct btrfs_block_group_cache *cache = NULL;
+	struct map_bio_data *map_private;
+	int sent;
+	blk_status_t ret;
+
+	WARN_ON(bio_op(cur_bio) != REQ_OP_WRITE);
+
+	cache = btrfs_lookup_block_group(fs_info, logical);
+	if (!cache || cache->alloc_type != BTRFS_ALLOC_SEQ) {
+		if (cache)
+			btrfs_put_block_group(cache);
+		return __btrfs_map_bio(fs_info, cur_bio, mirror_num,
+				       async_submit, 0);
+	}
+
+	mutex_lock(&cache->submit_lock);
+	if (cache->submit_offset == logical)
+		goto send_bios;
+
+	if (cache->submit_offset > logical) {
+		trace_btrfs_bio_before_write_pointer(cache, cur_bio);
+		mutex_unlock(&cache->submit_lock);
+		btrfs_put_block_group(cache);
+		WARN_ON_ONCE(1);
+		return BLK_STS_IOERR;
+	}
+
+	/* buffer the unaligned bio */
+	map_private = kmalloc(sizeof(*map_private), GFP_NOFS);
+	if (!map_private) {
+		mutex_unlock(&cache->submit_lock);
+		return errno_to_blk_status(-ENOMEM);
+	}
+
+	map_private->orig_bi_private = cur_bio->bi_private;
+	map_private->mirror_num = mirror_num;
+	cur_bio->bi_private = map_private;
+
+	bio_list_add(&cache->submit_buffer, cur_bio);
+	mutex_unlock(&cache->submit_lock);
+	btrfs_put_block_group(cache);
+
+	/* mimic a good result ... */
+	return BLK_STS_OK;
+
+send_bios:
+	mutex_unlock(&cache->submit_lock);
+	/* send this bio */
+	ret = __btrfs_map_bio(fs_info, cur_bio, mirror_num, 1, 1);
+	if (ret != BLK_STS_OK) {
+		/* TODO kill buffered bios */
+		return ret;
+	}
+
+loop:
+	/* and send previously buffered following bios */
+	mutex_lock(&cache->submit_lock);
+	cache->submit_offset += length;
+	length = 0;
+	bio_list_init(&submit_list);
+
+	/* collect sequential bios into submit_list */
+	do {
+		sent = 0;
+		bio = bio_list_get(&cache->submit_buffer);
+		while (bio) {
+			u64 logical =
+				(u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
+			struct bio_list *target;
+
+			next = bio->bi_next;
+			bio->bi_next = NULL;
+
+			if (logical == cache->submit_offset + length) {
+				sent = 1;
+				length += bio->bi_iter.bi_size;
+				target = &submit_list;
+			} else {
+				target = &cache->submit_buffer;
+			}
+			bio_list_add(target, bio);
+
+			bio = next;
+		}
+	} while (sent);
+	mutex_unlock(&cache->submit_lock);
+
+	/* send the collected bios */
+	while ((bio = bio_list_pop(&submit_list)) != NULL) {
+		map_private = (struct map_bio_data *)bio->bi_private;
+		mirror_num = map_private->mirror_num;
+		bio->bi_private = map_private->orig_bi_private;
+		kfree(map_private);
+
+		ret = __btrfs_map_bio(fs_info, bio, mirror_num, 1, 1);
+		if (ret) {
+			bio->bi_status = ret;
+			bio_endio(bio);
+		}
+	}
+
+	if (length)
+		goto loop;
+	btrfs_put_block_group(cache);
+
+	return BLK_STS_OK;
+}
+
 blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 			   int mirror_num, int async_submit)
 {
-	return __btrfs_map_bio(fs_info, bio, mirror_num, async_submit);
+	if (btrfs_fs_incompat(fs_info, HMZONED) && bio_op(bio) == REQ_OP_WRITE)
+		return __btrfs_map_bio_zoned(fs_info, bio, mirror_num,
+					     async_submit);
+
+	return __btrfs_map_bio(fs_info, bio, mirror_num, async_submit, 0);
 }
 
 /*
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index f66755e43669..e97d13cb1627 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -329,6 +329,9 @@  struct btrfs_bio {
 	int mirror_num;
 	int num_tgtdevs;
 	int *tgtdev_map;
+	int need_seqwrite;
+	struct list_head list;
+	struct completion *sent;
 	/*
 	 * logical block numbers for the start of each stripe
 	 * The last one or two are p/q.  These are sorted,
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index fe4d268028ee..2b4cd791bf24 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -2091,6 +2091,47 @@  DEFINE_BTRFS_LOCK_EVENT(btrfs_try_tree_read_lock);
 DEFINE_BTRFS_LOCK_EVENT(btrfs_try_tree_write_lock);
 DEFINE_BTRFS_LOCK_EVENT(btrfs_tree_read_lock_atomic);
 
+DECLARE_EVENT_CLASS(btrfs_hmzoned_bio_buffer_events,
+	TP_PROTO(const struct btrfs_block_group_cache *cache,
+		 const struct bio *bio),
+
+	TP_ARGS(cache, bio),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64,	block_group	)
+		__field(	u64,	flags		)
+		__field(	u64,	submit_pos	)
+		__field(	u64,	logical	)
+		__field(	u64,	length		)
+	),
+
+	TP_fast_assign_btrfs(cache->fs_info,
+		__entry->block_group = cache->key.objectid;
+		__entry->flags = cache->flags;
+		__entry->submit_pos = cache->submit_offset;
+		__entry->logical = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
+		__entry->length = bio->bi_iter.bi_size;
+	),
+
+	TP_printk_btrfs(
+		"block_group=%llu(%s) submit_pos=%llu logical=%llu length=%llu",
+		__entry->block_group,
+		__print_flags((unsigned long)__entry->flags, "|",
+			      BTRFS_GROUP_FLAGS),
+		__entry->submit_pos, __entry->logical,
+		__entry->length)
+);
+
+#define DEFINE_BTRFS_HMZONED_BIO_BUF_EVENT(name)			\
+DEFINE_EVENT(btrfs_hmzoned_bio_buffer_events, name,			\
+	     TP_PROTO(const struct btrfs_block_group_cache *cache,	\
+		      const struct bio *bio),				\
+									\
+	     TP_ARGS(cache, bio)					\
+)
+
+DEFINE_BTRFS_HMZONED_BIO_BUF_EVENT(btrfs_bio_before_write_pointer);
+
 #endif /* _TRACE_BTRFS_H */
 
 /* This part must be outside protection */