diff mbox series

[v2,2/3] block: Split and submit bios in LBA order

Message ID 20230320234905.3832131-3-bvanassche@acm.org (mailing list archive)
State New, archived
Headers show
Series Submit zoned requests in LBA order per zone | expand

Commit Message

Bart Van Assche March 20, 2023, 11:49 p.m. UTC
Submit the bio fragment with the lowest LBA first. This approach prevents
write errors when submitting large bios to host-managed zoned block devices.
This patch only modifies the behavior of drivers that call
bio_split_to_limits() directly. This includes DRBD, pktcdvd, dm, md and
the NVMe multipath code.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-merge.c | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

Comments

Christoph Hellwig March 21, 2023, 6:03 a.m. UTC | #1
On Mon, Mar 20, 2023 at 04:49:04PM -0700, Bart Van Assche wrote:
> Submit the bio fragment with the lowest LBA first. This approach prevents
> write errors when submitting large bios to host-managed zoned block devices.
> This patch only modifies the behavior of drivers that call
> bio_split_to_limits() directly. This includes DRBD, pktcdvd, dm, md and
> the NVMe multipath code.

Umm, doesn't it also change how blk-mq splits, which is the prime
reason why you're looking into that?

> +		if (current->bio_list) {
> +			/*
> +			 * The caller will submit the first half ('split')
> +			 * before the second half ('bio').
> +			 */
> +			bio_chain(split, bio);
> +			submit_bio_noacct(bio);
> +			return split;
> +		}
> +		/*
> +		 * Submit the first half ('split') let the caller submit the
> +		 * second half ('bio').
> +		 */
> +		*nr_segs = bio_chain_nr_segments(bio, lim);
> +		bio_chain(split, bio);
> +		submit_bio_noacct(split);

I'm really confused on why you want to change the behavior here
for the case where run in a stacking context vs not, and neither
the comments nor the commit log help me trying to figure out why.
Christoph Hellwig March 21, 2023, 6:05 a.m. UTC | #2
On Tue, Mar 21, 2023 at 07:03:07AM +0100, Christoph Hellwig wrote:
> > +		if (current->bio_list) {
> > +			/*
> > +			 * The caller will submit the first half ('split')
> > +			 * before the second half ('bio').
> > +			 */
> > +			bio_chain(split, bio);
> > +			submit_bio_noacct(bio);
> > +			return split;
> > +		}
> > +		/*
> > +		 * Submit the first half ('split') let the caller submit the
> > +		 * second half ('bio').
> > +		 */
> > +		*nr_segs = bio_chain_nr_segments(bio, lim);
> > +		bio_chain(split, bio);
> > +		submit_bio_noacct(split);
> 
> I'm really confused on why you want to change the behavior here
> for the case where run in a stacking context vs not, and neither
> the comments nor the commit log help me trying to figure out why.

In fact I wonder how you managed to get into __bio_split_to_limits
wtih a NULL current->bio_list at all.
Bart Van Assche March 21, 2023, 6:44 p.m. UTC | #3
On 3/20/23 23:05, Christoph Hellwig wrote:
> In fact I wonder how you managed to get into __bio_split_to_limits
> wtih a NULL current->bio_list at all.

Hi Christoph,

This patch series is what I came up with after having observed an 
UNALIGNED WRITE COMMAND response with a 5.15 kernel. In that kernel 
version (but probably not in the latest upstream kernel) the device 
mapper core submits split bios in the wrong order. I will check again 
whether this patch is really necessary.

Thanks,

Bart.
Christoph Hellwig March 23, 2023, 8:22 a.m. UTC | #4
On Tue, Mar 21, 2023 at 11:44:59AM -0700, Bart Van Assche wrote:
> On 3/20/23 23:05, Christoph Hellwig wrote:
>> In fact I wonder how you managed to get into __bio_split_to_limits
>> wtih a NULL current->bio_list at all.
>
> Hi Christoph,
>
> This patch series is what I came up with after having observed an UNALIGNED 
> WRITE COMMAND response with a 5.15 kernel. In that kernel version (but 
> probably not in the latest upstream kernel) the device mapper core submits 
> split bios in the wrong order. I will check again whether this patch is 
> really necessary.

How did that escape the upper layer serialization of zoned writes?

But my point is that all ->submit_bio instances as well as
blk_mq_submit_bio are only called with a non-NULL current->bio_list.
So I don't think you can reach the NULL current->bio_list case in this
patch.

Note that even without any specific zone device isss, submitting bios
in LBA oders works best for both spinning and solid state media,
but all of the arguments here don't add up.
diff mbox series

Patch

diff --git a/block/blk-merge.c b/block/blk-merge.c
index d6f8552ef209..7281f2d91b2f 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -345,8 +345,8 @@  EXPORT_SYMBOL_GPL(bio_split_rw);
  * @nr_segs: returns the number of segments in the returned bio
  *
  * Check if @bio needs splitting based on the queue limits, and if so split off
- * a bio fitting the limits from the beginning of @bio and return it.  @bio is
- * shortened to the remainder and re-submitted.
+ * a bio fitting the limits from the beginning of @bio. @bio is shortened to
+ * the remainder.
  *
  * The split bio is allocated from @q->bio_split, which is provided by the
  * block layer.
@@ -379,10 +379,23 @@  struct bio *__bio_split_to_limits(struct bio *bio,
 		split->bi_opf |= REQ_NOMERGE;
 
 		blkcg_bio_issue_init(split);
-		bio_chain(split, bio);
 		trace_block_split(split, bio->bi_iter.bi_sector);
-		submit_bio_noacct(bio);
-		return split;
+		if (current->bio_list) {
+			/*
+			 * The caller will submit the first half ('split')
+			 * before the second half ('bio').
+			 */
+			bio_chain(split, bio);
+			submit_bio_noacct(bio);
+			return split;
+		}
+		/*
+		 * Submit the first half ('split') let the caller submit the
+		 * second half ('bio').
+		 */
+		*nr_segs = bio_chain_nr_segments(bio, lim);
+		bio_chain(split, bio);
+		submit_bio_noacct(split);
 	}
 	return bio;
 }