
[v5,01/11] block: make generic_make_request handle arbitrarily sized bios

Message ID 1438412290.26596.14.camel@hasee (mailing list archive)
State Accepted, archived
Delegated to: Mike Snitzer

Commit Message

Ming Lin Aug. 1, 2015, 6:58 a.m. UTC
On Fri, 2015-07-31 at 17:38 -0400, Mike Snitzer wrote:
> On Fri, Jul 31 2015 at  5:19pm -0400,
> Ming Lin <mlin@kernel.org> wrote:
> 
> > On Fri, Jul 31, 2015 at 12:23 PM, Mike Snitzer <snitzer@redhat.com> wrote:
> > > On Mon, Jul 06 2015 at  3:44P -0400,
> > > Ming Lin <mlin@kernel.org> wrote:
> > >
> > >> From: Kent Overstreet <kent.overstreet@gmail.com>
> > >>
> > >> The way the block layer is currently written, it goes to great lengths
> > >> to avoid having to split bios; upper layer code (such as bio_add_page())
> > >> checks what the underlying device can handle and tries to always create
> > >> bios that don't need to be split.
> > >>
> > >> But this approach becomes unwieldy and eventually breaks down with
> > >> stacked devices and devices with dynamic limits, and it adds a lot of
> > >> complexity. If the block layer could split bios as needed, we could
> > >> eliminate a lot of complexity elsewhere - particularly in stacked
> > >> drivers. Code that creates bios can then create whatever size bios are
> > >> convenient, and more importantly stacked drivers don't have to deal with
> > >> both their own bio size limitations and the limitations of the
> > >> (potentially multiple) devices underneath them.  In the future this will
> > >> let us delete merge_bvec_fn and a bunch of other code.
> > >>
> > >> We do this by adding calls to blk_queue_split() to the various
> > >> make_request functions that need it - a few can already handle arbitrary
> > >> size bios. Note that we add the call _after_ any call to
> > >> blk_queue_bounce(); this means that blk_queue_split() and
> > >> blk_recalc_rq_segments() don't need to be concerned with bouncing
> > >> affecting segment merging.
> > >>
> > >> Some make_request_fn() callbacks were simple enough to audit and verify
> > >> they don't need blk_queue_split() calls. The skipped ones are:
> > >>
> > >>  * nfhd_make_request (arch/m68k/emu/nfblock.c)
> > >>  * axon_ram_make_request (arch/powerpc/sysdev/axonram.c)
> > >>  * simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c)
> > >>  * brd_make_request (ramdisk - drivers/block/brd.c)
> > >>  * mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c)
> > >>  * loop_make_request
> > >>  * null_queue_bio
> > >>  * bcache's make_request fns
> > >>
> > >> Some others are almost certainly safe to remove now, but will be left
> > >> for future patches.
> > >>
> > >> Cc: Jens Axboe <axboe@kernel.dk>
> > >> Cc: Christoph Hellwig <hch@infradead.org>
> > >> Cc: Al Viro <viro@zeniv.linux.org.uk>
> > >> Cc: Ming Lei <ming.lei@canonical.com>
> > >> Cc: Neil Brown <neilb@suse.de>
> > >> Cc: Alasdair Kergon <agk@redhat.com>
> > >> Cc: Mike Snitzer <snitzer@redhat.com>
> > >> Cc: dm-devel@redhat.com
> > >> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
> > >> Cc: drbd-user@lists.linbit.com
> > >> Cc: Jiri Kosina <jkosina@suse.cz>
> > >> Cc: Geoff Levand <geoff@infradead.org>
> > >> Cc: Jim Paris <jim@jtan.com>
> > >> Cc: Joshua Morris <josh.h.morris@us.ibm.com>
> > >> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
> > >> Cc: Minchan Kim <minchan@kernel.org>
> > >> Cc: Nitin Gupta <ngupta@vflare.org>
> > >> Cc: Oleg Drokin <oleg.drokin@intel.com>
> > >> Cc: Andreas Dilger <andreas.dilger@intel.com>
> > >> Acked-by: NeilBrown <neilb@suse.de> (for the 'md/md.c' bits)
> > >> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
> > >> [dpark: skip more mq-based drivers, resolve merge conflicts, etc.]
> > >> Signed-off-by: Dongsu Park <dpark@posteo.net>
> > >> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
> > > ...
> > >> diff --git a/block/blk-merge.c b/block/blk-merge.c
> > >> index 30a0d9f..3707f30 100644
> > >> --- a/block/blk-merge.c
> > >> +++ b/block/blk-merge.c
> > >> @@ -9,12 +9,158 @@
> > >>
> > >>  #include "blk.h"
> > >>
> > >> +static struct bio *blk_bio_discard_split(struct request_queue *q,
> > >> +                                      struct bio *bio,
> > >> +                                      struct bio_set *bs)
> > >> +{
> > >> +     unsigned int max_discard_sectors, granularity;
> > >> +     int alignment;
> > >> +     sector_t tmp;
> > >> +     unsigned split_sectors;
> > >> +
> > >> +     /* Zero-sector (unknown) and one-sector granularities are the same.  */
> > >> +     granularity = max(q->limits.discard_granularity >> 9, 1U);
> > >> +
> > >> +     max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9);
> > >> +     max_discard_sectors -= max_discard_sectors % granularity;
> > >> +
> > >> +     if (unlikely(!max_discard_sectors)) {
> > >> +             /* XXX: warn */
> > >> +             return NULL;
> > >> +     }
> > >> +
> > >> +     if (bio_sectors(bio) <= max_discard_sectors)
> > >> +             return NULL;
> > >> +
> > >> +     split_sectors = max_discard_sectors;
> > >> +
> > >> +     /*
> > >> +      * If the next starting sector would be misaligned, stop the discard at
> > >> +      * the previous aligned sector.
> > >> +      */
> > >> +     alignment = (q->limits.discard_alignment >> 9) % granularity;
> > >> +
> > >> +     tmp = bio->bi_iter.bi_sector + split_sectors - alignment;
> > >> +     tmp = sector_div(tmp, granularity);
> > >> +
> > >> +     if (split_sectors > tmp)
> > >> +             split_sectors -= tmp;
> > >> +
> > >> +     return bio_split(bio, split_sectors, GFP_NOIO, bs);
> > >> +}
> > >
> > > This code to stop the discard at the previous aligned sector could be
> > > the reason why I have 2 device-mapper-test-suite tests in the
> > > 'thin-provisioning' testsuite failing due to this patchset:
> > 
> > I'm setting up the testsuite to debug.
> 
> OK, once setup, to run the 2 tests in question directly you'd do
> something like:
> 
> dmtest run --suite thin-provisioning -n discard_a_fragmented_device
> 
> dmtest run --suite thin-provisioning -n discard_fully_provisioned_device_benchmark
> 
> Again, these tests pass without this patchset.

It's caused by patch 4.
When the discard size is >= 4G, bio->bi_iter.bi_size overflows.
Below is the new patch.

Christoph,
Could you also help to review it?

Note that we still do the "misaligned" check in blkdev_issue_discard(),
so the equivalent code in blk_bio_discard_split() was removed.
Please see
https://git.kernel.org/cgit/linux/kernel/git/mlin/linux.git/commit/?h=block-generic-req&id=dcc5d9c41

I have updated both patch 1 & 4 on my tree.

commit 9607f737de9c4ca1a81655c320a61c287bf77bf5
Author: Ming Lin <ming.l@ssi.samsung.com>
Date:   Fri May 22 00:46:56 2015 -0700

    block: remove split code in blkdev_issue_discard
    
    The split code in blkdev_issue_discard() can go away now
    that any driver that cares does the split; all we have
    to do is make sure the bio size doesn't overflow.
    
    Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
---
 block/blk-lib.c | 16 +++-------------
 1 file changed, 3 insertions(+), 13 deletions(-)



--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Comments

Mike Snitzer Aug. 1, 2015, 4:33 p.m. UTC | #1
On Sat, Aug 01 2015 at  2:58am -0400,
Ming Lin <mlin@kernel.org> wrote:

> On Fri, 2015-07-31 at 17:38 -0400, Mike Snitzer wrote:
> > 
> > OK, once setup, to run the 2 tests in question directly you'd do
> > something like:
> > 
> > dmtest run --suite thin-provisioning -n discard_a_fragmented_device
> > 
> > dmtest run --suite thin-provisioning -n discard_fully_provisioned_device_benchmark
> > 
> > Again, these tests pass without this patchset.
> 
> It's caused by patch 4.
> When the discard size is >= 4G, bio->bi_iter.bi_size overflows.

Thanks for tracking this down!

> Below is the new patch.
> 
> Christoph,
> Could you also help to review it?
> 
> Note that we still do the "misaligned" check in blkdev_issue_discard(),
> so the equivalent code in blk_bio_discard_split() was removed.

But I don't agree with this approach.  One of the most meaningful
benefits of late bio splitting is that the upper layers shouldn't _need_
to depend on the intermediate devices' queue_limits being stacked
properly.  Your solution mixes discard granularity/alignment checks at
the upper layer(s) with splitting on max_discard_sectors at the lower
layer, which defeats that benefit for discards.

This means every intermediate layer that might split discards would
need to worry about granularity/alignment too (e.g. dm-thinp would
have to care because it must generate discard mappings, with associated
bios, based on how blocks were mapped to thinp).

Also, it is unfortunate that IO without a payload is being
artificially split simply because bio->bi_iter.bi_size is 32 bits.

Mike

Martin K. Petersen Aug. 8, 2015, 4:19 p.m. UTC | #2
>>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:

Mike> This will translate to all intermediate layers that might split
Mike> discards needing to worry about granularity/alignment too
Mike> (e.g. how dm-thinp will have to care because it must generate
Mike> discard mappings with associated bios based on how blocks were
Mike> mapped to thinp).

The fundamental issue here is that alignment and granularity should
never, ever have been enforced at the top of the stack. Horrendous idea
from the very beginning.

For the < handful of braindead devices that get confused when you do
partial or misaligned blocks, we should have had a quirk that did any
range adjusting at the bottom, in sd_setup_discard_cmnd().

There's a reason I turned discard_zeroes_data off for UNMAP!

Wrt. the range size I don't have a problem with capping at the 32-bit
bi_size limit. We probably don't want to send commands much bigger than
that anyway.

Patch

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 7688ee3..b9e2fca 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -43,7 +43,7 @@  int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 	DECLARE_COMPLETION_ONSTACK(wait);
 	struct request_queue *q = bdev_get_queue(bdev);
 	int type = REQ_WRITE | REQ_DISCARD;
-	unsigned int max_discard_sectors, granularity;
+	unsigned int granularity;
 	int alignment;
 	struct bio_batch bb;
 	struct bio *bio;
@@ -60,17 +60,6 @@  int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 	granularity = max(q->limits.discard_granularity >> 9, 1U);
 	alignment = (bdev_discard_alignment(bdev) >> 9) % granularity;
 
-	/*
-	 * Ensure that max_discard_sectors is of the proper
-	 * granularity, so that requests stay aligned after a split.
-	 */
-	max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9);
-	max_discard_sectors -= max_discard_sectors % granularity;
-	if (unlikely(!max_discard_sectors)) {
-		/* Avoid infinite loop below. Being cautious never hurts. */
-		return -EOPNOTSUPP;
-	}
-
 	if (flags & BLKDEV_DISCARD_SECURE) {
 		if (!blk_queue_secdiscard(q))
 			return -EOPNOTSUPP;
@@ -92,7 +81,8 @@  int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 			break;
 		}
 
-		req_sects = min_t(sector_t, nr_sects, max_discard_sectors);
+		/* Make sure bi_size doesn't overflow */
+		req_sects = min_t(sector_t, nr_sects, UINT_MAX >> 9);
 
 		/*
 		 * If splitting a request, and the next starting sector would be