diff mbox

[v5,01/11] block: make generic_make_request handle arbitrarily sized bios

Message ID 1438990806.24452.8.camel@ssi (mailing list archive)
State Accepted, archived
Delegated to: Mike Snitzer
Headers show

Commit Message

Ming Lin Aug. 7, 2015, 11:40 p.m. UTC
On Fri, 2015-08-07 at 09:30 +0200, Christoph Hellwig wrote:
> I'm for solution 3:
> 
>  - keep blk_bio_{discard,write_same}_split, but ensure we never built
>    a > 4GB bio in blkdev_issue_{discard,write_same}.

This has problem as I mentioned in solution 1.
We need to also make sure max discard size is of proper granularity.
See below example.

      4G: 8388608 sectors
UINT_MAX: 8388607 sectors

dm-thinp block size = default discard granularity = 128 sectors

blkdev_issue_discard(sector=0, nr_sectors=8388608)

1. Only ensure bi_size not overflow

It doesn't work.

[start_sector, end_sector]
[0, 8388607]
    [0, 8388606], then dm-thinp splits it to 2 bios
        [0, 8388479]
        [8388480, 8388606] ---> this has problem in process_discard_bio(),
                                because the discard size(7 sectors) covers less than a block(128 sectors)
    [8388607, 8388607] ---> same problem 

2. Ensure bi_size not overflow and max discard size is of proper granularity

It works.

[start_sector, end_sector]
[0, 8388607]
    [0, 8388479]
    [8388480, 8388607]


So how about below patch?

commit 1ca2ad977255efb3c339f4ca16fb798ed5ec54f7
Author: Ming Lin <ming.l@ssi.samsung.com>
Date:   Fri Aug 7 15:07:07 2015 -0700

    block: remove split code in blkdev_issue_{discard,write_same}
    
    The split code in blkdev_issue_{discard,write_same} can go away
    now that any driver that cares does the split. We have to make
    sure bio size doesn't overflow.
    
    For discard, we ensure max_discard_sectors is of the proper
    granularity. So if discard size > 4G, blkdev_issue_discard() always
    send multiple granularity requests to lower level, except that the
    last one may be not multiple granularity.
    
    Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
---
 block/blk-lib.c | 37 +++++++++----------------------------
 1 file changed, 9 insertions(+), 28 deletions(-)



> 
> Note that this isn't special casing, we can't build > 4GB bios for
> data either, it's just implemented as a side effect right now instead
> of checked explicitly.




--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Comments

Kent Overstreet Aug. 8, 2015, 12:30 a.m. UTC | #1
On Fri, Aug 07, 2015 at 04:40:06PM -0700, Ming Lin wrote:
> 
> On Fri, 2015-08-07 at 09:30 +0200, Christoph Hellwig wrote:
> > I'm for solution 3:
> > 
> >  - keep blk_bio_{discard,write_same}_split, but ensure we never built
> >    a > 4GB bio in blkdev_issue_{discard,write_same}.
> 
> This has problem as I mentioned in solution 1.
> We need to also make sure max discard size is of proper granularity.
> See below example.
> 
>       4G: 8388608 sectors
> UINT_MAX: 8388607 sectors
> 
> dm-thinp block size = default discard granularity = 128 sectors
> 
> blkdev_issue_discard(sector=0, nr_sectors=8388608)
> 
> 1. Only ensure bi_size not overflow
> 
> It doesn't work.
> 
> [start_sector, end_sector]
> [0, 8388607]
>     [0, 8388606], then dm-thinp splits it to 2 bios
>         [0, 8388479]
>         [8388480, 8388606] ---> this has problem in process_discard_bio(),
>                                 because the discard size(7 sectors) covers less than a block(128 sectors)
>     [8388607, 8388607] ---> same problem 
> 
> 2. Ensure bi_size not overflow and max discard size is of proper granularity

Ideally we'd get upper layers out of the business of knowing about the queue
limits at all - that was the point of the patch series, after all.

Instead of using UINT_MAX, would it work to just make the max 1 << 31 sectors?

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Ming Lin Aug. 8, 2015, 5:17 a.m. UTC | #2
On Fri, Aug 7, 2015 at 5:30 PM, Kent Overstreet
<kent.overstreet@gmail.com> wrote:
> On Fri, Aug 07, 2015 at 04:40:06PM -0700, Ming Lin wrote:
>>
>> On Fri, 2015-08-07 at 09:30 +0200, Christoph Hellwig wrote:
>> > I'm for solution 3:
>> >
>> >  - keep blk_bio_{discard,write_same}_split, but ensure we never built
>> >    a > 4GB bio in blkdev_issue_{discard,write_same}.
>>
>> This has problem as I mentioned in solution 1.
>> We need to also make sure max discard size is of proper granularity.
>> See below example.
>>
>>       4G: 8388608 sectors
>> UINT_MAX: 8388607 sectors
>>
>> dm-thinp block size = default discard granularity = 128 sectors
>>
>> blkdev_issue_discard(sector=0, nr_sectors=8388608)
>>
>> 1. Only ensure bi_size not overflow
>>
>> It doesn't work.
>>
>> [start_sector, end_sector]
>> [0, 8388607]
>>     [0, 8388606], then dm-thinp splits it to 2 bios
>>         [0, 8388479]
>>         [8388480, 8388606] ---> this has problem in process_discard_bio(),
>>                                 because the discard size(7 sectors) covers less than a block(128 sectors)
>>     [8388607, 8388607] ---> same problem
>>
>> 2. Ensure bi_size not overflow and max discard size is of proper granularity
>
> Ideally we'd get upper layers out of the business of knowing about the queue
> limits at all - that was the point of the patch series, after all.
>
> Instead of using UINT_MAX, would it work to just make the max 1 << 31 sectors?'

1 << 31 = 2G bytes = 0x400000 sectors.

Yes, that works as long as it's multiple of granularity.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Kent Overstreet Aug. 8, 2015, 5:22 a.m. UTC | #3
On Fri, Aug 07, 2015 at 10:17:43PM -0700, Ming Lin wrote:
> On Fri, Aug 7, 2015 at 5:30 PM, Kent Overstreet
> > Ideally we'd get upper layers out of the business of knowing about the queue
> > limits at all - that was the point of the patch series, after all.
> >
> > Instead of using UINT_MAX, would it work to just make the max 1 << 31 sectors?'
> 
> 1 << 31 = 2G bytes = 0x400000 sectors.
> 
> Yes, that works as long as it's multiple of granularity.

Is granularity required to be a power of two? One would hope, but looking at the
code that doesn't appear to be a requirement... ugh, that's terrible...

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Hannes Reinecke Aug. 8, 2015, 8:52 a.m. UTC | #4
On 08/08/2015 01:40 AM, Ming Lin wrote:
> 
> On Fri, 2015-08-07 at 09:30 +0200, Christoph Hellwig wrote:
>> I'm for solution 3:
>>
>>  - keep blk_bio_{discard,write_same}_split, but ensure we never built
>>    a > 4GB bio in blkdev_issue_{discard,write_same}.
> 
> This has problem as I mentioned in solution 1.
> We need to also make sure max discard size is of proper granularity.
> See below example.
> 
>       4G: 8388608 sectors
> UINT_MAX: 8388607 sectors
> 
> dm-thinp block size = default discard granularity = 128 sectors
> 
> blkdev_issue_discard(sector=0, nr_sectors=8388608)
> 
> 1. Only ensure bi_size not overflow
> 
> It doesn't work.
> 
> [start_sector, end_sector]
> [0, 8388607]
>     [0, 8388606], then dm-thinp splits it to 2 bios
>         [0, 8388479]
>         [8388480, 8388606] ---> this has problem in process_discard_bio(),
>                                 because the discard size(7 sectors) covers less than a block(128 sectors)
>     [8388607, 8388607] ---> same problem 
> 
> 2. Ensure bi_size not overflow and max discard size is of proper granularity
> 
> It works.
> 
> [start_sector, end_sector]
> [0, 8388607]
>     [0, 8388479]
>     [8388480, 8388607]
> 
> 
> So how about below patch?
> 
> commit 1ca2ad977255efb3c339f4ca16fb798ed5ec54f7
> Author: Ming Lin <ming.l@ssi.samsung.com>
> Date:   Fri Aug 7 15:07:07 2015 -0700
> 
>     block: remove split code in blkdev_issue_{discard,write_same}
>     
>     The split code in blkdev_issue_{discard,write_same} can go away
>     now that any driver that cares does the split. We have to make
>     sure bio size doesn't overflow.
>     
>     For discard, we ensure max_discard_sectors is of the proper
>     granularity. So if discard size > 4G, blkdev_issue_discard() always
>     send multiple granularity requests to lower level, except that the
>     last one may be not multiple granularity.
>     
>     Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
> ---
>  block/blk-lib.c | 37 +++++++++----------------------------
>  1 file changed, 9 insertions(+), 28 deletions(-)
> 
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index 7688ee3..e178a07 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -44,7 +44,6 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>  	struct request_queue *q = bdev_get_queue(bdev);
>  	int type = REQ_WRITE | REQ_DISCARD;
>  	unsigned int max_discard_sectors, granularity;
> -	int alignment;
>  	struct bio_batch bb;
>  	struct bio *bio;
>  	int ret = 0;
> @@ -58,18 +57,15 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>  
>  	/* Zero-sector (unknown) and one-sector granularities are the same.  */
>  	granularity = max(q->limits.discard_granularity >> 9, 1U);
> -	alignment = (bdev_discard_alignment(bdev) >> 9) % granularity;
>  
>  	/*
> -	 * Ensure that max_discard_sectors is of the proper
> -	 * granularity, so that requests stay aligned after a split.
> -	 */
> -	max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9);
> +	 * Ensure that max_discard_sectors doesn't overflow bi_size and is of
> +	 * the proper granularity. So if discard size > 4G, blkdev_issue_discard()
> +	 * always split and send multiple granularity requests to lower level,
> +	 * except that the last one may be not multiple granularity.
> +         */
> +	max_discard_sectors = UINT_MAX >> 9;
>  	max_discard_sectors -= max_discard_sectors % granularity;
> -	if (unlikely(!max_discard_sectors)) {
> -		/* Avoid infinite loop below. Being cautious never hurts. */
> -		return -EOPNOTSUPP;
> -	}
>  
>  	if (flags & BLKDEV_DISCARD_SECURE) {
>  		if (!blk_queue_secdiscard(q))
> @@ -84,7 +80,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>  	blk_start_plug(&plug);
>  	while (nr_sects) {
>  		unsigned int req_sects;
> -		sector_t end_sect, tmp;
> +		sector_t end_sect;
>  
>  		bio = bio_alloc(gfp_mask, 1);
>  		if (!bio) {
> @@ -93,20 +89,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>  		}
>  
>  		req_sects = min_t(sector_t, nr_sects, max_discard_sectors);
> -
> -		/*
> -		 * If splitting a request, and the next starting sector would be
> -		 * misaligned, stop the discard at the previous aligned sector.
> -		 */
>  		end_sect = sector + req_sects;
> -		tmp = end_sect;
> -		if (req_sects < nr_sects &&
> -		    sector_div(tmp, granularity) != alignment) {
> -			end_sect = end_sect - alignment;
> -			sector_div(end_sect, granularity);
> -			end_sect = end_sect * granularity + alignment;
> -			req_sects = end_sect - sector;
> -		}
>  
>  		bio->bi_iter.bi_sector = sector;
>  		bio->bi_end_io = bio_batch_end_io;
> @@ -166,10 +149,8 @@ int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
>  	if (!q)
>  		return -ENXIO;
>  
> -	max_write_same_sectors = q->limits.max_write_same_sectors;
> -
> -	if (max_write_same_sectors == 0)
> -		return -EOPNOTSUPP;
> +	/* Ensure that max_write_same_sectors doesn't overflow bi_size */
> +	max_write_same_sectors = UINT_MAX >> 9;
>  
>  	atomic_set(&bb.done, 1);
>  	bb.flags = 1 << BIO_UPTODATE;
> 
Wouldn't it be easier to move both max_write_same_sectors and
max_discard sectors to 64 bit (ie to type sector_t) and be done with the
overflow?
Seems to me this is far too much coding around self-imposed restrictions...

Cheers,

Hannes
Kent Overstreet Aug. 8, 2015, 9:02 a.m. UTC | #5
On Sat, Aug 08, 2015 at 10:52:24AM +0200, Hannes Reinecke wrote:
> Wouldn't it be easier to move both max_write_same_sectors and
> max_discard sectors to 64 bit (ie to type sector_t) and be done with the
> overflow?
> Seems to me this is far too much coding around self-imposed restrictions...

It's bio->bi_iter.bi_size that would have to be increased to 64 bits. Which I
suppose wouldn't actually increase the size of struct bio (when sector_t is 64
bits), since struct bvec_iter has padding right now...

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Christoph Hellwig Aug. 8, 2015, 12:35 p.m. UTC | #6
On Fri, Aug 07, 2015 at 09:22:26PM -0800, Kent Overstreet wrote:
> Is granularity required to be a power of two? One would hope, but looking at the
> code that doesn't appear to be a requirement... ugh, that's terrible...

If devices have an odd granularity we'll just have to split more than
nessecary.  Let's hope no major hardware does that.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Hannes Reinecke Aug. 13, 2015, 6:04 a.m. UTC | #7
On 08/08/2015 11:02 AM, Kent Overstreet wrote:
> On Sat, Aug 08, 2015 at 10:52:24AM +0200, Hannes Reinecke wrote:
>> Wouldn't it be easier to move both max_write_same_sectors and
>> max_discard sectors to 64 bit (ie to type sector_t) and be done with the
>> overflow?
>> Seems to me this is far too much coding around self-imposed restrictions...
> 
> It's bio->bi_iter.bi_size that would have to be increased to 64 bits. Which I
> suppose wouldn't actually increase the size of struct bio (when sector_t is 64
> bits), since struct bvec_iter has padding right now...
> 
Which I guess we should be doing anyway.
Devices are getting larger and larger, so in the long run in need to
be moved to 64 bits.

Cheers,

Hannes
diff mbox

Patch

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 7688ee3..e178a07 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -44,7 +44,6 @@  int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 	struct request_queue *q = bdev_get_queue(bdev);
 	int type = REQ_WRITE | REQ_DISCARD;
 	unsigned int max_discard_sectors, granularity;
-	int alignment;
 	struct bio_batch bb;
 	struct bio *bio;
 	int ret = 0;
@@ -58,18 +57,15 @@  int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 
 	/* Zero-sector (unknown) and one-sector granularities are the same.  */
 	granularity = max(q->limits.discard_granularity >> 9, 1U);
-	alignment = (bdev_discard_alignment(bdev) >> 9) % granularity;
 
 	/*
-	 * Ensure that max_discard_sectors is of the proper
-	 * granularity, so that requests stay aligned after a split.
-	 */
-	max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9);
+	 * Ensure that max_discard_sectors doesn't overflow bi_size and is of
+	 * the proper granularity. So if discard size > 4G, blkdev_issue_discard()
+	 * always split and send multiple granularity requests to lower level,
+	 * except that the last one may be not multiple granularity.
+         */
+	max_discard_sectors = UINT_MAX >> 9;
 	max_discard_sectors -= max_discard_sectors % granularity;
-	if (unlikely(!max_discard_sectors)) {
-		/* Avoid infinite loop below. Being cautious never hurts. */
-		return -EOPNOTSUPP;
-	}
 
 	if (flags & BLKDEV_DISCARD_SECURE) {
 		if (!blk_queue_secdiscard(q))
@@ -84,7 +80,7 @@  int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 	blk_start_plug(&plug);
 	while (nr_sects) {
 		unsigned int req_sects;
-		sector_t end_sect, tmp;
+		sector_t end_sect;
 
 		bio = bio_alloc(gfp_mask, 1);
 		if (!bio) {
@@ -93,20 +89,7 @@  int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		}
 
 		req_sects = min_t(sector_t, nr_sects, max_discard_sectors);
-
-		/*
-		 * If splitting a request, and the next starting sector would be
-		 * misaligned, stop the discard at the previous aligned sector.
-		 */
 		end_sect = sector + req_sects;
-		tmp = end_sect;
-		if (req_sects < nr_sects &&
-		    sector_div(tmp, granularity) != alignment) {
-			end_sect = end_sect - alignment;
-			sector_div(end_sect, granularity);
-			end_sect = end_sect * granularity + alignment;
-			req_sects = end_sect - sector;
-		}
 
 		bio->bi_iter.bi_sector = sector;
 		bio->bi_end_io = bio_batch_end_io;
@@ -166,10 +149,8 @@  int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
 	if (!q)
 		return -ENXIO;
 
-	max_write_same_sectors = q->limits.max_write_same_sectors;
-
-	if (max_write_same_sectors == 0)
-		return -EOPNOTSUPP;
+	/* Ensure that max_write_same_sectors doesn't overflow bi_size */
+	max_write_same_sectors = UINT_MAX >> 9;
 
 	atomic_set(&bb.done, 1);
 	bb.flags = 1 << BIO_UPTODATE;