
[v5,01/11] block: make generic_make_request handle arbitrarily sized bios

Message ID 1439099990.7880.0.camel@hasee (mailing list archive)
State Accepted, archived
Delegated to: Mike Snitzer

Commit Message

Ming Lin Aug. 9, 2015, 5:59 a.m. UTC
On Sat, 2015-08-08 at 12:19 -0400, Martin K. Petersen wrote:
> >>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:
> 
> Mike> This will translate to all intermediate layers that might split
> Mike> discards needing to worry about granularity/alignment too
> Mike> (e.g. how dm-thinp will have to care because it must generate
> Mike> discard mappings with associated bios based on how blocks were
> Mike> mapped to thinp).
> 
> The fundamental issue here is that alignment and granularity should
> never, ever have been enforced at the top of the stack. Horrendous idea
> from the very beginning.
> 
> For the < handful of braindead devices that get confused when you do
> partial or misaligned blocks we should have had a quirk that did any
> range adjusting at the bottom in sd_setup_discard_cmnd().
> 
> There's a reason I turned discard_zeroes_data off for UNMAP!
> 
> Wrt. the range size I don't have a problem with capping at the 32-bit
> bi_size limit. We probably don't want to send commands much bigger than
> that anyway.

How about below?

commit b8ca440bd77653d4d2bac90b7fd1599e9e0e150a
Author: Ming Lin <ming.l@ssi.samsung.com>
Date:   Fri Aug 7 15:07:07 2015 -0700

    block: remove split code in blkdev_issue_{discard,write_same}
    
    The split code in blkdev_issue_{discard,write_same} can go away
    now that any driver that cares does the split. We have to make
    sure bio size doesn't overflow.
    
    For discard, we set max discard sectors to (1<<31)>>9 to ensure
    it doesn't overflow bi_size and hopefully it is of the proper
    granularity as long as the granularity is a power of two.
    
    Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
---
 block/blk-lib.c | 47 +++++++++++------------------------------------
 1 file changed, 11 insertions(+), 36 deletions(-)



--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Comments

Christoph Hellwig Aug. 9, 2015, 6:41 a.m. UTC | #1
On Sat, Aug 08, 2015 at 10:59:50PM -0700, Ming Lin wrote:
> +/*
> + * Ensure that max discard sectors doesn't overflow bi_size and hopefully
> + * it is of the proper granularity as long as the granularity is a power
> + * of two.
> + */
> +#define MAX_DISCARD_SECTORS ((1U << 31) >> 9)

This isn't MAX_DISCARD_SECTORS, it's MAX_BIO_SECTORS.  If we ever
do something like Kent's multipage biovecs we'll actually need it for
regular read/write bios in addition to discard and write same.

Except for that the patch looks reasonable to me.

Ming Lin Aug. 9, 2015, 6:55 a.m. UTC | #2
On Sat, 2015-08-08 at 23:41 -0700, Christoph Hellwig wrote:
> On Sat, Aug 08, 2015 at 10:59:50PM -0700, Ming Lin wrote:
> > +/*
> > + * Ensure that max discard sectors doesn't overflow bi_size and hopefully
> > + * it is of the proper granularity as long as the granularity is a power
> > + * of two.
> > + */
> > +#define MAX_DISCARD_SECTORS ((1U << 31) >> 9)
> 
> This isn't MAX_DISCARD_SECTORS, it's MAX_BIO_SECTORS.  If we ever
> do something like Kent's multipage biovecs we'll actually need it for
> regular read/write bios in addition to discard and write same.
> 
> Except for that the patch looks reasonable to me.

Will change it to MAX_BIO_SECTORS.
May I add your ACK?


Christoph Hellwig Aug. 9, 2015, 7:01 a.m. UTC | #3
On Sat, Aug 08, 2015 at 11:55:47PM -0700, Ming Lin wrote:
> Will change it to MAX_BIO_SECTORS.
> May I add your ACK?

Yes, please go ahead.

Ming Lin Aug. 9, 2015, 7:18 a.m. UTC | #4
On Sun, 2015-08-09 at 00:01 -0700, Christoph Hellwig wrote:
> On Sat, Aug 08, 2015 at 11:55:47PM -0700, Ming Lin wrote:
> > Will change it to MAX_BIO_SECTORS.
> > May I add your ACK?
> 
> Yes, please go ahead.

Thanks. I'll send a new version of the series once the device-mapper guy
acks.

Hi Mike,

I have updated my tree. Could you pull and re-test?
https://git.kernel.org/cgit/linux/kernel/git/mlin/linux.git/log/?h=block-generic-req

The 2 thin-provisioning tests passed.
Hope I can have your ACK soon.

Thanks.


Mike Snitzer Aug. 10, 2015, 3:02 p.m. UTC | #5
On Sun, Aug 09 2015 at  3:18am -0400,
Ming Lin <mlin@kernel.org> wrote:

> On Sun, 2015-08-09 at 00:01 -0700, Christoph Hellwig wrote:
> > On Sat, Aug 08, 2015 at 11:55:47PM -0700, Ming Lin wrote:
> > > Will change it to MAX_BIO_SECTORS.
> > > May I add your ACK?
> > 
> > Yes, please go ahead.
> 
> Thanks. I'll send a new version of the series once device-mapper guy
> acks.
> 
> Hi Mike,
> 
> I have updated my tree. Could you pull and re-test?
> https://git.kernel.org/cgit/linux/kernel/git/mlin/linux.git/log/?h=block-generic-req
> 
> The 2 thin-provisioning tests passed.

I've merged your latest branch with my dm-4.3 branch, I had one conflict
in the merge due to the dm_merge_bvec() change from 4.2-rc6 (no
surprise).  I've published the result here:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=block-late-bio-splitting

It passes the device-mapper-test-suite's 'thin-provisioning' tests.

> Hope I can have your ACK soon.

Shouldn't we also be using MAX_BIO_SECTORS in blkdev_issue_write_same
(instead of UINT_MAX >> 9)?

Aside from that, I'm in favor of seeing this late bio splitting patchset
finally land upstream (hopefully in time for the 4.3 merge, Jens?):

Acked-by: Mike Snitzer <snitzer@redhat.com>

p.s. I'll be working with Joe Thornber on optimizing DM (particularly
dm-thinp and dm-cache) once this patchset is included upstream.  You'll
see I've already added a couple WIP dm-thinp patches ontop.

Ming Lin Aug. 10, 2015, 4:14 p.m. UTC | #6
On Mon, 2015-08-10 at 11:02 -0400, Mike Snitzer wrote:
> On Sun, Aug 09 2015 at  3:18am -0400,
> Ming Lin <mlin@kernel.org> wrote:
> 
> > On Sun, 2015-08-09 at 00:01 -0700, Christoph Hellwig wrote:
> > > On Sat, Aug 08, 2015 at 11:55:47PM -0700, Ming Lin wrote:
> > > > Will change it to MAX_BIO_SECTORS.
> > > > May I add your ACK?
> > > 
> > > Yes, please go ahead.
> > 
> > Thanks. I'll send a new version of the series once device-mapper guy
> > acks.
> > 
> > Hi Mike,
> > 
> > I have updated my tree. Could you pull and re-test?
> > https://git.kernel.org/cgit/linux/kernel/git/mlin/linux.git/log/?h=block-generic-req
> > 
> > The 2 thin-provisioning tests passed.
> 
> I've merged your latest branch with my dm-4.3 branch, I had one conflict
> in the merge due to the dm_merge_bvec() change from 4.2-rc6 (no
> surprise).  I've published the result here:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=block-late-bio-splitting
> 
> It passes the device-mapper-test-suite's 'thin-provisioning' tests.
> 
> > Hope I can have your ACK soon.
> 
> Shouldn't we also be using MAX_BIO_SECTORS in blkdev_issue_write_same
> (instead of UINT_MAX >> 9)?

I also prefer using MAX_BIO_SECTORS.
Otherwise, we may have non-page-size-aligned splits.

Say, write_same 8G.

Using UINT_MAX >> 9, we'll have 2 sector aligned splits in
blkdev_issue_write_same():
0 - (4G - 512 - 1)
(4G - 512, 8G -1)

This looks weired.

Using MAX_BIO_SECTORS, we'll have 4 page size aligned splits:
0 - (2G -1)
2G - (4G - 1)
4G - (6G - 1)
6G - (8G - 1)

I'll use MAX_BIO_SECTORS in blkdev_issue_write_same() if no objection.

> 
> Aside from that, I'm in favor of seeing this late bio splitting patchset
> finally land upstream (hopefully in time for the 4.3 merge, Jens?):
> 
> Acked-by: Mike Snitzer <snitzer@redhat.com>

Thanks!

May I add your Ack to "PATCH: block: kill merge_bvec_fn() completely"
also?

> 
> p.s. I'll be working with Joe Thornber on optimizing DM (particularly
> dm-thinp and dm-cache) once this patchset is included upstream.  You'll
> see I've already added a couple WIP dm-thinp patches ontop.


Ming Lin Aug. 10, 2015, 4:18 p.m. UTC | #7
On Mon, 2015-08-10 at 09:14 -0700, Ming Lin wrote:
> On Mon, 2015-08-10 at 11:02 -0400, Mike Snitzer wrote:
> > On Sun, Aug 09 2015 at  3:18am -0400,
> > Ming Lin <mlin@kernel.org> wrote:
> > 
> > > On Sun, 2015-08-09 at 00:01 -0700, Christoph Hellwig wrote:
> > > > On Sat, Aug 08, 2015 at 11:55:47PM -0700, Ming Lin wrote:
> > > > > Will change it to MAX_BIO_SECTORS.
> > > > > May I add your ACK?
> > > > 
> > > > Yes, please go ahead.
> > > 
> > > Thanks. I'll send a new version of the series once device-mapper guy
> > > acks.
> > > 
> > > Hi Mike,
> > > 
> > > I have updated my tree. Could you pull and re-test?
> > > https://git.kernel.org/cgit/linux/kernel/git/mlin/linux.git/log/?h=block-generic-req
> > > 
> > > The 2 thin-provisioning tests passed.
> > 
> > I've merged your latest branch with my dm-4.3 branch, I had one conflict
> > in the merge due to the dm_merge_bvec() change from 4.2-rc6 (no
> > surprise).  I've published the result here:
> > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=block-late-bio-splitting
> > 
> > It passes the device-mapper-test-suite's 'thin-provisioning' tests.
> > 
> > > Hope I can have your ACK soon.
> > 
> > Shouldn't we also be using MAX_BIO_SECTORS in blkdev_issue_write_same
> > (instead of UINT_MAX >> 9)?
> 
> I also prefer using MAX_BIO_SECTORS.
> Otherwise, we may have non-page-size-aligned splits.
> 
> Say, write_same 8G.
> 
> Using UINT_MAX >> 9, we'll have 2 sector aligned splits in
> blkdev_issue_write_same():
> 0 - (4G - 512 - 1)
> (4G - 512, 8G -1)

Actually, 3 sector aligned splits.

0 - (4G-512-1)
(4G-512), (8G-512-1)
(8G-512), (8G-1)

> 
> This looks weird.
> 
> Using MAX_BIO_SECTORS, we'll have 4 page size aligned splits:
> 0 - (2G -1)
> 2G - (4G - 1)
> 4G - (6G - 1)
> 6G - (8G - 1)
> 
> I'll use MAX_BIO_SECTORS in blkdev_issue_write_same() if no objection.
> 
> > 
> > Aside from that, I'm in favor of seeing this late bio splitting patchset
> > finally land upstream (hopefully in time for the 4.3 merge, Jens?):
> > 
> > Acked-by: Mike Snitzer <snitzer@redhat.com>
> 
> Thanks!
> 
> May I add your Ack to "PATCH: block: kill merge_bvec_fn() completely"
> also?
> 
> > 
> > p.s. I'll be working with Joe Thornber on optimizing DM (particularly
> > dm-thinp and dm-cache) once this patchset is included upstream.  You'll
> > see I've already added a couple WIP dm-thinp patches ontop.
> 


Martin K. Petersen Aug. 10, 2015, 4:22 p.m. UTC | #8
>>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:

Mike> Shouldn't we also be using MAX_BIO_SECTORS in
Mike> blkdev_issue_write_same (instead of UINT_MAX >> 9)?

The granularity of WRITE SAME is always 1 logical block and there is no
reason to round it down to the next power of two.

+/*
+ * Ensure that max discard sectors doesn't overflow bi_size and hopefully
+ * it is of the proper granularity as long as the granularity is a power
+ * of two.
+ */
+#define MAX_DISCARD_SECTORS ((1U << 31) >> 9)
+

That's fine for SATA since we're already capping at 2TB minus change.
But it means we'll be capping unnecessarily on SCSI. And larger range
counts are impending in SATA as well.

So this goes back to my original comment: The only place there can be a
discard granularity is for SCSI UNMAP. And consequently, we should only
enforce alignment and granularity when that code path is taken in sd.c.

I'm OK with Ming's patch series in general. Let's leave the discard cap
at UINT_MAX and I'll twiddle the rest in SCSI.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Martin K. Petersen Aug. 10, 2015, 4:40 p.m. UTC | #9
>>>>> "Ming" == Ming Lin <mlin@kernel.org> writes:

Ming,

Ming> I also prefer using MAX_BIO_SECTORS.  Otherwise, we may have non
Ming> page size aligned splits.

This does not matter for write same and discard since there is only a
single logical block of payload. Also, given limitations in SATA we're
always issuing 2GB-32KB sized discards. Rounding those down to an even
1GB would impact performance.

I am sympathetic to wanting to issue I/Os that are aligned to powers of
two. But for most devices this matters little since the additional cost
is limited to misaligned head and tail blocks.

One thing that may be worth considering is switching bi_size from bytes
to blocks for REQ_FS. That would give us some headroom without
increasing bi_size beyond 32 bits. But I'm not entirely sure it's worth
the pain.
Mike Snitzer Aug. 10, 2015, 6:13 p.m. UTC | #10
On Mon, Aug 10 2015 at 12:14pm -0400,
Ming Lin <mlin@kernel.org> wrote:

> On Mon, 2015-08-10 at 11:02 -0400, Mike Snitzer wrote:
> > 
> > Aside from that, I'm in favor of seeing this late bio splitting patchset
> > finally land upstream (hopefully in time for the 4.3 merge, Jens?):
> > 
> > Acked-by: Mike Snitzer <snitzer@redhat.com>
> 
> Thanks!
> 
> May I add your Ack to "PATCH: block: kill merge_bvec_fn() completely"
> also?

Sure, but please fold in the removal of dm.c comments I made in this
merge commit:

http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=block-late-bio-splitting&id=d6df875bb65ef1ee10c91cf09cb58d009286321f

thanks.

Ming Lin Aug. 10, 2015, 6:18 p.m. UTC | #11
On Mon, 2015-08-10 at 12:22 -0400, Martin K. Petersen wrote:
> >>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:
> 
> Mike> Shouldn't we also be using MAX_BIO_SECTORS in
> Mike> blkdev_issue_write_same (instead of UINT_MAX >> 9)?
> 
> The granularity of WRITE SAME is always 1 logical block and there is no
> reason to round it down to the next power of two.
> 
> +/*
> + * Ensure that max discard sectors doesn't overflow bi_size and hopefully
> + * it is of the proper granularity as long as the granularity is a power
> + * of two.
> + */
> +#define MAX_DISCARD_SECTORS ((1U << 31) >> 9)
> +
> 
> That's fine for SATA since we're already capping at 2TB minus change.
> But it means we'll be capping unnecessarily on SCSI. And larger range
> counts are impending in SATA as well.
> 
> So this goes back to my original comment: The only place there can be a
> discard granularity is for SCSI UNMAP. And consequently, we should only
> enforce alignment and granularity when that code path is taken in sd.c.
> 
> I'm OK with Ming's patch series in general. Let's leave the discard cap
> at UINT_MAX and I'll twiddle the rest in SCSI.

Just to make sure I didn't misunderstand it.

Did you mean still use (UINT_MAX >> 9) in blkdev_issue_discard()?

	req_sects = min_t(sector_t, nr_sects, UINT_MAX >> 9);
instead of:
	req_sects = min_t(sector_t, nr_sects, MAX_DISCARD_SECTORS);

But that doesn't work for dm-thinp. See Kent's suggestion to use 1<<31.
https://www.redhat.com/archives/dm-devel/2015-August/msg00053.html

> 
> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
> 

Thanks!


Ming Lin Aug. 10, 2015, 10:30 p.m. UTC | #12
On Mon, Aug 10, 2015 at 11:13 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> On Mon, Aug 10 2015 at 12:14pm -0400,
> Ming Lin <mlin@kernel.org> wrote:
>
>> On Mon, 2015-08-10 at 11:02 -0400, Mike Snitzer wrote:
>> >
>> > Aside from that, I'm in favor of seeing this late bio splitting patchset
>> > finally land upstream (hopefully in time for the 4.3 merge, Jens?):
>> >
>> > Acked-by: Mike Snitzer <snitzer@redhat.com>
>>
>> Thanks!
>>
>> May I add your Ack to "PATCH: block: kill merge_bvec_fn() completely"
>> also?
>
> Sure, but please fold in the removal of dm.c comments I made in this
> merge commit:
>
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=block-late-bio-splitting&id=d6df875bb65ef1ee10c91cf09cb58d009286321f

Rebased on top of the latest Linus tree (4.2-rc6+).

https://git.kernel.org/cgit/linux/kernel/git/mlin/linux.git/diff/drivers/md/dm.c?h=block-generic-req&id=3dd8509a3f05152cca83bc41c37a8ba3e9119736

>
> thanks.

Martin K. Petersen Aug. 11, 2015, 2 a.m. UTC | #13
>>>>> "Ming" == Ming Lin <mlin@kernel.org> writes:

Ming> Did you mean still use (UINT_MAX >> 9) in blkdev_issue_discard()?

Ming> But that doesn't work for dm-thinp. See Kent's suggestion to use
Ming> 1<<31.

I'm not sure why things are not working for dm-thinp. Presumably Kent's
code would split the discard at a granularity boundary so why would that
cause problems for dm?

In looking at this I just found out that we'll corrupt data on certain
SCSI configs with the granularity enforcement in place. I'll have to
conjure up a fix for that...
Mike Snitzer Aug. 11, 2015, 2:41 a.m. UTC | #14
On Mon, Aug 10 2015 at 10:00pm -0400,
Martin K. Petersen <martin.petersen@oracle.com> wrote:

> >>>>> "Ming" == Ming Lin <mlin@kernel.org> writes:
> 
> Ming> Did you mean still use (UINT_MAX >> 9) in blkdev_issue_discard()?
> 
> Ming> But that doesn't work for dm-thinp. See Kent's suggestion to use
> Ming> 1<<31.
> 
> I'm not sure why things are not working for dm-thinp. Presumably Kent's
> code would split the discard at a granularity boundary so why would that
> cause problems for dm?

DM-thinp processes discards internally before it passes them down (if
configured to do so).  If a discard is smaller than the granularity of a
thinp block (whose size is configurable) or if the start and end of the
discard's extent is misaligned (relative to the thinp blocks mapped to
the logical extent) then the discard won't actually discard partial
thinp blocks.

Kent Overstreet Aug. 11, 2015, 3:38 a.m. UTC | #15
On Mon, Aug 10, 2015 at 10:41:55PM -0400, Mike Snitzer wrote:
> On Mon, Aug 10 2015 at 10:00pm -0400,
> Martin K. Petersen <martin.petersen@oracle.com> wrote:
> 
> > >>>>> "Ming" == Ming Lin <mlin@kernel.org> writes:
> > 
> > Ming> Did you mean still use (UINT_MAX >> 9) in blkdev_issue_discard()?
> > 
> > Ming> But that doesn't work for dm-thinp. See Kent's suggestion to use
> > Ming> 1<<31.
> > 
> > I'm not sure why things are not working for dm-thinp. Presumably Kent's
> > code would split the discard at a granularity boundary so why would that
> > cause problems for dm?
> 
> DM-thinp processes discards internally before it passes them down (if
> configured to do so).  If a discard is smaller than the granularity of a
> thinp block (whose size is configurable) or if the start and end of the
> discard's extent is misaligned (relative to the thinp blocks mapped to
> the logical extent) then the discard won't actually discard partial
> thinp blocks.

This kind of logic really doesn't belong in dm - if it's needed, it really
belongs in bio_split() (which is supposed to work correctly for discards - so if
it is needed, then bio_split() needs fixing...)

IMO though it belongs in the driver - if a discard needs to be dropped because
it's too small and the hardware can't do it, that should be the driver's
responsibility.

Mike Snitzer Aug. 11, 2015, 2:08 p.m. UTC | #16
On Mon, Aug 10 2015 at 11:38pm -0400,
Kent Overstreet <kent.overstreet@gmail.com> wrote:

> On Mon, Aug 10, 2015 at 10:41:55PM -0400, Mike Snitzer wrote:
> > On Mon, Aug 10 2015 at 10:00pm -0400,
> > Martin K. Petersen <martin.petersen@oracle.com> wrote:
> > 
> > > >>>>> "Ming" == Ming Lin <mlin@kernel.org> writes:
> > > 
> > > Ming> Did you mean still use (UINT_MAX >> 9) in blkdev_issue_discard()?
> > > 
> > > Ming> But that doesn't work for dm-thinp. See Kent's suggestion to use
> > > Ming> 1<<31.
> > > 
> > > I'm not sure why things are not working for dm-thinp. Presumably Kent's
> > > code would split the discard at a granularity boundary so why would that
> > > cause problems for dm?
> > 
> > DM-thinp processes discards internally before it passes them down (if
> > configured to do so).  If a discard is smaller than the granularity of a
> > thinp block (whose size is configurable) or if the start and end of the
> > discard's extent is misaligned (relative to the thinp blocks mapped to
> > the logical extent) then the discard won't actually discard partial
> > thinp blocks.
> 
> This kind of logic really doesn't belong in dm - if it's needed, it really
> belongs in bio_split() (which is supposed to work correctly for discards - so if
> it is needed, then bio_split() needs fixing...)

DM thinp does advertise discard_granularity that reflects the thinp blocksize.
blk_queue_split() does look like it'd do the right thing.  But the
splitting that DM thinp is doing is a long standing implementation (in
DM core) that will need to be carefully reviewed/rewritten.  We can
tackle it after all this late splitting code lands.

> IMO though it belongs in the driver - if a discard needs to be dropped because
> it's too small and the hardware can't do it, that should be the driver's
> responsibility.

This isn't about the hardware's limits.  This is about the intermediate
remapping/stacking driver's own space management hooking off of the
discard bio too.

Martin K. Petersen Aug. 11, 2015, 5:36 p.m. UTC | #17
>>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:

Mike> DM-thinp processes discards internally before it passes them down
Mike> (if configured to do so).  If a discard is smaller than the
Mike> granularity of a thinp block (whose size is configurable) or if
Mike> the start and end of the discard's extent is misaligned (relative
Mike> to the thinp blocks mapped to the logical extent) then the discard
Mike> won't actually discard partial thinp blocks.

That's fine. You can throw away anything you don't like as long as
discard_zeroes_data=0.

But I don't understand why having an artificial cap at 2GB fixes
things. Other than making it less likely for you to receive a runt by
virtue of being aligned to a power of two.
Mike Snitzer Aug. 11, 2015, 5:47 p.m. UTC | #18
On Tue, Aug 11 2015 at  1:36pm -0400,
Martin K. Petersen <martin.petersen@oracle.com> wrote:

> >>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:
> 
> Mike> DM-thinp processes discards internally before it passes them down
> Mike> (if configured to do so).  If a discard is smaller than the
> Mike> granularity of a thinp block (whose size is configurable) or if
> Mike> the start and end of the discard's extent is misaligned (relative
> Mike> to the thinp blocks mapped to the logical extent) then the discard
> Mike> won't actually discard partial thinp blocks.
> 
> That's fine. You can throw away anything you don't like as long as
> discard_zeroes_data=0.

Correct, dm-thinp sets discard_zeroes_data=0

> But I don't understand why having an artificial cap at 2GB fixes
> things. Other than making it less likely for you to receive a runt by
> virtue of being aligned to a power of two.

That is the benefit.  And when coupled with the new default max_discard
of 64K (pending change from Jens for 4.3) this 2GB upper limit really
isn't such a big deal.  Unless I'm missing something...

Martin K. Petersen Aug. 11, 2015, 5:49 p.m. UTC | #19
>>>>> "Kent" == Kent Overstreet <kent.overstreet@gmail.com> writes:

Kent> This kind of logic really doesn't belong in dm

Well it does in this case since the thinp personality actually
provisions and unprovisions space.

But there is a difference between what dm thinp acts on for its own
internal provisioning purposes and what it passes down the stack. I am
really against dropping information anywhere along the path. We don't
round off read/write requests either.

The queue limits were meant as hints to mkfs.* so that on-disk data
structures could be laid out in an aligned and storage friendly way. I
never intended for the hints to affect runtime behavior.

Kent> IMO though it belongs in the driver - if a discard needs to be
Kent> dropped because it's too small and the hardware can't do it, that
Kent> should be the driver's responsibility.

I agree except I really don't want to lop off anything unless the device
locks up if we send it partial blocks. There was an array that had
problems a while back but I believe they have been fixed.

The fundamental premise should be that we pass as comprehensive
information as we can. And the device can then decide to ignore all or
parts of the request. That's fundamentally how things work at the
protocol level in both SCSI and SATA. I don't see any reason why the
Linux I/O stack should behave in a different manner.
Martin K. Petersen Aug. 11, 2015, 6:01 p.m. UTC | #20
>>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:

Mike> That is the benefit.  And when coupled with the new default
Mike> max_discard of 64K (pending change from Jens for 4.3) this 2GB
Mike> upper limit really isn't such a big deal.  Unless I'm missing
Mike> something...

2GB is fine for current SATA due to the stupid range descriptors. But
there are some changes in the pipeline to support descriptors with
bigger ranges so that will change.

However, we currently do 4GB discards on SCSI so the proposed cap will
cut that in half. I'm OK with that as an interim solution, though.

But I'm a bit concerned about what might be lurking in dm thinp if you
trip over partial blocks like in Ming's example...
Martin K. Petersen Aug. 11, 2015, 6:05 p.m. UTC | #21
>>>>> "Martin" == Martin K Petersen <martin.petersen@oracle.com> writes:

Martin> I agree except I really don't want to lop off anything unless
Martin> the device locks up if we send it partial blocks. There was an
Martin> array that had problems a while back but I believe they have
Martin> been fixed.

Oh, and there are several arrays out there that have allocation units
that are not powers of two. *sigh*
Ming Lin Aug. 11, 2015, 8:56 p.m. UTC | #22
On Tue, 2015-08-11 at 14:05 -0400, Martin K. Petersen wrote:
> >>>>> "Martin" == Martin K Petersen <martin.petersen@oracle.com> writes:
> 
> Martin> I agree except I really don't want to lop off anything unless
> Martin> the device locks up if we send it partial blocks. There was an
> Martin> array that had problems a while back but I believe they have
> Martin> been fixed.
> 
> Oh, and there are several arrays out there that have allocation units
> that are not powers of two. *sigh*

Do you still agree we should cap discard at 2G as an interim solution?
https://git.kernel.org/cgit/linux/kernel/git/mlin/linux.git/commit/?h=block-generic-req&id=39d2f5f

If OK, I'll send out new version asking Jens to apply.


Martin K. Petersen Aug. 12, 2015, 12:24 a.m. UTC | #23
>>>>> "Ming" == Ming Lin <mlin@kernel.org> writes:

Ming> Do you still agree we cap discard to 2G as an interim solution?

I can live with the 2G cap for 4.3 but would like to see it fixed
properly in 4.4.
Ming Lin Aug. 12, 2015, 4:41 a.m. UTC | #24
On Tue, 2015-08-11 at 20:24 -0400, Martin K. Petersen wrote:
> >>>>> "Ming" == Ming Lin <mlin@kernel.org> writes:
> 
> Ming> Do you still agree we cap discard to 2G as an interim solution?
> 
> I can live with the 2G cap for 4.3 but would like to see it fixed
> properly in 4.4.

Sure. I'd like to work on the fix.


Ming Lin Aug. 18, 2015, 5:09 a.m. UTC | #25
On Mon, Aug 10, 2015 at 8:02 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> p.s. I'll be working with Joe Thornber on optimizing DM (particularly
> dm-thinp and dm-cache) once this patchset is included upstream.  You'll
> see I've already added a couple WIP dm-thinp patches ontop.

Hi Mike,

Just to avoid duplicated work.
Are you going to work on the dm-thinp/dm-cache discard rewrite?

Otherwise, I can work on it.

Thanks,
Ming

Ming Lin Aug. 18, 2015, 7:04 a.m. UTC | #26
On Mon, Aug 17, 2015 at 10:09 PM, Ming Lin <mlin@kernel.org> wrote:
> On Mon, Aug 10, 2015 at 8:02 AM, Mike Snitzer <snitzer@redhat.com> wrote:
>> p.s. I'll be working with Joe Thornber on optimizing DM (particularly
>> dm-thinp and dm-cache) once this patchset is included upstream.  You'll
>> see I've already added a couple WIP dm-thinp patches ontop.
>
> Hi Mike,
>
> Just to avoid duplicated work.
> Are you going to work on the dm-thinp/dm-cache discard rewrite?

Seems dm-stripe discard also needs a rewrite.

>
> Otherwise, I can work on it.
>
> Thanks,
> Ming

Mike Snitzer Aug. 18, 2015, 2:45 p.m. UTC | #27
On Tue, Aug 18 2015 at  3:04am -0400,
Ming Lin <mlin@kernel.org> wrote:

> On Mon, Aug 17, 2015 at 10:09 PM, Ming Lin <mlin@kernel.org> wrote:
> > On Mon, Aug 10, 2015 at 8:02 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> >> p.s. I'll be working with Joe Thornber on optimizing DM (particularly
> >> dm-thinp and dm-cache) once this patchset is included upstream.  You'll
> >> see I've already added a couple WIP dm-thinp patches ontop.
> >
> > Hi Mike,
> >
> > Just to avoid duplicated work.
> > Are you going to work on the dm-thinp/dm-cache discard rewrite?
> 
> Seems dm-stripe discard also needs a rewrite.

Can you elaborate on what you feel needs re-writing in these targets?

This is the basic initial cleanup I had in mind for dm-thinp:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=dm-4.4&id=cb0aca0a6bfad6b7f7146dde776f374082a73db6

A much more involved refactoring of the dm-cache and dm-thinp targets to
eliminate the need for splitting will involve bio-prison range locking
and a new metadata format for both targets to express ranges as opposed
to blocks.  This line of work is on Joe's radar but it is much further
out given the associated on-disk metadata format change.

That aside, I do need to look at DM core to see how we can do things
differently so that block core's bio_split() et al is doing the
splitting rather than DM core having a role.

I'd prefer to be the one working these DM changes.  But if you have
ideas of how things should be cleaned up I'd be happy to consider them.

Thanks,
Mike


Patch

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 7688ee3..4859e4b 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -26,6 +26,13 @@  static void bio_batch_end_io(struct bio *bio, int err)
 	bio_put(bio);
 }
 
+/*
+ * Ensure that max discard sectors doesn't overflow bi_size and hopefully
+ * it is of the proper granularity as long as the granularity is a power
+ * of two.
+ */
+#define MAX_DISCARD_SECTORS ((1U << 31) >> 9)
+
 /**
  * blkdev_issue_discard - queue a discard
  * @bdev:	blockdev to issue discard for
@@ -43,8 +50,6 @@  int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 	DECLARE_COMPLETION_ONSTACK(wait);
 	struct request_queue *q = bdev_get_queue(bdev);
 	int type = REQ_WRITE | REQ_DISCARD;
-	unsigned int max_discard_sectors, granularity;
-	int alignment;
 	struct bio_batch bb;
 	struct bio *bio;
 	int ret = 0;
@@ -56,21 +61,6 @@  int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 	if (!blk_queue_discard(q))
 		return -EOPNOTSUPP;
 
-	/* Zero-sector (unknown) and one-sector granularities are the same.  */
-	granularity = max(q->limits.discard_granularity >> 9, 1U);
-	alignment = (bdev_discard_alignment(bdev) >> 9) % granularity;
-
-	/*
-	 * Ensure that max_discard_sectors is of the proper
-	 * granularity, so that requests stay aligned after a split.
-	 */
-	max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9);
-	max_discard_sectors -= max_discard_sectors % granularity;
-	if (unlikely(!max_discard_sectors)) {
-		/* Avoid infinite loop below. Being cautious never hurts. */
-		return -EOPNOTSUPP;
-	}
-
 	if (flags & BLKDEV_DISCARD_SECURE) {
 		if (!blk_queue_secdiscard(q))
 			return -EOPNOTSUPP;
@@ -84,7 +74,7 @@  int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 	blk_start_plug(&plug);
 	while (nr_sects) {
 		unsigned int req_sects;
-		sector_t end_sect, tmp;
+		sector_t end_sect;
 
 		bio = bio_alloc(gfp_mask, 1);
 		if (!bio) {
@@ -92,21 +82,8 @@  int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 			break;
 		}
 
-		req_sects = min_t(sector_t, nr_sects, max_discard_sectors);
-
-		/*
-		 * If splitting a request, and the next starting sector would be
-		 * misaligned, stop the discard at the previous aligned sector.
-		 */
+		req_sects = min_t(sector_t, nr_sects, MAX_DISCARD_SECTORS);
 		end_sect = sector + req_sects;
-		tmp = end_sect;
-		if (req_sects < nr_sects &&
-		    sector_div(tmp, granularity) != alignment) {
-			end_sect = end_sect - alignment;
-			sector_div(end_sect, granularity);
-			end_sect = end_sect * granularity + alignment;
-			req_sects = end_sect - sector;
-		}
 
 		bio->bi_iter.bi_sector = sector;
 		bio->bi_end_io = bio_batch_end_io;
@@ -166,10 +143,8 @@  int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
 	if (!q)
 		return -ENXIO;
 
-	max_write_same_sectors = q->limits.max_write_same_sectors;
-
-	if (max_write_same_sectors == 0)
-		return -EOPNOTSUPP;
+	/* Ensure that max_write_same_sectors doesn't overflow bi_size */
+	max_write_same_sectors = UINT_MAX >> 9;
 
 	atomic_set(&bb.done, 1);
 	bb.flags = 1 << BIO_UPTODATE;