
[linux-4.7-rc7] blk_stack_limits() setting discard_granularity

Message ID: CAHJVUehBe1zKvp==CUs0YkmRt5sT8qOES6bPnaeNKKsfuZ3Zrw@mail.gmail.com

Commit Message

Florian-Ewald Müller July 22, 2016, 2:49 p.m. UTC
blk_stack_limits() can set discard_granularity bigger than
max_discard_sectors (in bytes).
This makes blk_bio_discard_split() non-functional and can lead to data
corruption.
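
For context, blk_bio_discard_split() converts the granularity to
sectors and rounds max_discard_sectors down to a multiple of it before
splitting. A condensed paraphrase of that 4.7-era logic (a sketch, not
verbatim kernel source):

	/*
	 * Granularity-aligned split size in sectors; 0 means the bio
	 * is not split at all.
	 */
	static unsigned int discard_split_limit(unsigned int max_discard_sectors,
						unsigned int granularity_bytes)
	{
		/* zero/unknown granularity is treated as one sector */
		unsigned int granularity = max(granularity_bytes >> 9, 1U);

		return max_discard_sectors - max_discard_sectors % granularity;
	}

With discard_granularity > max_discard_sectors << 9 this evaluates to
0, the splitting code bails out, and the oversized discard bio is
passed straight to the driver.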

Comments

Martin K. Petersen July 22, 2016, 7:34 p.m. UTC | #1
>>>>> "Florian-Ewald" == Florian-Ewald Müller <florian-ewald.mueller@profitbricks.com> writes:

Florian-Ewald,

Florian-Ewald> blk_stack_limits() can set discard_granularity bigger
Florian-Ewald> than max_discard_sectors (in bytes).

discard_granularity must be smaller than max_discard_sectors (in
bytes). The question is which storage device reported these values?
Florian-Ewald Müller July 23, 2016, 11:19 a.m. UTC | #2
Hi Martin,

My setup is a newly developed distributed block device on top of dm-cache.
The dm-cache values are bigger than those needed by my (newly
developed) block device.

Regards,
Florian

Martin K. Petersen July 27, 2016, 1:51 a.m. UTC | #3
>>>>> "Florian-Ewald" == Florian-Ewald Müller <florian-ewald.mueller@profitbricks.com> writes:

Florian-Ewald,

Florian-Ewald> My setup is a newly developed distributed block
Florian-Ewald> device on top of dm-cache.  The dm-cache values are
Florian-Ewald> bigger than those needed by my (newly developed) block
Florian-Ewald> device.

What are the actual values for granularity and max, for your device and
dm-cache respectively?
Florian-Ewald Müller July 27, 2016, 8:13 a.m. UTC | #4
Hi Martin,

My (newly developed) distributed block device uses a block size of 64K.
Because it distributes data across many underlying slave (dm-cache)
block devices, it has:
- max_discard_sectors: 128 (64K)
- discard_granularity: 65536 (64K)
The discard sizes of the (underlying) dm-cache devices are:
- max_discard_sectors: 536870912 (256G)
- discard_granularity: 268435456 (256M)

Regards,
Florian
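
For reference, blk_stack_limits() takes the smaller of the two
max_discard_sectors values but the larger of the two granularities (the
max() is visible in the patch context at the end of this page), so
stacking these devices yields, as illustrative arithmetic only:

	t->max_discard_sectors = min_not_zero(128, 536870912);
					/* = 128 sectors = 64K  */
	t->discard_granularity = max(65536, 268435456);
					/* = 268435456 = 256M   */

	/*
	 * discard_granularity (256M) now exceeds max_discard_sectors
	 * in bytes (128 << 9 = 64K): the broken state described in
	 * the commit message.
	 */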

Martin K. Petersen July 29, 2016, 1:39 a.m. UTC | #5
>>>>> "Florian-Ewald" == Florian-Ewald Müller <florian-ewald.mueller@profitbricks.com> writes:

Florian-Ewald,

> My (newly developed) distributed block device uses a block size of 64K.

> Because it distributes data across many underlying slave (dm-cache)
> block devices, it has:
> - max_discard_sectors: 128 (64K)
> - discard_granularity: 65536 (64K)

> The discard sizes of the (underlying) dm-cache devices are:
> - max_discard_sectors: 536870912 (256G)
> - discard_granularity: 268435456 (256M)

Why do you set max_discard_sectors for your device to 128? To my
knowledge dm-cache does not do partial allocation, and consequently any
discard smaller than an aligned 256MB will be a no-op anyway.
Florian-Ewald Müller July 29, 2016, 7:59 a.m. UTC | #6
Hi Martin,

I'm trying to solve this "squaring the circle" problem:
- my distributed block device "distributes" the data in the configured
block size (e.g. 64K) across many underlying slave block devices
(here: dm-cache).
- my block device handles a discard by logically discarding the
specified block (it never has more than 64K in a row) on some specific
(logical) slave device.
- that's why I have to consider splitting the discard myself (if the
block layer doesn't do it), so that each 64K chunk on a different
(logical) slave device is treated separately.
- on the underlying block device (e.g. dm-cache) the data is also
placed in 64K chunks (there is never more than 64K in a row).
- that's why ignoring this size and sending down the original big
discard would probably delete unrelated data.

Now my experiments show that dm-cache, at least, neither complains
about nor rejects discards smaller than its discard_granularity, but
possibly turns them into no-ops (?).
Maybe the needed functionality of accumulating small discards into one
big discard covering the device's own granularity (similar to SSD
block erasure) should be done at that driver level.

What do you think?

Florian

Martin K. Petersen Aug. 2, 2016, 2:08 a.m. UTC | #7
>>>>> Florian-Ewald Müller <florian-ewald.mueller@profitbricks.com> writes:

Florian-Ewald,

> Now my experiments show that dm-cache, at least, neither complains
> about nor rejects discards smaller than its discard_granularity, but
> possibly turns them into no-ops (?).

Correct. Anything smaller than (an aligned) multiple of the discard
granularity will effectively be ignored.

In practice this means that your device should allocate things in
aligned units of the underlying device's discard granularity.
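
In other words, only the granularity-aligned middle of a request can
take effect. A minimal sketch with a hypothetical helper (byte-based;
not kernel code):

	/*
	 * Which part of [start, end) can a device with the given
	 * discard_granularity (in bytes) actually discard?  The
	 * misaligned head and tail are silently dropped.
	 */
	static bool effective_discard(u64 start, u64 end, u64 gran,
				      u64 *eff_start, u64 *eff_end)
	{
		u64 a_start = roundup(start, gran);	/* round up   */
		u64 a_end   = rounddown(end, gran);	/* round down */

		if (a_start >= a_end)
			return false;	/* whole request is a no-op */
		*eff_start = a_start;
		*eff_end   = a_end;
		return true;
	}

With gran = 256M, a 64K discard never yields an aligned middle, which
matches the no-op behaviour observed above.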

> Maybe the needed functionality of accumulating small discards into one
> big discard covering the device's own granularity (similar to SSD
> block erasure) should be done at that driver level.

Do you allocate blocks in a predictable pattern between your nodes?

For MD RAID0, for instance, we issue many small discard requests. But
for I/Os that are bigger than the stripe width we'll wrap around and do
merging so that for instance blocks 0, n, 2*n, 3*n, etc. become part of
the same discard request sent to the device.
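
A rough sketch of that mapping (generic RAID0-style chunk math, not the
md code itself):

	/*
	 * Chunk c of the array lives on device c % ndev at per-device
	 * offset c / ndev (in chunk-size units).  So for a discard
	 * spanning whole stripes, chunks 0, n, 2*n, ... all land on
	 * device 0 at offsets 0, 1, 2, ... and can be merged into one
	 * contiguous discard request for that device.
	 */
	unsigned int dev    = chunk % ndev;
	unsigned int offset = chunk / ndev;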

If you want discards smaller than the underlying granularity to have an
effect then I'm afraid the burden is on you to maintain a bitmap of
each granularity-sized region, and then issue a deferred discard once
all blocks inside that region have been discarded by the application or
filesystem above.
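
A minimal sketch of such a bitmap, assuming 64K blocks inside a 256M
granularity region (all names hypothetical; making the state persistent
is the hard part and is omitted here):

	#define BLK_SZ		(64ULL << 10)		/* 64K allocation unit */
	#define REGION_SZ	(256ULL << 20)		/* device granularity  */
	#define BLKS_PER_REGION	(REGION_SZ / BLK_SZ)	/* 4096 blocks */

	struct discard_region {
		DECLARE_BITMAP(bits, BLKS_PER_REGION);
		unsigned int nr_discarded;
	};

	/*
	 * Record one 64K discard; returns true once the whole region
	 * has been discarded and a single full-granularity discard can
	 * be sent to the underlying device.
	 */
	static bool region_mark_discarded(struct discard_region *r,
					  unsigned int blk)
	{
		if (!test_and_set_bit(blk, r->bits))
			r->nr_discarded++;
		return r->nr_discarded == BLKS_PER_REGION;
	}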

If you want to pursue partial block tracking it would be good to come up
with a common block layer infrastructure for it. dm-thin could benefit
from it as well...
Florian-Ewald Müller Aug. 2, 2016, 11:31 a.m. UTC | #8
Hi Martin,

I totally agree that a common block layer infrastructure to handle
such discard misfit cases would be better.
But, for now, I do not have a good idea of how the block layer could
aggregate small discard chunks (< discard_granularity) and later issue
only one big discard (== discard_granularity) to the underlying block
device in a generic and persistent fashion.

To me, the current handling of discards by the block layer
[blk_stack_limits() + blk_bio_discard_split()] seems inconsistent
with the handling of normal (rd/wr) IO.
It makes life harder for block driver developers, as they cannot rely
on blk_queue_split() doing its job on discard bios.

Regards,
Florian


Patch

From: Florian-Ewald Mueller <florian-ewald.mueller@profitbricks.com>
Subject: [PATCH] blk_stack_limits()
blk_stack_limits() can set discard_granularity bigger than max_discard_sectors (in bytes).
This makes blk_bio_discard_split() non-functional and can lead to data corruption.

Signed-off-by: Florian-Ewald Mueller <florian-ewald.mueller@profitbricks.com>

--- linux-4.7-rc7/block/blk-settings.c.orig	2016-07-21 19:39:44.831351396 +0200
+++ linux-4.7-rc7/block/blk-settings.c	2016-07-21 19:40:04.582968496 +0200
@@ -626,8 +626,10 @@  int blk_stack_limits(struct queue_limits
 							 b->max_hw_discard_sectors);
 		t->discard_granularity = max(t->discard_granularity,
 					     b->discard_granularity);
+		if (t->discard_granularity > (t->max_discard_sectors << 9))
+			t->discard_granularity = t->max_discard_sectors << 9;
 		t->discard_alignment = lcm_not_zero(t->discard_alignment, alignment) %
-			t->discard_granularity;
+			(t->discard_granularity ? : 1);
 	}
 
 	return ret;
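
With the values from this thread, the first hunk effectively computes
t->discard_granularity = min(268435456, 128 << 9) = 65536, restoring
the invariant that the granularity fits inside one maximal discard.
The second hunk guards the % against a zero granularity, which the new
clamp can produce when max_discard_sectors is 0.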