Message ID | CAHJVUehBe1zKvp==CUs0YkmRt5sT8qOES6bPnaeNKKsfuZ3Zrw@mail.gmail.com
---|---
State | New, archived
>>>>> "Florian-Ewald" == Florian-Ewald Müller <florian-ewald.mueller@profitbricks.com> writes:
Florian-Ewald,
Florian-Ewald> blk_stack_limits() can set discard_granularity bigger
Florian-Ewald> than max_discard_sectors (in bytes).
discard_granularity must be smaller than max_discard_sectors. The
question is which storage device reported these values?
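To see why the granularity must not exceed the max: the discard split path rounds max_discard_sectors down to a multiple of the granularity, so a granularity larger than the max rounds the usable maximum down to zero and no discard can ever be split out. Below is a minimal userspace sketch assuming the 4.7-era rounding rule in blk_bio_discard_split(); it mirrors the idea, it is not the kernel code.

#include <stdio.h>

/* Assumed rounding rule (mirroring 4.7-era blk_bio_discard_split()):
 * max_discard_sectors is rounded down to a multiple of the granularity,
 * both expressed in 512-byte sectors. */
static unsigned int usable_discard_sectors(unsigned int max_discard_sectors,
					   unsigned int granularity_bytes)
{
	unsigned int granularity = granularity_bytes >> 9;

	if (granularity == 0)
		granularity = 1;
	return max_discard_sectors - max_discard_sectors % granularity;
}

int main(void)
{
	/* Consistent limits: granularity 64K, max 128 sectors (64K). */
	printf("%u\n", usable_discard_sectors(128, 65536));		/* prints 128 */
	/* Inconsistent limits: granularity 256M > max 64K. */
	printf("%u\n", usable_discard_sectors(128, 268435456));	/* prints 0 */
	return 0;
}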
Hi Martin,

My setup is a newly developed distributed block device on top of dm-cache. The dm-cache values are bigger than those needed by my (newly developed) block device.

Regards,
Florian
>>>>> "Florian-Ewald" == Florian-Ewald Müller <florian-ewald.mueller@profitbricks.com> writes:
Florian-Ewald,
Florian-Ewald> My setup is a new developed distributed block
Florian-Ewald> device on top of dm-cache. The dm-cache values are
Florian-Ewald> bigger than those needed by my (new developed) block
Florian-Ewald> device.
What are the actual values for granularity and max? For your device and
dm-cache respectively.
Hi Martin,

My (newly developed) distributed block device uses a block size of 64K. Because it distributes data across different underlying slave (dm-cache) block devices, it has:
- max_discard_sectors: 128 (64K)
- discard_granularity: 65536 (64K)

The discard sizes of the (underlying) dm-cache devices are:
- max_discard_sectors: 536870912 (256G)
- discard_granularity: 268435456 (256M)

Regards,
Florian
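Plugging these numbers into the stacking rules visible in the patch at the end of this thread (max() for the granularity, and, assuming the usual min_not_zero() for max_discard_sectors) shows how the combined limits become inconsistent. A small, illustrative userspace sketch, not kernel code:

#include <stdio.h>

/* Simplified stand-ins for the kernel helpers used by blk_stack_limits(). */
#define min_not_zero(a, b) \
	((a) == 0 ? (b) : ((b) == 0 ? (a) : ((a) < (b) ? (a) : (b))))
#define max_of(a, b) ((a) > (b) ? (a) : (b))

int main(void)
{
	/* t = top (distributed) device, b = bottom (dm-cache) device. */
	unsigned long long t_max_sectors = 128, b_max_sectors = 536870912ULL;
	unsigned long long t_gran = 65536, b_gran = 268435456ULL;

	unsigned long long max_sectors = min_not_zero(t_max_sectors, b_max_sectors);
	unsigned long long gran = max_of(t_gran, b_gran);

	printf("stacked max_discard_sectors: %llu (%llu bytes)\n",
	       max_sectors, max_sectors << 9);		/* 128 (65536 bytes) */
	printf("stacked discard_granularity: %llu\n", gran);	/* 268435456 */
	/* The stacked granularity (256M) now exceeds max_discard_sectors
	 * expressed in bytes (64K). */
	return 0;
}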
>>>>> "Florian-Ewald" == Florian-Ewald Müller <florian-ewald.mueller@profitbricks.com> writes: Florian-Ewald, > My (new developed) distributed block device uses a block size of 64K. > So, because of the distribution functionality to different underlying > slave (dm-cache) block devices, it has: > - max_discard_sectors: 128 (64K) > - discard_granularity: 65536 (64K) > The discard sizes of the (underlying) dm-cache devices are: > - max_discard_sectors: 536870912 (256G) > - discard_granularity: 268435456 (256M) Why do you set max_discard_sectors for your device to 128? To my knowledge dm-cache does not do partial allocation. And consequently any discard less than an aligned 256MB will be a nop anyway.
Hi Martin,

I'm trying to solve this "squaring the circle" problem:
- My distributed block device "distributes" the data in the configured block size (e.g. 64K) across many underlying slave block devices (here: dm-cache).
- My block device evaluates each discard so it can logically discard the specified block (it never has more than 64K in a row) on one specific (logical) slave device.
- That's why I have to consider splitting the discard myself (if the block layer doesn't do it), so that each 64K chunk can be handled on a different (logical) slave device (a rough sketch of such splitting is shown below).
- On the underlying block device (e.g. dm-cache) the data is also placed in 64K chunks (again, no more than that in a row).
- That's why ignoring this size and passing the original big discard straight through would probably delete unrelated data.

My experiments show that dm-cache at least neither complains about nor rejects discards smaller than its discard_granularity, but possibly turns them into a noop (?).

Maybe the needed functionality of accumulating small discards into a big one covering the device's own granularity (similar to SSD block erasure) should be done at that driver level.

What do you think?

Florian
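The per-chunk splitting mentioned above could in principle be done inside the driver itself. A rough, hypothetical sketch of that idea against a 4.7-era kernel; my_bio_set and my_map_and_submit() are made-up driver-private names, not existing APIs:

/* Hypothetical driver-side splitting of a discard bio into 64K chunks.
 * my_bio_set and my_map_and_submit() are assumed driver-private pieces. */
static void my_split_discard(struct bio *bio, struct bio_set *my_bio_set)
{
	const unsigned int chunk_sectors = 128;	/* 64K in 512-byte sectors */

	while (bio_sectors(bio) > chunk_sectors) {
		struct bio *split = bio_split(bio, chunk_sectors,
					      GFP_NOIO, my_bio_set);

		bio_chain(split, bio);		/* completion goes via the parent */
		my_map_and_submit(split);	/* route this 64K chunk to its slave */
	}
	my_map_and_submit(bio);			/* remainder (at most 64K) */
}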
>>>>> Florian-Ewald Müller <florian-ewald.mueller@profitbricks.com> writes:

Florian-Ewald,

> My experiments show that dm-cache at least neither complains about nor
> rejects discards smaller than its discard_granularity, but possibly
> turns them into a noop (?).

Correct. Anything smaller than (an aligned) multiple of the discard granularity will effectively be ignored.

In practice this means that your device should allocate things in aligned units of the underlying device's discard granularity.

> Maybe the needed functionality of accumulating small discards into a
> big one covering the device's own granularity (similar to SSD block
> erasure) should be done at that driver level.

Do you allocate blocks in a predictable pattern between your nodes?

For MD RAID0, for instance, we issue many small discard requests. But for I/Os that are bigger than the stripe width we'll wrap around and do merging, so that for instance blocks 0, n, 2*n, 3*n, etc. become part of the same discard request sent to the device.

If you want discards smaller than the underlying granularity to have an effect, then I'm afraid the burden is on you to maintain a bitmap of each granularity-sized region and then issue a deferred discard when all blocks inside that region have been discarded by the application or filesystem above.

If you want to pursue partial block tracking it would be good to come up with a common block layer infrastructure for it. dm-thin could benefit from it as well...
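To make the bitmap suggestion concrete: the top-level driver would remember which of its 64K blocks inside each 256M region have already been discarded and only forward one big, aligned discard once the whole region is covered. A hypothetical sketch using the sizes from this thread; all names are made up, and persistence of the bitmap across restarts is ignored:

#include <stdbool.h>
#include <stdint.h>

#define REGION_BYTES	(256ULL * 1024 * 1024)	/* underlying discard_granularity */
#define BLOCK_BYTES	(64ULL * 1024)		/* top-level block size */
#define BLOCKS_PER_REGION (REGION_BYTES / BLOCK_BYTES)	/* 4096 */

struct region_state {
	uint64_t bits[BLOCKS_PER_REGION / 64];	/* one bit per 64K block */
	unsigned int discarded;			/* blocks discarded so far */
};

/* Record a 64K discard inside a region; returns true when the whole 256M
 * region is covered and a single aligned discard can be sent downstream. */
static bool region_mark_discarded(struct region_state *r, unsigned int block_idx)
{
	uint64_t mask = 1ULL << (block_idx % 64);

	if (!(r->bits[block_idx / 64] & mask)) {
		r->bits[block_idx / 64] |= mask;
		r->discarded++;
	}
	return r->discarded == BLOCKS_PER_REGION;
}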
Hi Martin,

I totally agree that a common block layer infrastructure to handle such discard-misfit cases would be better. But, for now, I do not have a good idea of how the block layer could aggregate discard chunks (< discard_granularity) and later issue only one big discard (== discard_granularity) to the underlying block device in a generic and persistent fashion.

To me, the current handling of discards by the block layer [blk_stack_limits() + blk_bio_discard_split()] seems inconsistent with the handling of normal (rd/wr) IO. It makes the life of block driver developers harder, as they cannot rely on blk_queue_split() doing its job on discard bios.

Regards,
Florian
From: Florian-Ewald Mueller <florian-ewald.mueller@profitbricks.com>
Subject: [PATCH] blk_stack_limits()

blk_stack_limits() can set discard_granularity bigger than
max_discard_sectors (in bytes). This makes blk_bio_discard_split()
non-functional and can lead to data corruption.

Signed-off-by: Florian-Ewald Mueller <florian-ewald.mueller@profitbricks.com>
---
--- linux-4.7-rc7/block/blk-settings.c.orig	2016-07-21 19:39:44.831351396 +0200
+++ linux-4.7-rc7/block/blk-settings.c	2016-07-21 19:40:04.582968496 +0200
@@ -626,8 +626,10 @@ int blk_stack_limits(struct queue_limits
 						      b->max_hw_discard_sectors);
 		t->discard_granularity = max(t->discard_granularity,
 					     b->discard_granularity);
+		if (t->discard_granularity > (t->max_discard_sectors << 9))
+			t->discard_granularity = t->max_discard_sectors << 9;
 		t->discard_alignment = lcm_not_zero(t->discard_alignment, alignment) %
-			t->discard_granularity;
+			(t->discard_granularity ? : 1);
 	}
 
 	return ret;
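For illustration only (a userspace back-of-the-envelope check, not kernel code): with the values reported earlier in the thread, the new clamp brings the stacked granularity back under max_discard_sectors in bytes; the `?: 1` in the second hunk presumably guards the modulo against the granularity being clamped all the way to zero when max_discard_sectors is zero.

#include <stdio.h>

int main(void)
{
	/* Stacked values from earlier in the thread, before the clamp. */
	unsigned long long max_discard_sectors = 128;		/* 64K */
	unsigned long long discard_granularity = 268435456ULL;	/* 256M */

	/* The added clamp: granularity may not exceed max_discard_sectors
	 * expressed in bytes. */
	if (discard_granularity > (max_discard_sectors << 9))
		discard_granularity = max_discard_sectors << 9;

	printf("discard_granularity after clamp: %llu\n",
	       discard_granularity);	/* 65536 (64K) */
	return 0;
}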